Compare commits

1673 Commits

Author SHA1 Message Date
0fabc3ba44 CUDA aarch64 12.6 and 12.8 builds fix triton constraints (#165022)
CUDA aarch64 12.6 and 12.8 builds fix triton constraints (#165013)

Since we have introduced CUDA aarch64 builds for all cuda versions we need to remove this constraint.
This was missed by https://github.com/pytorch/pytorch/pull/162364

Proper constraint on triton should be:
```
Requires-Dist: triton==3.5.0; platform_system == "Linux"
```

not:
```
Requires-Dist: triton==3.5.0; platform_system == "Linux" and platform_machine == "x86_64"
```
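
To see why the extra `platform_machine` clause matters, the two markers can be modeled as plain predicates over an environment dictionary. This is only an illustrative sketch: the function names and the sample environments are made up here, and real installers evaluate markers with a full PEP 508 implementation rather than hand-written checks.

```python
# Toy model of the two environment markers; not how pip evaluates them.


def old_marker(env: dict) -> bool:
    """platform_system == "Linux" and platform_machine == "x86_64"."""
    return env["platform_system"] == "Linux" and env["platform_machine"] == "x86_64"


def new_marker(env: dict) -> bool:
    """platform_system == "Linux"."""
    return env["platform_system"] == "Linux"


# Hypothetical environments for an x86_64 and an aarch64 Linux host.
linux_x86 = {"platform_system": "Linux", "platform_machine": "x86_64"}
linux_arm = {"platform_system": "Linux", "platform_machine": "aarch64"}

for name, env in [("x86_64", linux_x86), ("aarch64", linux_arm)]:
    print(name, "old:", old_marker(env), "new:", new_marker(env))
```

Under the old marker the aarch64 host would not pull in triton at all, which is the gap this change closes.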

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165013
Approved by: https://github.com/Camyll, https://github.com/nWEIdia, https://github.com/tinglvv

(cherry picked from commit 81dbeb06f4b3eb6c56625ec25d377eb7c7c6c573)

Co-authored-by: atalman <atalman@fb.com>
2025-10-08 21:09:57 -04:00
26e023a973 [MPS] Update OS version in error message (#164949)
[MPS] Update OS version in error message (#164946)

Followup after https://github.com/pytorch/pytorch/pull/159912
Fixes https://github.com/pytorch/pytorch/issues/164943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164946
Approved by: https://github.com/Camyll

(cherry picked from commit 01f3a43462da594b65a6c9e8b46c132cd360cea9)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-10-08 11:11:48 -07:00
6f12be2770 CUDA 13.0 builds fix on Amazon Linux 2023 (#164893)
CUDA 13.0 builds fix on Amazon Linux 2023 (#164870)

During 2.9 rc testing I am seeing an issue on Amazon Linux 2023 with CUDA 13.0 builds

This is related to:
 https://github.com/pytorch/pytorch/issues/152756

Workflow: https://github.com/pytorch/test-infra/actions/runs/18324074610/job/52184079262

Error:
```
WARNING: There was an error checking the latest version of pip.
+ python3.11 .ci/pytorch/smoke_test/smoke_test.py --package torchonly
Traceback (most recent call last):
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 333, in _load_global_deps
    ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib64/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libcudart.so.13: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/pytorch/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 12, in <module>
    import torch
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 425, in <module>
    _load_global_deps()
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 383, in _load_global_deps
    _preload_cuda_deps(lib_folder, lib_name)
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 317, in _preload_cuda_deps
    raise ValueError(f"{lib_name} not found in the system path {sys.path}")
Traceback (most recent call last):
ValueError: libnvToolsExt.so.*[0-9] not found in the system path ['/pytorch/pytorch/.ci/pytorch/smoke_test', '/usr/lib64/python311.zip', '/usr/lib64/python3.11', '/usr/lib64/python3.11/lib-dynload', '/usr/local/lib64/python3.11/site-packages', '/usr/local/lib/python3.11/site-packages', '/usr/lib64/python3.11/site-packages', '/usr/lib/python3.11/site-packages']
  File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
    main()
  File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
    run_cmd_or_die(f"docker exec -t {container_name} /exec")
  File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
    raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t 7d9c5bd403cac9a9ee824d63a1d6f6057ecce89a7daa94a81617dbf8eff0ff2e /exec failed with exit code 1
```
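
A quick way to check whether a shared library such as the one named in the error can be loaded is to call `ctypes.CDLL` directly, which is the same call that appears in the traceback. The helper name below is hypothetical, and the sketch only reports loadability; it does not reproduce PyTorch's preload logic.

```python
import ctypes


def can_dlopen(lib_name: str) -> bool:
    """Return True if the dynamic loader can open lib_name, else False."""
    try:
        ctypes.CDLL(lib_name, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        return False
    return True


# libcudart.so.13 is the library from the error above; the result depends on the host.
print("libcudart.so.13 loadable:", can_dlopen("libcudart.so.13"))
```
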
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164870
Approved by: https://github.com/Camyll


(cherry picked from commit 483f4e0db91166128ad8922d86dc7222338d4ecc)

Co-authored-by: atalman <atalman@fb.com>
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-10-07 19:33:08 -07:00
42f0c2c970 update the baseline data for the operator benchmark (#164789)
update the baseline data for the operator benchmark (#162693)

According to the results of the last four operator benchmark runs, we found that five models achieved more than a 30% improvement compared to the baseline. Therefore, we will update the operator benchmark baseline data.
We use the average results from the four runs as the new baseline for the five models.

This PR also adds a pull request trigger for the operator benchmark workflow.

Benchmarking Framework | Benchmarking Module Name | Case Name | tag | run_backward | baseline old | r1 | r2 | r3 | r4 | avg | speedup
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
PyTorch | add | add_M1_N1_K1_cpu | short | FALSE | 3.9497 | 2.57 | 2.54 | 2.38 | 2.31 | 2.45 | 1.61
PyTorch | functional.hardtanh | functional.hardtanh_dims(512	512)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 67.118 | 50.02 | 49.80 | 46.78 | 48.94 | 48.88 | 1.37
PyTorch | relu6 | relu6_dims(512	512)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 68.739 | 51.17 | 51.19 | 48.07 | 50.42 | 50.21 | 1.37
PyTorch | relu6 | relu6_dims(256	1024)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 69.1875 | 51.97 | 52.77 | 50.00 | 51.24 | 51.50 | 1.34
PyTorch | functional.hardtanh | functional.hardtanh_dims(256	1024)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 67.436 | 50.98 | 51.69 | 49.06 | 49.87 | 50.40 | 1.34
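
The `avg` and `speedup` columns can be recomputed from the four runs, assuming `speedup` is the old baseline divided by the average of r1..r4. The case labels below are shortened for readability; the numbers are copied from the table.

```python
from statistics import mean

# (shortened case label, old baseline, [r1, r2, r3, r4]) from the table above.
rows = [
    ("add M1_N1_K1", 3.9497, [2.57, 2.54, 2.38, 2.31]),
    ("hardtanh 512x512", 67.118, [50.02, 49.80, 46.78, 48.94]),
    ("relu6 512x512", 68.739, [51.17, 51.19, 48.07, 50.42]),
    ("relu6 256x1024", 69.1875, [51.97, 52.77, 50.00, 51.24]),
    ("hardtanh 256x1024", 67.436, [50.98, 51.69, 49.06, 49.87]),
]


def speedup(baseline: float, runs: list) -> float:
    """Old baseline divided by the mean of the new runs."""
    return baseline / mean(runs)


speedups = {case: round(speedup(baseline, runs), 2) for case, baseline, runs in rows}
for case, value in speedups.items():
    print(f"{case}: {value:.2f}x")
```

Every case comes out above 1.30, which matches the "more than a 30% improvement" criterion stated above.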

@chuanqi129 @huydhn @desertfire @jainapurva

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162693
Approved by: https://github.com/huydhn

(cherry picked from commit f7ea4975abb0aeb0224894f0b54b1f8fd1fa70e3)

Co-authored-by: LifengWang <lifeng.a.wang@intel.com>
2025-10-07 07:10:51 -07:00
b015422da1 fix cpp extension distributed warning spew (#164785)
fix cpp extension distributed warning spew (#162764)

With the new change we only log the warning if we're running non distributed code or if we're in rank 0. Unit testing that certain messages get printed on certain ranks only feels kinda jank so test plan is below instead

Test plan

```python
# torchrun --nproc_per_node=2 demo_fix.py

import os
import logging

logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)

import torch
if 'RANK' in os.environ:
    torch.distributed.init_process_group('nccl')

from torch.utils.cpp_extension import _get_cuda_arch_flags
_get_cuda_arch_flags()

print(f"Rank {os.environ.get('RANK', '0')} done")
```

Logs showing how `TORCH_CUDA_ARCH_LIST` only shows up once if we explicitly set the logging level to `logging.DEBUG`. It also improves the debug message to explain what the actual behavior will be.

```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py

W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
[rank0]:V0911 18:30:18.921000 1316753 pytorch/torch/utils/cpp_extension.py:2444] TORCH_CUDA_ARCH_LIST is not set, using TORCH_CUDA_ARCH_LIST='10.0+PTX' for visible GPU architectures. Set os.environ['TORCH_CUDA_ARCH_LIST'] to override.
Rank 0 done
Rank 1 done
```

But if we just use the default and comment out `logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)`

Then we get

```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
Rank 0 done
Rank 1 done
(source) [marksaroufim@devgpu005]~%
```
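
The gating rule this change describes (log only when not running distributed, or on rank 0) can be sketched as follows. This is an assumption-laden sketch: it reads the `RANK` environment variable that the test script above also checks, and the helper name is hypothetical; the actual change may detect the rank differently.

```python
import os


def should_log_on_this_rank(env=None) -> bool:
    """Log when not running distributed (no RANK set) or when RANK is 0."""
    env = os.environ if env is None else env
    rank = env.get("RANK")
    return rank is None or int(rank) == 0


# Simulated environments: single process, rank 0 and rank 1.
for simulated in ({}, {"RANK": "0"}, {"RANK": "1"}):
    print(simulated, "->", should_log_on_this_rank(simulated))
```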

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162764
Approved by: https://github.com/ezyang, https://github.com/zou3519

(cherry picked from commit f7e83219619a05934a344ca699c33ee69d5a3642)

Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
2025-10-06 16:58:36 -07:00
d4c4307032 Fix docker build issue after 164575 (#164779)
Fix docker build issue after 164575 (#164774)

Looks like https://github.com/pytorch/pytorch/pull/164575 introduced an issue.
The command is wrong:
```
conda install -c "whl/nightly" -y python=3.11 conda=25.7.0
```
Should be just using default conda channel:
```
conda install  -y python=3.11 conda=25.7.0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164774
Approved by: https://github.com/Camyll

(cherry picked from commit c1f40d33c89b361a1edad17aa25cfff1ab4014fd)

Co-authored-by: atalman <atalman@fb.com>
2025-10-06 16:56:06 -04:00
3b57315b1b [ROCm] Increase binary build timeout to 5 hours (300 minutes) (#164770)
[ROCm] Increase binary build timeout to 5 hours (300 minutes) (#163776)

Despite narrowing down the [FBGEMM_GENAI build to gfx942](https://github.com/pytorch/pytorch/pull/162648), the nightly builds still timed out because they [didn't get enough time to finish the post-PyTorch-build steps](https://github.com/pytorch/pytorch/actions/runs/17969771026/job/51109432897).

This PR increases the timeout for ROCm builds for both [libtorch](https://github.com/pytorch/pytorch/actions/runs/17969771026) and [manywheel](https://github.com/pytorch/pytorch/actions/runs/17969771041), because both of those are close to the 4hr mark currently.

This PR is a more ROCm-targeted version of https://github.com/pytorch/pytorch/pull/162880 (which is for release/2.9 branch).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163776
Approved by: https://github.com/jeffdaily


(cherry picked from commit 0ec946a0522748332f42675a4d690ff32d773d42)

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-06 16:08:40 -04:00
c74f05797d Pin conda version for Docker builds (#164579)
Pin conda version for Docker builds (#164575)

Mitigates https://github.com/pytorch/pytorch/issues/164574
Remove the unused CUDA_CHANNEL var, which was used back when we installed pytorch via conda.

Please note: CUDA 13.0 failures are expected since the CI tries to build against prod and CUDA 13.0 is not available in prod yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164575
Approved by: https://github.com/malfet, https://github.com/Camyll

(cherry picked from commit e40fe634b1a7aa33e278b1404ee02dea12277080)

Co-authored-by: atalman <atalman@fb.com>
2025-10-03 11:44:46 -04:00
fd364580a9 [Cherry-Pick] Work Around exposing statically linked libstdc++ CXX11 ABI strong symbols (#163980) (#164508)
* Work Around exposing statically linked libstdc++ CXX11 ABI strong symbols (#163980)

Work Around for: https://github.com/pytorch/pytorch/issues/133437

Test plan:
1. Build whl in CI
2. Download
3. Run ``nm -D libtorch_cpu.so | grep "recursive_directory_iterator"``

Test with check_binary_symbols.py:

Success:
```
num_cxx11_symbols: 2326
num_pre_cxx11_symbols: 0
lib: /home/ec2-user/github/variant-repack/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
num_statically_linked_symbols (T): 0
```

Fail when using "W" instead of "T" as type calling ``cxx11_statically_linked_symbols = grep_symbols(
        lib, STATICALLY_LINKED_CXX11_ABI, symbol_type="W"
    )`` :
```
num_cxx11_symbols: 2326
num_pre_cxx11_symbols: 0
lib: /home/ec2-user/github/variant-repack/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
num_statically_linked_symbols (T): 20
Traceback (most recent call last):
  File "/home/ec2-user/github/variant-repack/test/pytorch/.ci/pytorch/smoke_test/check_binary_symbolsc.py", line 130, in <module>
    main()
  File "/home/ec2-user/github/variant-repack/test/pytorch/.ci/pytorch/smoke_test/check_binary_symbolsc.py", line 126, in main
    check_lib_statically_linked_libstdc_cxx_abi_symbols(libtorch_cpu_path)
  File "/home/ec2-user/github/variant-repack/test/pytorch/.ci/pytorch/smoke_test/check_binary_symbolsc.py", line 95, in check_lib_statically_linked_libstdc_cxx_abi_symbols
    raise RuntimeError(
RuntimeError: Found statically linked libstdc++ symbols (recursive_directory_iterator), but there shouldn't be any, see: ['std::filesystem::__cxx11::recursive_directory_iterator::recursion_pending() const', 'std::filesystem::__cxx11::recursive_directory_iterator::depth() const', 'std::filesystem::__cxx11::recursive_directory_iterator::options() const', 'std::filesystem::__cxx11::recursive_directory_iterator::operator*() const', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::operator bool() const', 'std::filesystem::__cxx11::recursive_directory_iterator::disable_recursion_pending()', 'std::filesystem::__cxx11::recursive_directory_iterator::pop(std::error_code&)', 'std::filesystem::__cxx11::recursive_directory_iterator::pop()', 'std::filesystem::__cxx11::recursive_directory_iterator::increment(std::error_code&)', 'std::filesystem::__cxx11::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::__cxx11::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::__cxx11::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::__cxx11::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::__cxx11::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::__cxx11::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::__cxx11::recursive_directory_iterator::operator=(std::filesystem::__cxx11::recursive_directory_iterator&&)', 'std::filesystem::__cxx11::recursive_directory_iterator::operator=(std::filesystem::__cxx11::recursive_directory_iterator const&)', 'std::filesystem::__cxx11::recursive_directory_iterator::operator++()', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, 
(__gnu_cxx::_Lock_policy)2>&&)', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr()', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>&&)', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr()']
```
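
The difference between the "T" and "W" symbol types in the check above can be illustrated with a small filter over `nm -D`-style output. This is a sketch only: the function name, the sample lines and the addresses are made up for illustration, and it is not the code in `check_binary_symbols.py`.

```python
import re


def grep_symbols_from_nm(nm_output: str, pattern: str, symbol_type: str = "T") -> list:
    """Return demangled symbol names of the given nm type that match pattern."""
    regex = re.compile(pattern)
    matches = []
    for line in nm_output.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) != 3:
            continue  # undefined symbols have no address column
        _address, sym_type, name = parts
        if sym_type == symbol_type and regex.search(name):
            matches.append(name)
    return matches


# Made-up sample in the shape of demangled `nm -D` output.
SAMPLE = """\
0000000000001000 T std::filesystem::__cxx11::recursive_directory_iterator::pop()
0000000000002000 W std::filesystem::__cxx11::recursive_directory_iterator::depth() const
                 U malloc
0000000000003000 T at::native::add(at::Tensor const&)
"""

strong = grep_symbols_from_nm(SAMPLE, "recursive_directory_iterator", symbol_type="T")
weak = grep_symbols_from_nm(SAMPLE, "recursive_directory_iterator", symbol_type="W")
print(len(strong), "strong,", len(weak), "weak")
```
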
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163980
Approved by: https://github.com/isuruf, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

* fix

---------

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-10-02 17:49:44 -04:00
2f6387e9a1 [CherryPick][2.9] Cherry pick request for Reapply "Make functionalization ViewMeta serializable with pickle" #163769 (#163873)
Reapply "Make functionalization `ViewMeta` serializable with pickle. (#143712)"  (#163769)

NOTE: This is a re-export of https://github.com/pytorch/pytorch/pull/161994 ; the changes between these two PRs are exclusively to the buck/build files

(Summary from #161994 )
Attempted rebase of https://github.com/pytorch/pytorch/pull/143712.

This reverts commit 6c713ccb5e0df227dd5b630057cbccd373cbe7d6.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames Lucaskabela

imported-using-ghimport

Test Plan: Imported from OSS

Differential Revision: D81524507

Pulled By: Lucaskabela

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163769
Approved by: https://github.com/dolpm


(cherry picked from commit 7d710403b003e44bf31d367673a05468e49df75d)

Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
2025-10-02 16:07:51 -04:00
017d857f5f fix pickling for BitwiseFn (#163861)
* fix pickling for BitwiseFn (#163571)

Summary:
ran into AttributeError: Can't get local object 'make_opaque_bitwise_fn.<locals>.BitwiseFn'

looks like it was fixed for UnaryFn but not BitwiseFn in https://github.com/pytorch/pytorch/pull/138395

Fixes #147841
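
The failure mode can be reproduced in isolation: pickle looks classes up by qualified name, and a class defined inside a function has `<locals>` in that name. The factory and class below are hypothetical stand-ins, not the PyTorch code.

```python
import pickle


def make_local_fn():
    """Hypothetical factory that defines its class inside the function body."""

    class LocalFn:
        def __call__(self, a, b):
            return a & b

    return LocalFn()


def can_pickle(obj) -> bool:
    """Return True if pickle.dumps succeeds for obj."""
    try:
        pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError):
        return False
    return True


fn = make_local_fn()
# The qualified name contains "<locals>", so pickle cannot look the class up by name.
print(type(fn).__qualname__)
print("picklable:", can_pickle(fn))
```

One common remedy, not necessarily the one used in this PR, is to define the class at module level so that pickle can import it by its qualified name.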

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163571
Approved by: https://github.com/jamesjwu

(cherry picked from commit cde5c9aebd7a2eda0c935de1ab7a40b6453c5813)

* Fix lintrunner with -a

---------

Co-authored-by: dolpm <34420038+dolpm@users.noreply.github.com>
Co-authored-by: Lucas Kabela <lucaskabela@meta.com>
2025-10-02 15:35:40 -04:00
d6e8411889 Make sure Windows CUDA 12.8 build follow same arches as Linux builds (#164477)
Make sure Windows CUDA 12.8 build follow same arches as Linux builds (#164470)

I believe ``set TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0`` is the one that's actually used. Hence remove 6.1 to align the support with Linux support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164470
Approved by: https://github.com/tinglvv, https://github.com/nWEIdia, https://github.com/Camyll

(cherry picked from commit 235b995ce18de632ab816940319fcd66b46039b8)

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-10-02 14:33:06 -04:00
10b501fde9 [Flex] Fix silent correctness w/ backpropping grads (#164366)
[Flex] Fix silent correctness w/ backpropping grads (#163677)

Fixes https://github.com/pytorch/pytorch/issues/162228

# Summary

The majority of our tests only compile flex-attention in isolation. This means that for fake tensor propagation the input primals and all captured buffers don't do any intermediate computation below autograd. As a result, they happen by chance to match the `requires_grad`-ness of the eager implementation, and this check passes. However, if score_mod is the result of some other intermediate fake tensor prop, then it is not guaranteed to have accurate `requires_grad`-ness, which was happening here.

TL;DR: this was a belt-and-suspenders check that was actually harmful, and we should just let the joint graph handle creating the correct joint graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163677
Approved by: https://github.com/ydwu4

(cherry picked from commit e2ce79e4cce5327b71fcf366fad1133030563285)

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-01 14:43:28 -07:00
31c72b8a96 [a2av] Separate in/out splits into two tensors (#164028)
[a2av] Separate in/out splits into two tensors (#163837)

Old signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor(a!) in_out_splits, str group_name)`
New signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name)`

i.e. split `in_out_splits` into IN tensor and OUT tensor so that we can define the TORCH_LIBRARY signature better.
Also to be in line with the 2D version.
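
To make the IN/OUT separation concrete, here is a pure-Python simulation of how the output side relates to every rank's input splits in an all-to-all-v exchange. It assumes the OUT tensor holds the received split sizes plus their exclusive prefix-sum offsets; the helper name and the list-based layout are illustrative only, and the real op works on device tensors.

```python
from itertools import accumulate


def out_splits_and_offsets(all_in_splits, rank):
    """Compute what `rank` receives, given every rank's input splits.

    all_in_splits[src][dst] is the number of elements src sends to dst.
    """
    out_splits = [in_splits[rank] for in_splits in all_in_splits]
    # Exclusive prefix sum: where each received chunk starts in the output.
    offsets = [0] + list(accumulate(out_splits))[:-1]
    return out_splits, offsets


# Two ranks: rank 0 sends 1 and 2 elements, rank 1 sends 3 and 4 elements.
all_in_splits = [[1, 2], [3, 4]]
for rank in range(2):
    print(rank, out_splits_and_offsets(all_in_splits, rank))
```

The input splits are read-only from each rank's point of view, while the output splits and offsets are written by the op, which is why only the OUT tensor carries the `(a!)` annotation in the new signature.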

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163837
Approved by: https://github.com/fduwjj
ghstack dependencies: #163886

(cherry picked from commit bbf8aa43efe755b9c310347b3780962fca85bf9c)

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-10-01 14:43:19 -07:00
1cd83de315 [Flex attention] Fix flex attention head broadcast (#164368)
[Flex attention] Fix flex attention head broadcast (#163426)

Fixes part of #163314

In particular bug: **Bug 1: H=None Broadcasting Produces Incorrect Results**

This fixes a shape bug when slicing BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Because they lose that shape, the kernel’s stride math reads wrong offsets even though the mask_mod remains "interpretable". The result is silent numerical mismatches compared to regular SDPA, especially with single-position decoding or H broadcasting.

The case where B=None, H=None works is accidental: with a singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1`, and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn’t move the pointer and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with wrong strides, which causes a silent error.
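
The "multiplied by 0" argument can be checked with plain offset arithmetic. The stride values below are arbitrary placeholders chosen for illustration, not the kernel's real strides.

```python
def flat_offset(indices, strides):
    """Linear offset of an element given per-dimension indices and strides."""
    return sum(i * s for i, s in zip(indices, strides))


# Hypothetical strides for a [B, H, Q_tiles] layout, and an arbitrary wrong set.
expected_strides = (8, 4, 1)
wrong_strides = (4, 1, 0)

# All indices zero: the wrong strides are multiplied by 0 and do no harm.
print(flat_offset((0, 0, 0), expected_strides), flat_offset((0, 0, 0), wrong_strides))

# Second head or second Q tile: the offsets now disagree.
print(flat_offset((0, 1, 0), expected_strides), flat_offset((0, 1, 0), wrong_strides))
print(flat_offset((0, 0, 1), expected_strides), flat_offset((0, 0, 1), wrong_strides))
```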

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163426
Approved by: https://github.com/drisspg

(cherry picked from commit 1a42656d6c43a9bb7eb90c511884ce451d29422f)

Co-authored-by: Isalia20 <irakli.salia854@gmail.com>
2025-10-01 13:48:10 -07:00
881c2ccae9 Update Gloo submodule (#164371)
Update Gloo submodule (#163112)

Which makes PyTorch buildable with gcc-15, tested by running the build inside `fedora:44` docker
```
docker run --rm -it fedora:44 bash -c "yum install -y g++ python3-devel git; git clone https://github.com/pytorch/pytorch; cd pytorch; git checkout 8f710acce8332979c9a7bf97e72666dfd35c43e6; python3 -mpip install -r requirements.txt; python3 setup.py bdist_wheel"
```

Fixes https://github.com/pytorch/pytorch/issues/156595
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163112
Approved by: https://github.com/huydhn

(cherry picked from commit 65845d72917fc27cd89a88b067e7c8f44bc0c987)

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-10-01 12:00:18 -07:00
764f65584a [MPS] Chunk fillBuffer into 4Gb slices (#164370)
[MPS] Chunk fillBuffer into 4Gb slices (#164108)

To avoid regression on MacOS 26, which one could observe by running the following script
```swift
import Metal

let bufferSize = 1<<32 + 4

guard let device = MTLCreateSystemDefaultDevice() else { fatalError("No Metal device found") }
guard let buffer = device.makeBuffer(length: bufferSize, options: .storageModeShared) else { fatalError("Failed to create buffer") }

guard let cmdQueue = device.makeCommandQueue() else { fatalError("Failed to create command queue") }
guard let cmdBuffer = cmdQueue.makeCommandBuffer() else { fatalError("Failed to create command buffer") }
guard let blitEncoder = cmdBuffer.makeBlitCommandEncoder() else { fatalError("Failed to create blit encoder") }

blitEncoder.fill(buffer: buffer, range: 0..<bufferSize, value: 0x42)
blitEncoder.endEncoding()

cmdBuffer.commit()
cmdBuffer.waitUntilCompleted()

let tailOffs = 8
let hostPtr = buffer.contents().bindMemory(to: UInt8.self, capacity: bufferSize)
let tail = Array(UnsafeBufferPointer(start: hostPtr + (bufferSize - tailOffs), count: tailOffs))

for (idx, val) in tail.enumerated() {
    print("Offs 0x\(String(bufferSize - tailOffs + idx, radix: 16)): 0x\(String(val, radix: 16))")
}
```
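
The chunking idea from the title can be sketched independently of Metal: split one large fill into consecutive ranges of bounded length. The helper name and the exact limit of `1 << 32` bytes are assumptions made for this sketch; the actual fix issues Metal fill commands per slice.

```python
def chunk_ranges(total_bytes: int, max_chunk: int = 1 << 32):
    """Yield (offset, length) pairs that cover total_bytes in slices of at most max_chunk."""
    offset = 0
    while offset < total_bytes:
        length = min(max_chunk, total_bytes - offset)
        yield offset, length
        offset += length


# Same buffer size as the Swift reproducer above: 4 GiB plus 4 bytes.
buffer_size = (1 << 32) + 4
for offset, length in chunk_ranges(buffer_size):
    print(f"fill offset=0x{offset:x} length={length}")
```

With the reproducer's size this yields one full slice followed by a 4-byte tail, so no single fill exceeds the limit.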

Test plan: run `test_indexing.py` on MacOS-26

Fixes https://github.com/pytorch/pytorch/issues/161265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164108
Approved by: https://github.com/Skylion007

(cherry picked from commit 6db1b9dd217501e0b3171d96335bed7b2bb53c36)

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2025-10-01 11:59:56 -07:00
3e8a062385 Update Microsoft C++ Redistributable to the latest version (#164369)
Update Microsoft C++ Redistributable to the latest version (#161430)

Update Microsoft C++ Redistributable link to the latest version as one of the libraries used by AMD currently has a dependency on that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161430
Approved by: https://github.com/malfet

(cherry picked from commit 1330c638bef7fac64a42935b5a46ee32637ddd4d)

Co-authored-by: Saman Khatir <saman.khatir@amd.com>
2025-10-01 11:57:53 -07:00
3abee625e1 Fix warn message (#164367)
Fix warn message (#163578)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163578
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/atalman, https://github.com/v0i0

(cherry picked from commit f3f67ff43a014b75b804d5ded0c7de3d8e0be65f)

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-01 11:57:16 -07:00
f227c883f9 [MPSHooks] Release pending command encoder (#164365)
[MPSHooks] Release pending command encoder (#164093)

Before returning a command buffer, as subsequent callers are very likely to allocate their own encoder, which results in the following runtime error
```
 tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]:1090: failed assertion `A command encoder is already encoding to this command buffer'
```

Added regression test to `test_mps_extension`

Please note, that `torch::mps::get_command_buffer()` should be called with dispatch_queue held, both before and after this change, but many implementations skip that

Fixes https://github.com/pytorch/pytorch/issues/163721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164093
Approved by: https://github.com/atalman, https://github.com/Skylion007

(cherry picked from commit 8f32adc90a7fee83583c9ba89dbdfabb317e0452)

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2025-10-01 11:56:42 -07:00
a5feacb14b [SDPA] [MPS] Fixes regression in 2.8.0 for scaled_dot_product_attention using mps (#164364)
[SDPA] [MPS] Fixes regression in 2.8.0 for scaled_dot_product_attention using mps (#163598)

Fixes #163597

- Updates fast SDPA implementations to take in query tensor stride info similar to key and value instead of assuming stride.
- Updated tests with additional transpose/permutation layouts. New tests catch the regression.

### Benchmarking with script found in [implementation PR](https://github.com/pytorch/pytorch/pull/152781#:~:text=19.8%25%20speed%20improvement-,Script%20to%20get%20perf%3A,-import%20torch%0Aimport)

Times are averaged over 100000 iterations. This change should not have any significant performance difference. Tested on an M3 Pro

### Vector Fast Path (q_len=1, k_len=256)

- Before: 0.160 ms
- After: 0.157 ms

### Vector 2-pass (q_len=1, k_len=4096)

- Before: 0.342 ms
- After: 0.339 ms

### Vector Fast Path (q_len=8, k_len=256)

- Before: 0.228 ms
- After: 0.231 ms

### Vector 2-pass (q_len=8, k_len=4096)

- Before: 0.432 ms
- After:  0.436 ms
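
To put the before/after timings in proportion, the relative change for each case can be computed directly from the numbers listed above. The labels are shortened; nothing else is assumed.

```python
# (before_ms, after_ms) copied from the benchmark results above.
timings_ms = {
    "fast path, q_len=1, k_len=256": (0.160, 0.157),
    "2-pass, q_len=1, k_len=4096": (0.342, 0.339),
    "fast path, q_len=8, k_len=256": (0.228, 0.231),
    "2-pass, q_len=8, k_len=4096": (0.432, 0.436),
}


def relative_change(before: float, after: float) -> float:
    """Signed fractional change from before to after."""
    return (after - before) / before


for case, (before, after) in timings_ms.items():
    print(f"{case}: {relative_change(before, after):+.2%}")
```

Every case moves by less than 2% in either direction, consistent with the statement that the change has no significant performance impact.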

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163598
Approved by: https://github.com/malfet

(cherry picked from commit 1c12d7416bc4f1cf0bc8a229e64169fc361b688e)

Co-authored-by: Vismai Khanderao <59114226+Vismai-Khanderao@users.noreply.github.com>
2025-10-01 11:37:14 -07:00
71282c8364 Update Sphinx theme (#164147) (#164254)
Fix links in the top nav bar: 71e55749be

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164147
Approved by: https://github.com/albanD

(cherry picked from commit e88cca069171ceb117dd1ceb73e8bf3e54aa83cf)
2025-10-01 09:59:45 -07:00
e70d9f5322 [vllm hash update] update the pinned vllm hash (#164190) (#164312)
* [vllm hash update] update the pinned vllm hash (#164190)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164190
Approved by: https://github.com/pytorchbot

* Cherry pick b7125b3c456d48445ab0b84fab28702577cd9557

Signed-off-by: Huy Do <huydhn@gmail.com>

---------

Signed-off-by: Huy Do <huydhn@gmail.com>
Co-authored-by: PyTorch UpdateBot <pytorchupdatebot@users.noreply.github.com>
2025-10-01 06:43:17 -07:00
005e3e8d78 Clean up obsoleted vLLM tests (#164282)
Clean up obsoleted vLLM tests (#163383)

They have been removed in https://github.com/vllm-project/vllm/pull/25117 and https://github.com/vllm-project/vllm/pull/22772, thus failing in trunk at the moment after the latest pin commit update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163383
Approved by: https://github.com/wdvr, https://github.com/seemethere, https://github.com/malfet

(cherry picked from commit a31acf32bd18e115df910002aef42baf7a9b4a33)

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-30 14:40:57 -07:00
72cf48ea43 [AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1 (#164236)
[AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1   (#163988)

See also #163972, which was intended to be this PR.

Triton (release/3.5.x) by default ships CUDA12.8 ptxas.
This PR tries to bundle a ptxas version for cuda13, so that it can help https://github.com/pytorch/pytorch/issues/163801 when users run on new devices like THOR and Spark.

Fixes https://github.com/pytorch/pytorch/issues/163801

Test Plan:

Check binary size increase against nightly or v2.9RC
Install the binary from into a working THOR and GB200/GH100 machine (reproduce the original issue first on THOR), then install the binary built from this PR and we expect the issue to be gone without any additional user setting. Testing on GB200 is to ensure no regression.
Reference: https://github.com/pytorch/pytorch/pull/119750 and 5c814e2527

Note: with this PR, the pytorch world's torch.compile is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the cuda13 ptxas binary.
However, as is, the triton world does not know about the existence of this new cuda13 ptxas. So if a user thinks there is already a pytorch/bin/ptxas and deletes the ptxas from triton, then c6ad34f7eb/python/triton/knobs.py (L216) would still complain that ptxas is not found (once removed, triton won't know this new one is available).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163988
Approved by: https://github.com/atalman

(cherry picked from commit 3b4ad4a17d69e2db495ecaf3bae8916282a4eb0d)

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-09-30 13:53:56 -04:00
a21a4bf11a [CI] Move libtorch-cpu-shared-with-deps-release-build to python 3.10 (#164182)
[CI] Move libtorch-cpu-shared-with-deps-release-build to python 3.10 (#162877)

Related to https://github.com/pytorch/pytorch/pull/162862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162877
Approved by: https://github.com/malfet

(cherry picked from commit c9e57d7e9f326e427fc4ae5c318fd017cd4b75a9)

Co-authored-by: atalman <atalman@fb.com>
2025-09-29 15:52:16 -07:00
21fec65781 Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests (#164172)
Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests (#163956)

Workaround for https://github.com/pytorch/pytorch/issues/163658

Looks like the workflow passes on 12.8 builds that use linux.g4dn.4xlarge.nvidia.gpu but it's failing on 12.6 builds that use linux.4xlarge.nvidia.gpu: https://github.com/pytorch/pytorch/actions/runs/17953843505/job/51080623612#step:13:470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163956
Approved by: https://github.com/malfet


(cherry picked from commit 349c960970f4e29eff0d37a9b3c1ca5ed86a121a)

Co-authored-by: atalman <atalman@fb.com>
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
2025-09-29 16:14:37 -04:00
22d46b50ec [CUDA] revert PR 130472 (#163379)
[CUDA] revert PR 130472 (#162950)

This change may also resolve https://github.com/pytorch/pytorch/issues/161789, though verification is still needed.

PR #130472 introduced the problem of freeing the same address without cleaning its metadata. Per the discussion below, it is reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162950
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed

(cherry picked from commit 4a160dae3cabaff358a6bb2490d0160dd1bf2cdf)

Co-authored-by: thenumberouscode <dream20151224@163.com>
2025-09-29 16:05:26 -04:00
d1b63e2b4a Skip test_conv3d_cudnn_broken on ROCM (#164163)
Skip test_conv3d_cudnn_broken on ROCM (#164138)

Followup after https://github.com/pytorch/pytorch/pull/163903  Fixes https://github.com/pytorch/pytorch/issues/164137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164138
Approved by: https://github.com/Camyll

(cherry picked from commit 95be302889b8683b7ec7793a69ffa8891b6b5af8)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-29 11:41:18 -07:00
20100b7210 [c10d] P2P tensors must be dense (#163981)
[c10d] P2P tensors must be dense (#163719)

Fixes #161324
by adding `is_non_overlapping_and_dense` check.
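
To illustrate what kind of layout such a check rejects, here is a simplified pure-Python sketch of a non-overlapping-and-dense test over sizes and strides. It is an illustration under assumptions, not the ATen implementation; the helper name `is_dense_layout` is hypothetical.

```python
def is_dense_layout(sizes, strides):
    """Return True if (sizes, strides) covers memory with no gaps and no overlap.

    Simplified sketch: some permutation of the dimensions must be contiguous.
    """
    # A tensor with no elements is treated as trivially dense.
    if any(sz == 0 for sz in sizes):
        return True
    # Size-1 dimensions may carry any stride, so they are ignored.
    dims = sorted((st, sz) for sz, st in zip(sizes, strides) if sz != 1)
    expected = 1
    for st, sz in dims:
        if st != expected:
            return False  # a gap (stride too large) or an overlap (stride too small)
        expected *= sz
    return True


print(is_dense_layout((4, 3), (3, 1)))  # contiguous 4x3
print(is_dense_layout((4, 3), (1, 4)))  # transposed view of a contiguous 3x4
print(is_dense_layout((4, 2), (3, 1)))  # first two columns of a 4x3: gaps
print(is_dense_layout((4, 3), (0, 1)))  # expanded rows: overlap
```

In this sketch, the first two layouts pass and the last two would be rejected.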

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163719
Approved by: https://github.com/ngimel

(cherry picked from commit 11a231ef52841a549913b7a6d423cc9004b6b58b)

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-09-29 11:27:24 -07:00
a2c77043ee Add operator benchmarking run to CI nightly (#164151)
Add operator benchmarking run to CI nightly (#162530)

This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus more on different cuda hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.

**Benchmark Configuration and Coverage:**

* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward run. The configs with tag "long" for the above mentioned files are being run in CI.
* The CI benchmarking runs on several hardware types: H100, A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162530
Approved by: https://github.com/huydhn


(cherry picked from commit 54b38f3b46c33a1cc4e8f7894619358afcbd7c89)

Co-authored-by: jainapurva <apurvajain.kota@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-29 11:21:19 -07:00
b64fc8e41e Fix operator benchmark issue#162708 (#164140)
Fix operator benchmark issue#162708 (#162744)

This PR skips memory metric calculation for ops which don't take tensor input, fixing the operator_benchmark bug

Fixes https://github.com/pytorch/pytorch/issues/162708

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162744
Approved by: https://github.com/huydhn

(cherry picked from commit 5f66902ecfb9cb4f7b9c50cb86307217cec1dbe9)

Co-authored-by: jainapurva <apurvajain.kota@gmail.com>
2025-09-29 09:34:26 -07:00
709f4f62a0 [cuDNN][Convolution] Disable cuDNN for 3D convolutions with kernel size != 1 for cuDNN 9.8+ (#164027)
[cuDNN][Convolution] Disable cuDNN for 3D convolutions with kernel size != 1 for cuDNN 9.8+ (#163581)

To work around #163539

Still confirming whether 9.10 is affected. The original test states that the convolution is "large," but note that the input size does not appear to require 64-bit indexing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163581
Approved by: https://github.com/ngimel, https://github.com/malfet


(cherry picked from commit e2817ac20426356278502db3b1614ea87cb7cff7)

Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-29 09:07:14 -07:00
11f776c8ee [cuDNN][SDPA] Disable dropout for cuDNN SDPA on 9.11 - 9.13 (#164026)
[cuDNN][SDPA] Disable dropout for cuDNN SDPA on 9.11 - 9.13 (#163903)

cuDNN introduced some broken heuristics for these cases so we need to disable dropout to avoid unexpected crashes due to heuristics refusing to proceed
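
One plausible reading of this gate, as a minimal sketch: cuDNN SDPA is not considered when dropout is requested and the cuDNN version falls in the affected range. The function name `cudnn_sdpa_eligible` and the `(major, minor)` tuple encoding are assumptions made for illustration; the actual check lives in PyTorch's backend selection code and may be expressed differently.

```python
def cudnn_sdpa_eligible(dropout_p, cudnn_version):
    """Hypothetical gate for the cuDNN SDPA path.

    cudnn_version is a (major, minor) tuple such as (9, 12).
    """
    # Versions 9.11 through 9.13 are the ones named in the commit title.
    affected = (9, 11) <= tuple(cudnn_version[:2]) <= (9, 13)
    return not (dropout_p > 0.0 and affected)


print(cudnn_sdpa_eligible(0.1, (9, 12)))  # dropout on an affected version
print(cudnn_sdpa_eligible(0.0, (9, 12)))  # no dropout, same version
print(cudnn_sdpa_eligible(0.1, (9, 10)))  # dropout on an older version
```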

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163903
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/atalman

(cherry picked from commit ed3085814a870f7a07b7f9c696999a47d4f85376)

Co-authored-by: Eddie Yan <eddiey@nvidia.com>
2025-09-29 09:06:23 -07:00
45e257f046 [cuDNN][conv][64-bit] Disable cuDNN for 64-bit depthwise convs again (#164023)
[cuDNN][conv][64-bit] Disable cuDNN for 64-bit depthwise convs again (#163171)

test is breaking, will check if there's an older version that we can enable on to avoid completely dropping support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163171
Approved by: https://github.com/ngimel, https://github.com/malfet

(cherry picked from commit 0ea10f9912a9ec7c6d606bc71e3ec91f20372212)

Co-authored-by: eqy <eddiey@nvidia.com>
2025-09-29 09:03:36 -07:00
37e2626639 Update the operator benchmarking, to benchmark using torch.compile (#164101)
Update the operator benchmarking, to benchmark using torch.compile (#161394)

This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to the existing Eager and JIT modes. It also adds peak memory measurement (fwd/bwd pass), improves the JSON output format used by the dashboard for reporting, and introduces some more CLI options. The new CLI flags introduced are:

- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name

Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
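
The flag behavior described above can be sketched with `argparse`. This is a minimal illustration, assuming a plain `argparse` parser; the benchmark suite's real parser may be structured differently, and `build_parser` is a hypothetical name. Only the flag names and the default file name come from the description above.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="operator benchmark (sketch)")
    # --use-jit and --use-compile cannot be combined.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--use-jit", action="store_true")
    mode.add_argument("--use-compile", action="store_true")
    # Optional custom name for the benchmark in the output.
    parser.add_argument("--benchmark-name", default=None)
    # Predictable default output file for the dashboard.
    parser.add_argument("--output-json-for-dashboard", default="benchmark-results.json")
    return parser


args = build_parser().parse_args(["--use-compile"])
print(args.use_compile, args.use_jit, args.output_json_for_dashboard)
```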
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser

(cherry picked from commit af60398c3a057506363e028bf328843a755b4f24)

Co-authored-by: jainapurva <apurvajain.kota@gmail.com>
2025-09-29 07:49:05 -07:00
d7a703ea92 [SymmMem] Barrier on team instead of world (#163376)
[SymmMem] Barrier on team instead of world (#163298)

As titled. Avoiding a potential hang when running dispatch and combine in subgroups.

The rest is just a rearrangement of the tests to create a sub-group test class (no substantial change).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163298
Approved by: https://github.com/fegin

(cherry picked from commit f8fb437197033c33ecc435cd5e1e6a5b2bc5bf69)

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-09-26 16:41:18 -07:00
daa3d04325 [SymmMem] Fix memory allocation hold-up (#163375)
[SymmMem] Fix memory allocation hold-up (#162680)

Problem:
Without MemPool it looks like nvshmem backend never deallocates memory.

Cause:
Handles in `symm_mems_` (a map) keep references to memory allocations.

Solution:
- Remove reference to allocation from handles -- the reference is never used anyway.
- Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership.
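
The cause can be illustrated with a small Python analogy (the actual fix is in C++ and uses `unique_ptr`; the classes below are hypothetical stand-ins). A handle stored in a long-lived map that still references the allocation keeps it alive after the owner lets go of it.

```python
import weakref


class Allocation:
    """Stand-in for a symmetric-memory allocation."""


class Handle:
    """Stand-in for a handle that may or may not reference its allocation."""

    def __init__(self, alloc=None):
        self.alloc = alloc


def allocation_survives_owner_release(handle_keeps_reference):
    alloc = Allocation()
    probe = weakref.ref(alloc)  # observes the allocation without owning it
    # A long-lived map of handles, like the one described above.
    handles = {"h": Handle(alloc if handle_keeps_reference else None)}
    del alloc  # the owner releases the allocation
    return probe() is not None


print(allocation_survives_owner_release(True))
print(allocation_survives_owner_release(False))
```

Under CPython's reference counting, the allocation outlives its owner only in the first case, which mirrors the hold-up described above.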

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162680
Approved by: https://github.com/ezyang
ghstack dependencies: #163298

(cherry picked from commit 7130b174e07dbc1a708934b18dede3d88e8f779f)

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-09-26 16:35:56 -07:00
999304396f [dist] handle discontiguous allgather/reducescatter inputs (#163987)
[dist] handle discontiguous allgather/reducescatter inputs (#163712)

Fixes #163483

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163712
Approved by: https://github.com/ezyang, https://github.com/kwen2501

(cherry picked from commit 71eec6a0bf69f712f4b9279fdc8d1459be0426e6)

Co-authored-by: Natalia Gimelshein <ngimel@meta.com>
2025-09-26 16:21:08 -07:00
5340e741df [Reland][163423] Promote @requires_nvshmem instead of enable_triton (#163916)
[Reland][163423] Promote `@requires_nvshmem` instead of `enable_triton` (#163549)

#163423 was approved but reverted due to a revert of base.
Relanding without base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163549
Approved by: https://github.com/wdvr


(cherry picked from commit 6e6c899347db952f6a691feb4e8610fe9cca0279)

Co-authored-by: Ke Wen <kw2501@fb.com>
Co-authored-by: Wouter Devriendt <wouterdevriendt@meta.com>
2025-09-26 15:58:30 -07:00
7cadf8ac04 [Inductor][Intel GPU] Save threads_per_warp from triton compiled kernel for launching kernel correctly in cpp wrapper. (#163388)
[Inductor][Intel GPU] Save `threads_per_warp` from triton compiled kernel for launching kernel correctly in cpp wrapper. (#163315)

On the Inductor XPU backend, `threads_per_warp` is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the Cpp wrapper can launch the kernel with the correct configuration.
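
A small arithmetic sketch of why the value matters, assuming the launcher derives the number of threads per block as `num_warps * threads_per_warp` (the function name and the `num_warps` value are illustrative, not taken from the PR):

```python
def threads_per_block(num_warps, threads_per_warp=32):
    """Threads launched per block for a kernel compiled with num_warps warps."""
    return num_warps * threads_per_warp


# Same kernel, same num_warps, different warp width.
print(threads_per_block(8))      # assuming the common width of 32
print(threads_per_block(8, 16))  # with the width of 16 mentioned above
```

If the wrapper assumed 32 for a kernel compiled with 16, it would launch twice as many threads as the kernel expects.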

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire

(cherry picked from commit 9f8a311af09586ac4026d6a56fc7c4ac7acc62ed)

Co-authored-by: xinan.lin <xinan.lin@intel.com>
2025-09-26 14:42:09 -04:00
f9e495fe8e Move inductor jobs 3.9->3.10 (#163954)
Move inductor jobs 3.9->3.10 (#162323)

Related to: https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162323
Approved by: https://github.com/huydhn, https://github.com/Skylion007


(cherry picked from commit e8eeb060348f250975124abb957b1d7d9c4af9a0)

Co-authored-by: atalman <atalman@fb.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-26 12:37:50 -04:00
57dc68844d [CI] Fix test_triton_wait_until hang (#163914)
[CI] Fix test_triton_wait_until hang (#163886)

I don't know why `nvshmem_barrier_all_kernel`  leads the test to hang. Will investigate.
But since it is an unnecessary call here, I am removing it to unblock other PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163886
Approved by: https://github.com/fegin

(cherry picked from commit 96275dbf88372bb32a123c4ea918498128fbecb9)

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-09-26 12:16:00 -04:00
63da9d2730 [Release 2.9] Update torch-xpu-ops commit pin (#163622)
Update commit pin to 789f59
2025-09-26 09:46:02 -04:00
824d59fbf6 [CI] Install libuv for Win testing (#163907)
[CI] Install libuv for Win testing (#163797)

The current working theory for why f0078941cf caused a regression is that Windows CI could no longer be built with distributed, as it could not find libuv.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163797
Approved by: https://github.com/wdvr

(cherry picked from commit cc660d38ac533b92f3ad4cb1105f7a16f74b9f09)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-26 00:03:22 -07:00
fc8bf12b38 Fix cpp build (#163887)
Fix cpp build (#162774)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162774
Approved by: https://github.com/malfet, https://github.com/atalman

(cherry picked from commit b61bdc7cc4c841bf7574bc993f3fd445682f0997)

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-09-25 14:50:59 -07:00
49dab18ecf [CD] Add statically linked windows libraries to exclude list (#163862)
[CD] Add statically linked windows libraries to exclude list (#163768)

Fixes: https://github.com/pytorch/pytorch/issues/159514

Seeing following in the Wheel build logs:
```
Linking CXX static library lib\kineto.lib
Linking CXX static library lib\dnnl.lib
....
```

These files are around 800MB uncompressed and 109MB compressed, so excluding them provides a ~50% size reduction for Windows CPU builds.

Test Plan: Build Pytorch Windows binary. Build vision, audio and torchcodec with this binary. Smoke test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163768
Approved by: https://github.com/albanD, https://github.com/malfet

(cherry picked from commit 98c4e35f14601909c113b4fd2857b6f0fb525316)

Co-authored-by: atalman <atalman@fb.com>
2025-09-25 14:46:56 -07:00
0154ca1d3d [BE] Update Python min version to 3.10 (#162310) (#163885)
* [BE] Update Python min version to 3.10 (#162310)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162310
Approved by: https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi

* comment out executorch

---------

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2025-09-25 14:44:48 -07:00
132d9fac3b Revert "[BE] Update Python min version to 3.10 (#162310)" (#163882)
Revert "[BE] Update Python min version to 3.10 (#162310) (#163802)"

This reverts commit 7d024a6e299eee2830e9fbdae1913e432160bb23.
2025-09-25 10:54:12 -07:00
87c5d4a858 [cherrypick] [CI] Move Windows build/tests to Python-3.10 #162862 (#163800)
[CI] Move Windows build/tests to Python-3.10 (#162862)

What was supposed to be a very simple change ended up being quite involved, as the current Windows CI framework is quite inflexible, i.e. it takes a lot of arguments but later ignores them, namely:
 - `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
 - With this change, `setup-win` action will create an environment called `py_tmp` with a specific python version + intel-openmp (that is a hard runtime requirement, but for some reason is neither packaged into the wheel nor marked as such)
 - Copied test type dependencies from be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16) into `win-test.sh`, but made some adjustments to be compatible with 3.10 runtime (scipy version update) and just make rerun-tests compatible with the rest of the deps

I think in the long run, one needs to update 4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1 that currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
ghstack dependencies: #163339, #163341

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2025-09-25 09:06:52 -07:00
b0dc90881c [CD] Simplify NVIDIA driver installation step (#163349) (#163790)
Undo changes introduced in https://github.com/pytorch/pytorch/pull/160956 as driver has been updated to 580 for both fleets

Fixes https://github.com/pytorch/pytorch/issues/163342
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163349
Approved by: https://github.com/seemethere

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-25 10:40:57 -04:00
c0577aad39 Use cuda nvrtc so file based on cuda version used by torch (#163642) (#163788)
Fixes https://github.com/pytorch/pytorch/issues/162367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163642
Approved by: https://github.com/msaroufim
2025-09-25 10:40:09 -04:00
9952b87600 [CD] CUDA 13.0 fix preload logic to include nvidia/cu13/lib/ (#163766)
[CD] CUDA 13.0 fix preload logic to include nvidia/cu13/lib/ (#163661)

Preload logic no longer works with CUDA 13.0
See the installation path:
```
ls /home/ubuntu/.venv/lib/python3.10/site-packages/nvidia/cu13/lib/
libcheckpoint.so   libcudadevrt.a      libcufft.so.12   libcufile_rdma.so.1  libcusolver.so.12    libnvJitLink.so.13  libnvperf_target.so            libnvrtc.alt.so.13    libpcsamplingutil.so
libcublas.so.13    libcudart.so.13     libcufftw.so.12  libcupti.so.13       libcusolverMg.so.12  libnvblas.so.13     libnvrtc-builtins.alt.so.13.0  libnvrtc.so.13
libcublasLt.so.13  libcudart_static.a  libcufile.so.0   libcurand.so.10      libcusparse.so.12    libnvperf_host.so   libnvrtc-builtins.so.13.0      libnvtx3interop.so.1

ls /home/ubuntu/.venv/lib/python3.10/site-packages/nvidia/
cu13  cudnn  cusparselt  nccl  nvshmem
```
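
A minimal sketch of what a layout-aware lookup could look like, given the directory listing above. The function name `find_cuda_lib` and the folder names passed to it are illustrative assumptions; this is not PyTorch's actual preload code.

```python
from pathlib import Path


def find_cuda_lib(site_packages, lib_folders, lib_glob):
    """Return the first library matching lib_glob under nvidia/<folder>/lib, or None."""
    for folder in lib_folders:
        lib_dir = Path(site_packages) / "nvidia" / folder / "lib"
        if not lib_dir.is_dir():
            continue  # this layout is not present in the environment
        matches = sorted(lib_dir.glob(lib_glob))
        if matches:
            return matches[0]
    return None


# Example call: try a per-component folder first, then the unified cu13 folder.
print(find_cuda_lib("/nonexistent/site-packages", ("cuda_runtime", "cu13"), "libcudart.so.*"))
```

A search that only knows per-component folders would miss `nvidia/cu13/lib/`, which is why the unified folder has to be included in the candidates.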

Test using script from : https://github.com/pytorch/pytorch/issues/162367
```
Kernel test passed!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163661
Approved by: https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/Camyll

(cherry picked from commit 141fc7276ebc722b6076cc3afe4fbc6307a1b775)

Co-authored-by: atalman <atalman@fb.com>
2025-09-25 10:38:16 -04:00
300bade202 [Cherry-Pick] [CD] CUDA 13 specific followup changes. Remove sm50-70 From CUDA 12.6 and CUDA 12.8 builds (#162455) (#163764)
* [CD] CUDA 13 specific followup changes (#162455)

Follow up for CUDA 13 bring up https://github.com/pytorch/pytorch/issues/159779
sm50-70 should not be added to sbsa build arch list, as previous archs had no support for arm.
remove platform_machine from PYTORCH_EXTRA_INSTALL_REQUIREMENTS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162455
Approved by: https://github.com/atalman

* update

---------

Co-authored-by: Ting Lu <tingl@nvidia.com>
2025-09-25 10:37:52 -04:00
96f0c0fa07 Fix some edge cases (#163106)
Fix some edge cases (#162295)

``` Summary
🔝 Top 5 Performance Differences (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬─────────────────────────────┬───────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)       ┆ TFlops BWD (base) ┆ TFlops BWD (no_peel) ┆ no_peel_speedup_over_base ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                         ┆ ---               ┆ ---                  ┆ ---                       ┆ ---       │
│ str            ┆ str            ┆ str                         ┆ f64               ┆ f64                  ┆ f64                       ┆ f64       │
╞════════════════╪════════════════╪═════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════╪═══════════╡
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 56.937931         ┆ 58.960459            ┆ 1.035522                  ┆ 3.552163  │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 128) ┆ 89.221306         ┆ 86.295642            ┆ 0.967209                  ┆ -3.27911  │
│ causal         ┆ torch.bfloat16 ┆ (2, 16, 4096, 4, 4096, 128) ┆ 111.552594        ┆ 114.380841           ┆ 1.025353                  ┆ 2.535349  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, 1024, 64) ┆ 74.830149         ┆ 76.685445            ┆ 1.024793                  ┆ 2.479344  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 55.279932         ┆ 56.369312            ┆ 1.019707                  ┆ 1.97066   │
└────────────────┴────────────────┴─────────────────────────────┴───────────────────┴──────────────────────┴───────────────────────────┴───────────┘

🔺 Top 5 Cases Where no_peel (change) is Faster than base (baseline):
shape: (5, 7)
┌────────────────┬────────────────┬─────────────────────────────┬───────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)       ┆ TFlops BWD (base) ┆ TFlops BWD (no_peel) ┆ no_peel_speedup_over_base ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                         ┆ ---               ┆ ---                  ┆ ---                       ┆ ---       │
│ str            ┆ str            ┆ str                         ┆ f64               ┆ f64                  ┆ f64                       ┆ f64       │
╞════════════════╪════════════════╪═════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════╪═══════════╡
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 56.937931         ┆ 58.960459            ┆ 1.035522                  ┆ 3.552163  │
│ causal         ┆ torch.bfloat16 ┆ (2, 16, 4096, 4, 4096, 128) ┆ 111.552594        ┆ 114.380841           ┆ 1.025353                  ┆ 2.535349  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, 1024, 64) ┆ 74.830149         ┆ 76.685445            ┆ 1.024793                  ┆ 2.479344  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 55.279932         ┆ 56.369312            ┆ 1.019707                  ┆ 1.97066   │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 4096, 4, 4096, 64)  ┆ 111.08814         ┆ 112.447047           ┆ 1.012233                  ┆ 1.22327   │
└────────────────┴────────────────┴─────────────────────────────┴───────────────────┴──────────────────────┴───────────────────────────┴───────────┘

🔻 Top 5 Cases Where no_peel (change) is Slower than base (baseline):
shape: (5, 7)
┌────────────────┬────────────────┬─────────────────────────────┬───────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)       ┆ TFlops BWD (base) ┆ TFlops BWD (no_peel) ┆ no_peel_speedup_over_base ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                         ┆ ---               ┆ ---                  ┆ ---                       ┆ ---       │
│ str            ┆ str            ┆ str                         ┆ f64               ┆ f64                  ┆ f64                       ┆ f64       │
╞════════════════╪════════════════╪═════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 128) ┆ 89.221306         ┆ 86.295642            ┆ 0.967209                  ┆ -3.27911  │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 4, 1024, 64)  ┆ 78.23082          ┆ 76.693169            ┆ 0.980345                  ┆ -1.965531 │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 4, 2048, 128) ┆ 96.95663          ┆ 95.573333            ┆ 0.985733                  ┆ -1.426717 │
│ alibi          ┆ torch.bfloat16 ┆ (4, 16, 2048, 4, 2048, 64)  ┆ 93.373473         ┆ 92.294147            ┆ 0.988441                  ┆ -1.155924 │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 2048, 4, 2048, 128) ┆ 96.95147          ┆ 96.105389            ┆ 0.991273                  ┆ -0.872685 │
```
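
The last two columns of the tables above appear to be derived from the two TFlops columns. A short sketch of that arithmetic, assuming `no_peel_speedup_over_base` is the ratio of the two throughputs and `pct_delta` is that ratio expressed as a percentage change (the function name is hypothetical):

```python
def speedup_and_pct_delta(base_tflops, new_tflops):
    """Return (speedup over base, percentage change relative to base)."""
    speedup = new_tflops / base_tflops
    return speedup, (speedup - 1.0) * 100.0


# First row of the tables: sliding_window, shape (2, 16, 1024, 4, 1024, 64).
speedup, pct = speedup_and_pct_delta(56.937931, 58.960459)
print(f"speedup={speedup:.6f} pct_delta={pct:.4f}")
```

Values above 1.0 (positive `pct_delta`) mean the change is faster than the baseline; values below 1.0 mean it is slower.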

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162295
Approved by: https://github.com/mlazos, https://github.com/v0i0

(cherry picked from commit 864ffe12d737403230e8257b9bce0a830bd590c1)

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-09-25 10:29:39 -04:00
7d024a6e29 [BE] Update Python min version to 3.10 (#162310) (#163802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162310
Approved by: https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
ghstack dependencies: #162862

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2025-09-24 15:48:19 -07:00
be29c5b207 Add analytics ID to cpp docs (#163695)
Add analytics ID to cpp docs (#163370)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163370
Approved by: https://github.com/albanD

(cherry picked from commit e6a9db58d71e474deac28276de1f611638c32eeb)

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-09-24 15:45:17 -07:00
5322dab793 Update pytorch.org links in docs/conf.py (#163703)
Update pytorch.org links in docs/conf.py (#163682)

Update links in conf.py to docs.pytorch.org

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163682
Approved by: https://github.com/sekyondaMeta, https://github.com/albanD

(cherry picked from commit 8c8416b021e59a5ec58aceb38eeffc63885a28bc)

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-09-24 15:44:43 -07:00
1dadb6196b [BE] Introduce CONDA_ROOT_DIR (#163805)
[BE] Introduce `CONDA_ROOT_DIR` (#163341)

Which is equal to `%CONDA_PARENT_DIR%/Miniconda3`, and replaces this pattern with `%CONDA_ROOT_DIR%` throughout the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163341
Approved by: https://github.com/clee2000
ghstack dependencies: #163339

(cherry picked from commit a273475b01e912f402378a522bb9c4ed37e8413a)

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-09-24 15:42:16 -07:00
6c058c1262 Move ROCM trunk wheel builds to 3.10 (#163804)
Move ROCM trunk wheel builds to 3.10 (#163339)

This code is a delicious spaghetti: Sometimes python version is defined in jinja template (see https://github.com/pytorch/pytorch/pull/162297 ) sometimes in shell script (see https://github.com/pytorch/pytorch/pull/162877 ), but this time around it's in a python file (and there is another one called `generate_binary_build_matrix.py` that defines `FULL_PYTHON_VERSIONS`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163339
Approved by: https://github.com/clee2000

(cherry picked from commit 52dd7a898c117305b4407c7f26bbcc7b39f20aaa)

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-09-24 15:41:55 -07:00
715dca6725 [export] Remove .contiguous() when saving weights to raw bytes (#163662)
[export] Remove .contiguous() when saving weights to raw bytes (#163587)

Summary: `.contiguous()` will discard the original storage size of the tensor, and could lead to issues during loading.

Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_1D_tensor_slicing
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_2D_tensor_slicing

Differential Revision: D83016250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163587
Approved by: https://github.com/angelayi

(cherry picked from commit 720a7b2887ca4efc8d63b32373182bc97918c76e)

Co-authored-by: Yiming Zhou <yimingzhou@meta.com>
2025-09-23 10:15:06 -07:00
47cb45e4f6 Update pytorch_sphinx_theme2 to latest hash (#163655)
Update pytorch_sphinx_theme2 to latest hash (#163269)

The updated theme:
- Fixes articleBody in the json+ld that caused previous Google Search issues
- Other minor fixes
- 404.html fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163269
Approved by: https://github.com/albanD

(cherry picked from commit 68e75be86ab618bb6b1dc32b603a780ff6046262)

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-09-23 10:13:51 -07:00
4966d058f2 CUDA 13.0 Warning update for supported architectures (#163633)
CUDA 13.0 Warning update for supported architectures (#163585)

Please see build script: 8da008678f/.ci/manywheel/build_cuda.sh (L69-L71)

This should display correct warning:
``
Please install PyTorch with a following CUDA
configurations: 12.6 12.8 13.0 following instructions at
https://pytorch.org/get-started/locally/
``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163585
Approved by: https://github.com/malfet

(cherry picked from commit 3c64b2abab5a23809140da5bd6272307b776e459)

Co-authored-by: atalman <atalman@fb.com>
2025-09-23 10:13:06 -07:00
579794ed7b [SymmMem] Fix put_signal + wait_until hang (#163458)
[SymmMem] Fix put_signal + wait_until hang (#163194)

The test used a wrong ptr to refer to remote address:
```
            dst_ptr = out_hdl.buffer_ptrs[peer]
            src_ptr = inp_hdl.buffer_ptrs[rank]
            sig_ptr = out_hdl.signal_pad_ptrs[peer]
```
All three indices should be `rank` instead of `peer` because NVSHMEM APIs accept a local address as input and perform translation internally. Without the correct signal address, the peer would keep waiting and thus hang.

Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensor instead of pointer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163194
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152

(cherry picked from commit 80f8be9840c20c3efe1274266b52ab098f4d1030)

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-09-23 10:10:02 -07:00
7cf37ae3cb [2.9 cherry pick][triton] update 3.5 pin to bbb06c0334a6772b92d24bde54956e675c8c6604 (#163382) (#163583)
Includes:
* https://github.com/triton-lang/triton/pull/8211 to work around a PTXAS bug that was causing 03-matrix-multiplication tutorial matmuls to underperform due to excessive WGMMA waits
* https://github.com/triton-lang/triton/pull/8157 to fix a convert_layout bug

Verified that this passes Triton CI in https://github.com/pytorch/pytorch/pull/159158 and improves gemm perf (see https://github.com/pytorch/pytorch/issues/159704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163382
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-09-22 18:20:20 -07:00
f83cf0714e [graph partition] Add way to register custom rule (#163310) (#163395)
This PR adds an experimental way to register a custom rule for whether
inductor should partition the graph around an operator.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163310
Approved by: https://github.com/ProExpertProg, https://github.com/BoyuanFeng, https://github.com/eellison
ghstack dependencies: #162117, #162307, #162651
2025-09-22 18:18:07 -07:00
ddd5074afc [CI] Update NVIDIA driver to 580.82.07 (#163522)
[CI] Update NVIDIA driver to `580.82.07` (#163111)

To make CI machines capable of running CUDA-13 tests. Unfortunately, this upgrade regresses NUMBA integration, so live patch it with 6e08c9d08e

This fix was suggested in https://github.com/pytorch/pytorch/issues/162878#issuecomment-3288635745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163111
Approved by: https://github.com/huydhn

(cherry picked from commit 8dbac62edb48815dfca84dfdcca40d6a24d0652b)

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2025-09-22 11:45:48 -04:00
35c55da805 [Graph Partition] improve custom op output alias (#163380)
[Graph Partition] improve custom op output alias (#163227)

For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1 # <--- if buf1 is not accessed in the future
```

If `buf1` is not accessed in the future, it's good to deallocate early. So we don't delay `del` until both buf3 and buf4 are not used anymore. Note that buf3 and buf4 hold reference to the data such that `del buf1` does not prevent their usage.
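
The same reference behavior can be seen in plain Python. This is an analogy only, with a hypothetical stub standing in for the custom op: the outputs stay usable after the container name is deleted.

```python
def op1_stub():
    """Hypothetical stand-in for a multi-output op; returns two buffers."""
    return (bytearray(b"first"), bytearray(b"second"))


def use_outputs_after_del():
    buf1 = op1_stub()
    buf3 = buf1[0]
    buf4 = buf1[1]
    del buf1  # only the tuple name goes away; buf3 and buf4 still own their data
    buf3[0:1] = b"F"  # the outputs remain readable and writable
    return bytes(buf3), bytes(buf4)


print(use_outputs_after_del())
```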

However, when there are mutating args, we don't see `del buf1` immediately.

```python
@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)?  x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```

<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />

Why? Because `buf3` is a MultiOutput with `buf1` as input and believes `buf1` (an output of FallbackKernel op1) has inputs that alias output.
72fedf0575/torch/_inductor/ir.py (L7976-L7982)

According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, buf1's output should NOT alias any of the inputs. This PR improves get_inputs_that_alias_output of FallbackKernel.

Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163227
Approved by: https://github.com/zou3519

(cherry picked from commit 4967ad8baa724b8b1acc123698bb1265723feb87)

Co-authored-by: Boyuan Feng <boyuan@meta.com>
2025-09-19 16:36:03 -07:00
a576d48637 Skip test_ind_worker_queue on Windows and macOS (flaky) (#163363)
Skip test_ind_worker_queue on Windows and macOS (flaky) (#162555)

Fixes https://github.com/pytorch/pytorch/issues/68643

It was closed by the bot yesterday and the issue was still there https://github.com/pytorch/pytorch/actions/runs/17595694816/job/49989589647.  It's better to just skip it directly in the code as this test has been disabled on Windows and MacOS since 2021 O_o
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162555
Approved by: https://github.com/clee2000

(cherry picked from commit 98e22c8a693644c6d235d7a858dc411b1aefafa7)

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-19 13:07:00 -07:00
25d8c0be68 Add decomp rule to assert_tensor_metadata for BatchedTensors (#163361)
Add decomp rule to assert_tensor_metadata for BatchedTensors  (#163008)

Whenever there is a device move, export introduces the assert_tensor_metadata aten operator to guard for device specialization. This aten op didn't work with vmap because we hadn't registered an explicit decomp rule saying that we just skip the BatchedTensor and call the op on the underlying tensor.

Differential Revision: [D82483979](https://our.internmc.facebook.com/intern/diff/D82483979)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163008
Approved by: https://github.com/huydhn

(cherry picked from commit e28983be76aa4651e3cb69dc3a4234d75038d938)

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
2025-09-19 13:00:57 -07:00
b1aae80953 [Cherry Pick][Graph Partition] allow sharing default device context (#163097)
cherry pick PR 162873
2025-09-19 11:10:29 -07:00
eqy
76bebf38de [Release 2.9] [cuDNN][SDPA][submodule] Roll-back cuDNN frontend upgrade, update Met… (#163265)
[cuDNN][SDPA][submodule] Roll-back cuDNN frontend upgrade, update Meta registration (#163104)

For https://github.com/pytorch/torchtitan/issues/1713

Also note that we will need to roll back the cuDNN frontend upgrade in 2.9, as it currently introduces a segmentation fault by assuming tensors have their strides and sizes populated at graph creation time 1a7b4b78db/include/cudnn_frontend/node/sdpa_support_surface.h (L447)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163104
Approved by: https://github.com/drisspg
2025-09-19 10:53:04 -07:00
bc158ebdc7 [SymmMem] Fix NVSHMEM plugin + Triton 3.5 (#163262)
[SymmMem] Fix NVSHMEM plugin + Triton 3.5 (#163152)

1. The dispatch signatures defined in the `core.extern_elementwise` call must match the C signatures of the NVSHMEM functions, in particular the dtypes. Otherwise, there would be weird errors, such as an IMA (illegal memory access) or a hang. When matched, most of the time the NVSHMEM device function will be inlined into the generated PTX. When not matched, it is represented as a function call in the PTX (not sure if it is the function call that goes wrong).

2. When calling the `core.extern` wrappers from the `triton.jit` kernels, the input must be cast to match the signatures defined in 1, e.g. via `nbytes.to(tl.int64)`. Otherwise, Triton will report a key error when searching for such kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163152
Approved by: https://github.com/ngimel
ghstack dependencies: #163025

(cherry picked from commit 57a54a04b6eb78e0aa7d13b48e25fb8c0c49fd60)

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-09-19 10:51:02 -07:00
ffa6f63fe2 Revert "Make distributed modules importable even when backend not bui… (#163024)
Revert "Make distributed modules importable even when backend not built (#159889)" (#162568)

This reverts commit a0d026688cd69583d5a4e0c6f3e5fda141a7f4a9.

Revert "Always build USE_DISTRIBUTED. (#160449)"

This reverts commit d80297a6846f1f2c36fd4f19e22919f2abe8fcea.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162568
Approved by: https://github.com/huydhn

Co-authored-by: Edward Yang <ezyang@meta.com>
2025-09-19 10:34:55 -07:00
baab5c6c8b [ONNX] Update export docstring & Set fallback=False by default (#162637)
* [ONNX] Update export docstring (#162622)

Update export docstring to reflect the latest configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162622
Approved by: https://github.com/titaiwangms

(cherry picked from commit 7e2e83cdbe532b230dee40cfe0454116c9b64710)

* Change fallback option to False in ONNX export

* Change fallback parameter default to False

---------

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-09-16 17:23:47 -07:00
9718af107e Support vmap + custom autograd function/improve DTensor constructor inefficiency (#162738)
Support vmap + custom autograd function/improve DTensor constructor inefficiency (#162240)

This makes gemma3 exportable on transformers=4.55.4

In HF, there is a torch function mode called TransformGetItemToIndex which internally calls a custom autograd function. When this custom autograd function is called under vmap, it triggers CustomFunctionHigherOrderOP, which errored because there was no pre-dispatch proxy mode implementation.

Since there have been a number of requests lately to add various operators to the pre-dispatch IR, I introduce a decorator in export that works similarly to `allow_in_graph`. Basically:
1) We intercept custom_autograd_function.apply at pre-dispatch mode when this decorator is applied
2) We apply `flat_apply` HOP to hide the pytree spec for this autograd function. Note that this adds restriction that this custom autograd function needs to take in fx-able types.
3) The subclass constructor decorator is implemented similarly, so we just refactor it to share the implementation of this new decorator. Eventually we should delete the subclass constructor decorator.
4) Move some code in the subclass constructor decorator to exit early in a non-export environment, which should shave off some inefficiency (around 1% according to @swolchok 's benchmark)

Fixes: https://github.com/pytorch/pytorch/issues/161563#issuecomment-3246309758

Differential Revision: [D82141316](https://our.internmc.facebook.com/intern/diff/D82141316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162240
Approved by: https://github.com/ydwu4

(cherry picked from commit 463fbc8ca0537e5635236190d2ca38ce6fcef831)

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
2025-09-16 17:22:16 -07:00
7f8ba48c2a Fix the regression issue caused by non-aarch64 platforms not hitting the MKLDNN path. (#162778)
Fix the regression issue caused by non-aarch64 platforms not hitting the MKLDNN path. (#162168)

This issue was introduced by the commit in issue #161065. Added an extra check to provide a proper path for other platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162168
Approved by: https://github.com/mingfeima, https://github.com/malfet


(cherry picked from commit 563921619b3e820b170475b9278ff94ee6e1a32c)

Co-authored-by: Yuxingwang-intel <yuxing.wang@intel.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-16 17:21:10 -07:00
aebf427c53 [Release 2.9] Update torch-xpu-ops commit pin (#162935)
Update commit pin to f8408a
2025-09-16 17:19:31 -07:00
44baf2ff8d fix deterministic scatter_add path for multi-d tensors (#162977)
fix deterministic scatter_add path for multi-d tensors (#162866)

Previously, `select` didn't work correctly for tensors with more than 2 dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162866
Approved by: https://github.com/valentinandrei

(cherry picked from commit bf6b40da3e3be7718b8ddc94eed2da8cabaa5e86)

Co-authored-by: Natalia Gimelshein <ngimel@meta.com>
2025-09-16 17:17:36 -07:00
1076941ff7 [ONNX] Fix rotary_embedding_23 implementation (#163041)
[ONNX] Fix rotary_embedding_23 implementation (#162865)

The implementation of rotary_embedding_23 when input is 3D was incorrect.

## Tested

Locally with

```py
import onnx_ir as ir
import onnx
import torch
import os
import numpy as np

base_path = "/home/justinchu/dev/onnx/onnx/backend/test/data/node"
test_names = [
    "test_rotary_embedding",
    "test_rotary_embedding_3d_input",
    "test_rotary_embedding_interleaved",
    "test_rotary_embedding_no_position_ids",
    "test_rotary_embedding_no_position_ids_interleaved",
    "test_rotary_embedding_no_position_ids_rotary_dim",
    "test_rotary_embedding_with_interleaved_rotary_dim",
    "test_rotary_embedding_with_rotary_dim",
]
model_paths = [os.path.join(base_path, name) for name in test_names]

for path in model_paths:
    print(f"Checking {path} for issues...")

    model = onnx.load(os.path.join(path, "model.onnx"))
    input0 = ir.from_proto(
        onnx.load_tensor(os.path.join(path, "test_data_set_0", "input_0.pb"))
    ).numpy()
    input1 = ir.from_proto(
        onnx.load_tensor(os.path.join(path, "test_data_set_0", "input_1.pb"))
    ).numpy()
    input2 = ir.from_proto(
        onnx.load_tensor(os.path.join(path, "test_data_set_0", "input_2.pb"))
    ).numpy()
    if os.path.exists(os.path.join(path, "test_data_set_0", "input_3.pb")):
        input3 = ir.from_proto(
            onnx.load_tensor(os.path.join(path, "test_data_set_0", "input_3.pb"))
        ).numpy()
    else:
        input3 = None
    output0 = ir.from_proto(
        onnx.load_tensor(os.path.join(path, "test_data_set_0", "output_0.pb"))
    ).numpy()

    m = ir.from_proto(model)

    node = m.graph[-1]
    print(node)
    assert node.op_type == "RotaryEmbedding"

    interleaved = node.attributes.get_int("interleaved", 0)
    num_heads = node.attributes.get_int("num_heads", 0)
    rotary_embedding_dim = node.attributes.get_int("rotary_embedding_dim", 0)

    torch_out = torch.onnx.ops.rotary_embedding(
        torch.tensor(input0),
        torch.tensor(input1),
        torch.tensor(input2),
        position_ids=torch.tensor(input3) if input3 is not None else None,
        interleaved=bool(interleaved),
        num_heads=num_heads,
        rotary_embedding_dim=rotary_embedding_dim,
    )
    torch_out = torch_out.detach().cpu().numpy()
    np.testing.assert_allclose(torch_out, output0)
```

Fix https://github.com/pytorch/pytorch/issues/162848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162865
Approved by: https://github.com/kunal-vaishnavi, https://github.com/titaiwangms

(cherry picked from commit fdf68fa5d70abebee1c5090a51ea30c7aa40b9b0)

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-09-16 17:16:23 -07:00
0ac9fa4413 [ez][CI] Fix docs push in nightly workflow (#163085)
[ez][CI] Fix docs push in nightly workflow (#162657)

HUD metrics page says docs push hasn't happened in 21 days
<img width="293" height="142" alt="image" src="https://github.com/user-attachments/assets/f930aab8-0503-4bf2-b962-8c375dec6b78" />

I guess main branch docs just haven't been updated?  Did anyone notice?  Do we care?

Either way I think this should fix it

Likely started after https://github.com/pytorch/pytorch/pull/161182
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162657
Approved by: https://github.com/huydhn

(cherry picked from commit 2f533959430c2a41fe16ef79fe4d680a5c4e0585)

Co-authored-by: Catherine Lee <csl@fb.com>
2025-09-16 12:04:17 -07:00
152383b745 fix typo: summit -> submit (#162597)
fix typo: summit -> submit (#162587)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162587
Approved by: https://github.com/justinchuby

(cherry picked from commit fefc406a3d0d90db0f808419fb88045f90b213cd)

Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
2025-09-12 11:41:11 -04:00
c31a8186c1 [CD] Aarch64 Fix packaging `libarm_compute.so` and other libraries to the aarch64 CUDA wheels (#162596)
[CD] Aarch64 Fix packaging ``libarm_compute.so`` and other libraries to the aarch64 CUDA wheels (#162566)

Fixes aarch64 Linux packaging, which fails with the following error:
https://github.com/pytorch/vision/actions/runs/17612462583/job/50037380487#step:15:62
```
Traceback (most recent call last):
  File "/__w/vision/vision/pytorch/vision/setup.py", line 13, in <module>
    import torch
  File "/__w/_temp/conda_environment_17612462583/lib/python3.11/site-packages/torch/__init__.py", line 415, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libarm_compute.so: cannot open shared object file: No such file or directory
```
Due to missing dependencies.

Current (erroneous) behavior:
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is renamed to torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
Hence the repackaging does not take effect.

This PR does the following:
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is deleted
File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl

It looks like renaming the wheel is no longer necessary after migrating from zipping the wheel to wheel pack. Hence this PR removes the rename and deletes the old file instead. Log from the current flow, which still renames:
```
2025-09-10T10:10:05.9652454Z Using nvidia libs from pypi - skipping CUDA library bundling
2025-09-10T10:10:05.9656595Z Copying to /pytorch/dist/tmp/torch/lib/libgomp.so.1
2025-09-10T10:10:05.9873843Z Copying to /pytorch/dist/tmp/torch/lib/libgfortran.so.5
2025-09-10T10:10:06.0410041Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute.so
2025-09-10T10:10:06.2869242Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute_graph.so
2025-09-10T10:10:06.4385740Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_lp64_gomp.so.0
2025-09-10T10:10:06.5461372Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_lp64_gomp.so.0
2025-09-10T10:10:06.5728970Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_core.so.0
2025-09-10T10:10:06.6231872Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_core.so.0
2025-09-10T10:10:14.1503110Z Updated tag from Tag: cp310-cp310-linux_aarch64
2025-09-10T10:10:14.1503482Z  to Tag: cp310-cp310-manylinux_2_28_aarch64
2025-09-10T10:10:14.1503682Z
2025-09-10T10:10:41.6498892Z Repacking wheel as /pytorch/dist/torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
2025-09-10T10:10:41.9394460Z Renaming torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl wheel to torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
```
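
The corrected flow (repack writes the retagged file, then the stale wheel is deleted rather than renamed) can be sketched as follows. This is a minimal illustration, not the build script from this PR: `retag_filename` and `finish_repack` are hypothetical helpers, and the repack step is replaced by a stand-in that just writes a file.

```python
import tempfile
from pathlib import Path


def retag_filename(wheel_name: str, new_platform: str) -> str:
    """Swap the platform tag (the last dash-separated field) of a wheel filename."""
    stem = wheel_name[: -len(".whl")]
    parts = stem.split("-")
    parts[-1] = new_platform
    return "-".join(parts) + ".whl"


def finish_repack(dist_dir: Path, old_name: str, new_platform: str) -> Path:
    """Hypothetical helper: write the retagged wheel, then remove the stale one."""
    old_wheel = dist_dir / old_name
    new_wheel = dist_dir / retag_filename(old_name, new_platform)
    # Stand-in for the real repack step, which already writes the retagged file.
    new_wheel.write_bytes(b"repacked wheel contents")
    # No rename: renaming the old wheel here would clobber the repacked one.
    if old_wheel != new_wheel:
        old_wheel.unlink()
    return new_wheel


with tempfile.TemporaryDirectory() as tmp:
    dist = Path(tmp)
    old = "torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl"
    (dist / old).write_bytes(b"original wheel contents")
    result_name = finish_repack(dist, old, "manylinux_2_28_aarch64").name
    remaining = sorted(p.name for p in dist.iterdir())
    contents = (dist / result_name).read_bytes()
```

After the call, only the manylinux wheel remains and it carries the repacked contents.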

Test Plan, Executed on local file:
```
  inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/WHEEL
  inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/entry_points.txt
  inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/top_level.txt
  inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/RECORD
Bundling CUDA libraries with wheel
Updated tag from Tag: cp310-cp310-manylinux_2_28_aarch64
 to Tag: cp310-cp310-manylinux_2_28_aarch64

Repacking wheel as ubuntu/dist/torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
Copying torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl to artifacts
Build Complete. Created torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl..
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162566
Approved by: https://github.com/jeanschmidt, https://github.com/NicolasHug

(cherry picked from commit 3d32bb114bf0d5bd0193dc40f20253635dddf080)

Co-authored-by: atalman <atalman@fb.com>
2025-09-10 12:22:02 -04:00
ce928e17c1 CUDA 13.0 Windows Nvidia Driver Update to 580.88 (#162501)
CUDA 13.0 Windows Nvidia Driver Update to 580.88 (#162425)

Related to https://github.com/pytorch/pytorch/issues/162333
https://github.com/pytorch/pytorch/issues/159779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162425
Approved by: https://github.com/tinglvv, https://github.com/malfet

(cherry picked from commit e38e953432764e00f16999c8b7df6346ad357a16)

Co-authored-by: atalman <atalman@fb.com>
2025-09-09 14:27:57 -04:00
cd2c98a5b5 [Release 2.9] Release only changes (#162493) 2025-09-09 11:15:20 -07:00
4840a1a591 Run vLLM tests on all trunk commits before 2.9 branch cut (#161797)
This makes it easier to bisect issues now, given that we don't have a lot of time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161797
Approved by: https://github.com/yangw-dev
2025-09-09 05:56:41 +00:00
d49205fe1f Add more tests for vllm and clean out the old vllm test (#162292)
Test failure coverage from pytorch 2.8 release issues
[internal access only](https://docs.google.com/document/d/1zvK1eUAHubHGGHg9jKxd-QlP89fzgfqOBvE2m9mUs90/edit?tab=t.0)

See coverage mapping
| Given test / pattern | Suite ID (from config) |
|---|---|
| pytest -v -s basic_correctness/test_cumem.py | vllm_basic_correctness_test |
| pytest -v -s entrypoints/openai/test_sleep.py | vllm_entrypoints_test |
| pytest -v -s entrypoints/openai/test_translation_validation.py::test_long_audio_request | vllm_entrypoints_test |
| pytest -v -s lora/test_quant_model.py | vllm_lora_28_failure_test |
| pytest -v -s -x tests/lora/test_llama_tp.py | vllm_lora_tp_test_distributed |
| pytest -v -s distributed/test_sequence_parallel.py -k test_tp_sp_generation |vllm_distributed_test_28_failure_test |
| pytest -v -s distributed/test_sequence_parallel.py::test_tp_sp_generation[...] | vllm_distributed_test_28_failure_test |
| pytest models/language/generation/test_mistral.py::test_models[...] | vllm_languagde_model_test_extended_generation_28_failure_test |
| pytest models/multimodal/pooling/test_jinavl_reranker.py::test_model_text_image[...] | vllm_multi_model_test_28_failure_test |
| tests/lora/test_qwen2vl.py::test_qwen2vl_lora | vllm_lora_test |
| tests/lora/test_qwen2vl.py::test_qwen25vl_lora | vllm_lora_test |
| tests/lora/test_qwen2vl.py::test_qwen2vl_lora_beam_search | vllm_lora_test |
| tests/lora/test_phi.py::test_phi2_lora | DIDN'T FIND IT IN VLLM |
| models/multimodal/generation/test_voxtral.py::test_models_with_multiple_audios[5-128-half] | vllm_multi_model_test_28_failure_test |
| models/test_initialization.py::test_can_initialize[VoxtralForConditionalGeneration] | vllm_basic_models_test |
| pytest -v -s -x lora/test_chatglm3_tp.py -k test_chatglm3_lora_tp4_fully_sharded_loras | vllm_lora_tp_test_distributed |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162292
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-09-09 05:53:46 +00:00
d85392a88e Add BundledAOTAutogradSerializableCallable (#162170)
This PR hooks up the python wrapper inductor backend to aot_compile. This is *not* the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now.

In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162170
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162169
2025-09-09 05:42:19 +00:00
7feb8fc589 [SymmMEM] Allow to import _SymmetricMemory when NVSHMEM is not available (#162142)
Summary:
As we have multiple backends, _SymmetricMemory should not be imported together with NVSHMEM related modules

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162142
Approved by: https://github.com/dcci, https://github.com/kwen2501
2025-09-09 05:37:43 +00:00
60d009267e Revert "testing infra and some fixes (#162183)"
This reverts commit d8b6622bb6a3879d3832ab6cdc26ff4188ea4a2d.

Reverted https://github.com/pytorch/pytorch/pull/162183 on behalf of https://github.com/huydhn due to Failing a test on macos ([comment](https://github.com/pytorch/pytorch/pull/162183#issuecomment-3268922096))
2025-09-09 05:26:32 +00:00
4590438329 [fx] fix qualified name for methods of torch.Tensor (#162407)
This fixes an error in the previous PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162407
Approved by: https://github.com/ezyang, https://github.com/XuehaiPan
2025-09-09 05:14:43 +00:00
8494afb837 Add missing fstream include to fix std::ofstream compilation error (#162421)
## Summary
This PR adds a missing `#include <fstream>` to fix a compilation error that occurred with the clang compiler on the standard *Google internal compile setup* (built with bazel).

## Details
The `std::ofstream` type was implicitly instantiated, which can cause compilation to fail with certain compilers. In this case, the clang compiler within the Google internal compile setup failed with an implicit instantiation error of `std::basic_ofstream<char>`. By explicitly including the `<fstream>` header, this PR resolves the error and ensures proper compilation in a wider range of setups and compilers.

## Error message:
```
torch/csrc/distributed/c10d/FlightRecorder.cpp:8:17: error: implicit instantiation of undefined template 'std::basic_ofstream<char>'
8 | std::ofstream file(filename_, std::ios::binary);
| ^
libcxx/include/__fwd/fstream.h:26:7: note: template is declared here
26 | class basic_ofstream;
| ^
1 error generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162421
Approved by: https://github.com/ezyang
2025-09-09 05:14:32 +00:00
7ad40de60e [audio hash update] update the pinned audio hash (#162437)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162437
Approved by: https://github.com/pytorchbot
2025-09-09 04:41:34 +00:00
607327beae [vllm hash update] update the pinned vllm hash (#162356)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162356
Approved by: https://github.com/pytorchbot
2025-09-09 04:40:25 +00:00
f216d64bfe [SymmMem] Better tuning of A2AV based on accurate node boundary (#162003)
Use `world_within_direct_access()` to distinguish intra- vs inter- node.
Previously we assumed a fixed node size of 8, which is not true for NVL72.

Also added env var `TORCH_SYMMMEM_NBLOCKS` for control.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162003
Approved by: https://github.com/ngimel, https://github.com/fduwjj
2025-09-09 04:18:17 +00:00
847d7f21af [CUDA-13] Implement workaround for cudaErrorNotSupported (#162412)
See https://github.com/pytorch/pytorch/issues/162333#issuecomment-3267929585
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162412
Approved by: https://github.com/eqy, https://github.com/atalman
2025-09-09 04:12:10 +00:00
065c446193 [SymmMem] Use global pe for put and get (#162394)
NVSHMEM put/get APIs take global PE instead of local counterpart. So we'd need to do a translation within the kernel.
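
The local-to-global translation mentioned above can be sketched in plain Python. This is an assumption-laden illustration: `to_global_pe` and the rank lists are hypothetical, and the actual translation in this PR happens inside the device kernel.

```python
def to_global_pe(local_pe: int, group_ranks: list) -> int:
    """Translate a rank within a sub-group into its global PE number.

    `group_ranks[i]` is assumed to hold the global PE of the i-th group member.
    """
    if not 0 <= local_pe < len(group_ranks):
        raise IndexError(f"local PE {local_pe} is outside a group of size {len(group_ranks)}")
    return group_ranks[local_pe]


# Hypothetical world of 8 PEs; the sub-group covers the upper half.
upper_half = [4, 5, 6, 7]
peer_global = to_global_pe(1, upper_half)

# A strided sub-group, as might appear in an expert-parallel layout.
odd_ranks = [1, 3, 5, 7]
strided_global = to_global_pe(2, odd_ranks)
```

Passing the local index (1 or 2 here) directly to a put/get would target the wrong PE whenever the group is not the whole world.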

Also added a sub-group test for dispatch and combine, mimicking the Expert Parallel cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162394
Approved by: https://github.com/ngimel, https://github.com/fegin
ghstack dependencies: #162320
2025-09-09 03:58:48 +00:00
98ecc0f374 [SymmMem] Add team pool to hold duplicated teams for the same rank group (#162320)
When multiple threadblocks call device-side collectives concurrently, NVSHMEM requires each call to be made on a separate team struct, see [Collective operations scopes and active sets](https://docs.nvidia.com/nvshmem/api/gen/api/collectives.html?highlight=nvshmem_barrier_all#collective-operations-scopes-and-active-sets).

This PR adds a util `get_n_teams` for creating duplicated NVSHMEM teams for the same rank group, i.e. a team pool, so that we can use them on the device side.
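
The team-pool idea can be sketched as a small cache keyed by rank group. Everything here is hypothetical: `TeamPool` is an illustrative Python analogue, integer handles stand in for NVSHMEM team structs, and the signature of the real `get_n_teams` util is not shown in this description.

```python
import itertools


class TeamPool:
    """Hypothetical pool holding duplicated teams for each rank group."""

    def __init__(self):
        self._teams = {}  # tuple of ranks -> list of team handles
        self._next_handle = itertools.count()

    def get_n_teams(self, ranks, n):
        """Return n distinct teams for `ranks`, creating only the missing ones."""
        teams = self._teams.setdefault(tuple(ranks), [])
        while len(teams) < n:
            # Stand-in for creating a real team over the same rank group.
            teams.append(next(self._next_handle))
        return teams[:n]


pool = TeamPool()
first = pool.get_n_teams([0, 1], 2)   # creates two teams
again = pool.get_n_teams([0, 1], 3)   # reuses both, creates one more
other = pool.get_n_teams([2, 3], 1)   # a different rank group gets its own team
```

Each concurrent threadblock could then be handed a different entry of the returned list, so no two concurrent collective calls share a team.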

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162320
Approved by: https://github.com/ngimel
2025-09-09 03:58:48 +00:00
4c45090cf7 [DTensor] Check if tracing for sharding propagation to handle unhashable keys (#160798)
Fixes #159590

This is similar to the reverted commit #156868, except it resolves an issue with two caches becoming misaligned, leading to incorrect objects for stateful placements (i.e. `_MaskPartial`) as in issue #159601. This adds little to no overhead in eager ([see past benchmarks](https://github.com/pytorch/pytorch/pull/156868#issuecomment-3047831149)).

This also handles cases such as #159590 where dynamo is disabled during tracing by entering the Python Dispatcher ahead of the sharding propagation during compile. Tests are added/modified to handle these, and the list/tuple inputs with the cat op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160798
Approved by: https://github.com/bdhirsh
2025-09-09 03:52:05 +00:00
1641606aa4 Revert "Add BundledAOTAutogradSerializableCallable (#162170)"
This reverts commit 5babb4d5c04b1ff7ed5f96f7aea1898cd4faef5a.

Reverted https://github.com/pytorch/pytorch/pull/162170 on behalf of https://github.com/huydhn due to This PR has a merge conflict with D81793200 on aot_compile.py where PRs and diffs are landed in reverted order ([comment](https://github.com/pytorch/pytorch/pull/162170#issuecomment-3268735428))
2025-09-09 03:33:36 +00:00
7b8a64557d [inductor] fix 3d tiled online softmax (#162341)
The online_softmax_reduce runtime helper previously assumed that its tl.Tensor inputs are 2d tensors. But with tiled reduction, they can be 3d (y, x, r).
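
For context, the online softmax reduction combines partial (max, sum) pairs along the reduction axis, and that combine step does not depend on how many leading axes there are. The sketch below is a pure-Python illustration on nested lists, not the Triton helper itself; `combine` and `reduce_innermost` are hypothetical names.

```python
import math


def combine(a, b):
    """Merge two partial states (running max, sum of exp(x - running max))."""
    m_a, s_a = a
    m_b, s_b = b
    m = max(m_a, m_b)
    # Rescale each partial sum to the new shared maximum before adding.
    return m, s_a * math.exp(m_a - m) + s_b * math.exp(m_b - m)


def reduce_innermost(x):
    """Reduce the innermost axis of a nested list, whatever its depth (2d or 3d)."""
    if isinstance(x[0], list):
        return [reduce_innermost(row) for row in x]
    state = (x[0], 1.0)  # a single element: max is itself, exp(0) == 1
    for value in x[1:]:
        state = combine(state, (value, 1.0))
    return state


two_d = reduce_innermost([[1.0, 2.0, 3.0]])             # shape (x, r)
three_d = reduce_innermost([[[0.0, 0.0]], [[1.0, 3.0]]])  # shape (y, x, r)
```

The same `combine` handles both shapes; only the handling of the leading axes differs, which is the part a 2d-only helper gets wrong.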

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162341
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162311
2025-09-09 02:59:52 +00:00
d8b6622bb6 testing infra and some fixes (#162183)
This PR is quite large in that it covers most of the rough edges in the new strict export flow:

1. Handle nn_module_stack correctly now that we are tracing wrapper module
2. module_call_spec needs to get queried from source directly because we are not running the bytecode anymore.
3. Correct input and output handling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162167
2025-09-09 02:42:11 +00:00
a965f09793 [export] Update PT2 archive docs (#162308)
Summary: Minor updates based on the recent refactoring for weight saving and loading

Test Plan:
doc change only

Rollback Plan:

Differential Revision: D81821994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162308
Approved by: https://github.com/angelayi
2025-09-09 02:08:13 +00:00
583bbf7761 [MPS] Add native_dropout and native_dropout_backward (#162108)
Fixes #162002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162108
Approved by: https://github.com/malfet
2025-09-09 01:44:06 +00:00
e025c0f459 Dynamo: set_eval_frame microoptimization (#162220)
Optimize for common case and remove a pair of refcount operations (see new comments.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162220
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219
2025-09-09 01:10:06 +00:00
a8a187b2cf Overload _get_operation_for_overload_or_packet & friends to accept ArrayRef (#162219)
Avoids requiring vector allocation to call this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162219
Approved by: https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634, #161692
2025-09-09 01:10:06 +00:00
12db2a7889 Call checkLong in is_int_or_symint, completing TODO (#161692)
Calling this first minimizes overhead for plain old ints, making cheap things cheap.

Differential Revision: [D81530098](https://our.internmc.facebook.com/intern/diff/D81530098)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161692
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634
2025-09-09 01:10:06 +00:00
eab2afeff7 fastpath type Tensor in THPVariable_NewWithVar (#161634)
It is cheap to do an exact check against Tensor and much faster when it works (PyType_IsSubtype does not have this fastpath, I checked [source](9ee0214b5d/Objects/typeobject.c (L2889))). Spot-checked in perf on detach-DTensor-in-a-loop benchmark; small win but clear.
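
The fast path described above (an exact type comparison before the general subtype check) looks roughly like this in Python. The classes are hypothetical stand-ins, not `torch.Tensor` or `DTensor`, and the real change is in C++.

```python
class Tensor:
    """Hypothetical stand-in for the base tensor type."""


class DTensor(Tensor):
    """Hypothetical stand-in for a tensor subclass."""


def is_tensor_type(cls) -> bool:
    # Fast path: an identity comparison is cheap and covers the common case.
    if cls is Tensor:
        return True
    # Slow path: general subtype check, needed only for subclasses and non-tensors.
    return issubclass(cls, Tensor)


exact = is_tensor_type(Tensor)
subclass = is_tensor_type(DTensor)
unrelated = is_tensor_type(int)
```

The result is the same with or without the first branch; it only avoids the general subtype check when the type is exactly `Tensor`.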

Differential Revision: [D81530101](https://our.internmc.facebook.com/intern/diff/D81530101)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161634
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #161591, #161595, #161633
2025-09-09 01:10:06 +00:00
a951f435fd Avoid redundant PyTuple_GetSize call in _maybe_handle_torch_function (#161633)
py::args::size() calls PyTuple_GetSize. Compiler can't know the two calls will always return the same result, so we have to consolidate them ourselves.

Differential Revision: [D81530096](https://our.internmc.facebook.com/intern/diff/D81530096)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161633
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #161591, #161595
2025-09-09 01:10:06 +00:00
6eb14ac60f [Inductor] Fix cross-device scalar lowering - cpu scalar with cuda tensor fails in torch.compile (#161447)
This PR fixes a bug in TorchInductor where cross-device scalar indexing fails during compilation, causing discrepancies from eager mode behavior.

Fixes: #140457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161447
Approved by: https://github.com/mlazos
2025-09-09 01:07:02 +00:00
ed77e23b68 Revert "[dynamo] Constant fold torch.autograd._profiler_enabled (#158482)"
This reverts commit d7e1b8b11d7430c7633dcad6f6596b5df8fa02f7.

Reverted https://github.com/pytorch/pytorch/pull/158482 on behalf of https://github.com/borgstrom due to NCCL hangs in S560336 ([comment](https://github.com/pytorch/pytorch/pull/158482#issuecomment-3268426781))
2025-09-09 00:21:05 +00:00
897c4e70a7 Move to small wheel approach for CUDA SBSA wheel (#160720)
https://github.com/pytorch/pytorch/issues/160673

Use download.pytorch.org's dependencies like x86 build instead of bundling libs into the wheel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160720
Approved by: https://github.com/atalman
2025-09-09 00:18:43 +00:00
8485aac873 [precompile] Fix inlined source tracking with generators. (#162389)
Summary:
When compiled code has a generator, code.co_firstlineno will be inconsistent with the result from inspect.getsource, which returns the toplevel enclosing code source rather than the inner code location.

In this case, it seems simpler to just use the toplevel enclosing code location rather than the co_firstlineno field.
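
The line-number mismatch can be seen with only the standard library. In this sketch (the function `outer` is a made-up example, not code from the PR), the nested generator's code object reports its own first line, which differs from that of the enclosing function.

```python
import types


def outer(values):
    total = 0
    squares = (v * v for v in values)
    for s in squares:
        total += s
    return total


def nested_code_objects(code):
    """Return the code objects stored directly in code.co_consts."""
    return [c for c in code.co_consts if isinstance(c, types.CodeType)]


outer_code = outer.__code__
inner_code = nested_code_objects(outer_code)[0]

# The generator expression sits two lines below the `def outer` line.
line_gap = inner_code.co_firstlineno - outer_code.co_firstlineno
inner_name = inner_code.co_name
```

Anything keyed on the inner `co_firstlineno` would therefore disagree with a location derived from the toplevel function, which is why anchoring on the toplevel location is the simpler choice.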

Test Plan:
test_package.py -k test_code_with_generator

Rollback Plan:

Differential Revision: D81929751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162389
Approved by: https://github.com/dolpm, https://github.com/hrithick-codes
2025-09-09 00:13:54 +00:00
c0fc86b511 Fix aarch64 wheel pack (#159481)
PR that introduced the change: https://github.com/pytorch/builder/pull/1775
Use wheel pack instead of zip to repack the wheel.
It should regenerate the RECORD file and update all the hashes correctly.

TODO:
Apply wheel pack instead of zip to the rest of the builds
Add a validation test to make sure the wheel contents match the RECORD file
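
A validation along the lines of the second TODO could look like the sketch below. It assumes the standard wheel RECORD line format (`path,sha256=<urlsafe base64 digest without padding>,size`); the helper names and the tiny in-memory `demo` wheel are hypothetical, and this is not the test the TODO refers to.

```python
import base64
import hashlib
import io
import zipfile

RECORD = "demo-1.0.dist-info/RECORD"


def record_entry(path: str, data: bytes) -> str:
    """Build the RECORD line for one file."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


def build_wheel(files: dict) -> bytes:
    """Create a minimal in-memory wheel-like zip with a correct RECORD."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        lines = []
        for path, data in files.items():
            zf.writestr(path, data)
            lines.append(record_entry(path, data))
        lines.append(f"{RECORD},,")  # RECORD lists itself without a hash
        zf.writestr(RECORD, "\n".join(lines) + "\n")
    return buf.getvalue()


def rezip_with_change(wheel: bytes, path: str, new_data: bytes) -> bytes:
    """Simulate a plain zip repack that edits a file but keeps the old RECORD."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(wheel)) as src, zipfile.ZipFile(out, "w") as dst:
        for name in src.namelist():
            dst.writestr(name, new_data if name == path else src.read(name))
    return out.getvalue()


def mismatched_files(wheel: bytes) -> list:
    """List archive members whose hash or size disagrees with RECORD."""
    with zipfile.ZipFile(io.BytesIO(wheel)) as zf:
        recorded = set(zf.read(RECORD).decode("utf-8").splitlines())
        return [
            name
            for name in zf.namelist()
            if name != RECORD and record_entry(name, zf.read(name)) not in recorded
        ]


good = build_wheel({"demo/__init__.py": b"x = 1\n", "demo/lib.so": b"\x00\x01"})
stale = rezip_with_change(good, "demo/__init__.py", b"x = 2\n")
good_report = mismatched_files(good)
stale_report = mismatched_files(stale)
```

A repack that regenerates RECORD would make the second report empty again.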

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159481
Approved by: https://github.com/malfet
2025-09-08 23:36:50 +00:00
07f07309c6 [associative_scan] Autograd separated (#139939)
This PR implements the Autograd feature of the associative_scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139939
Approved by: https://github.com/huydhn
2025-09-08 23:30:11 +00:00
189a054cfb Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. [attempt2] (#160869)
[relanding again after fixing internal build]
Summary:
This might cause some new DDEs at call sites that do not use is_contiguous_or_false() or sym_is_contiguous().
We want to find those call sites and handle them properly by explicitly calling is_contiguous_or_false() instead of is_contiguous() when appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context:

In https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous to be the function that computes contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE.

when people call is_contiguous we do sym_is_contiguous().guard_bool()
when people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false()
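
A toy model of the two behaviours, in plain Python. `ToySymBool` and `DataDependentError` are made-up stand-ins, not the real c10 or torch classes; `None` models a value that cannot be decided at trace time.

```python
from typing import Optional


class DataDependentError(RuntimeError):
    """Toy stand-in for the error raised on a data-dependent guard."""


class ToySymBool:
    """Toy symbolic bool; `value` is None when it is unknown at trace time."""

    def __init__(self, value: Optional[bool]):
        self.value = value

    def guard_bool(self) -> bool:
        # Must produce a concrete answer, so an unknown value is an error.
        if self.value is None:
            raise DataDependentError("cannot decide contiguity")
        return self.value

    def guard_or_false(self) -> bool:
        # Falls back to False instead of raising when the value is unknown.
        return False if self.value is None else self.value


def is_contiguous(sym: ToySymBool) -> bool:
    return sym.guard_bool()


def is_contiguous_or_false(sym: ToySymBool) -> bool:
    return sym.guard_or_false()
```

In this model, only `is_contiguous` can raise on an unknown value, which is why call sites that can tolerate a conservative `False` should prefer `is_contiguous_or_false`.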

one issue not handled well was this path
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }

  return sym_is_contiguous_default(memory_format);
}
```
Namely, if we call sym_is_contiguous_custom and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, we used to call is_contiguous(this, memory_format);

This used to go through load_pyobj_interpreter and end up calling the Python is_contiguous, which used implicit size-oblivious reasoning.
Once we removed that implicit size-oblivious reasoning, the right thing to do is to call
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get a DDE even if the caller is using sym_is_contiguous.

so I had to define it for pyinterpreter, and then I had to override it for nested tensors.

Approved by: https://github.com/ezyang

Test Plan:
contbuild & OSS CI, see e444cd24d4

Rollback Plan:

Differential Revision: D80435179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869
Approved by: https://github.com/ezyang
2025-09-08 22:59:13 +00:00
5fd6b6a2db [refactor] add helper sizevars function, is_size_one, for size==1 checks (#162189)
## Summary
- document guard behavior in `SizeVarAllocator.is_size_one`
- use `is_size_one` for broadcast/expand checks.
- This diff is a no-op since we'd use `shape_env.evaluate_expr(... fallback_value=False)`

a4f9132a17/torch/_inductor/sizevars.py (L450-L453)

------
https://chatgpt.com/codex/tasks/task_e_68b8d0d1f2c48328b2d38c00e738bc8c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162189
Approved by: https://github.com/laithsakka
2025-09-08 22:48:16 +00:00
ac9ccd0dc2 Add return-max-scores to flex-attention (#161667)
# Summary

### Update

API

```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False

class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None

out_only = flex_attention(query, key, value, score_mod)
out_max, aux_max = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(max_scores=True),
)
out_both, aux_both = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(lse=True, max_scores=True),
)
```

Returns the max post mod scores from flex attention.

Not being able to break BC is kind of annoying here, since we end up with a combinatorial problem: if we need to add any more return values, we need new kwargs that gate whether they get returned by the function, and we need to support the 2**N possible return groups.

Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I made the new argument kwarg-only for now.

Maybe we make an `ExtraReturns` type kwarg that can grow, so we don't need to keep adding new top-level args.

We could also return a struct that holds all the extra tensors and start a deprecation cycle for logsumexp, eventually returning just one `ExtraReturns`-like struct with the tensors.

### Req Grad
I currently don't return a max_scores that supports backpropagating grads. I think this might be feasible, but since max is essentially one-hot on the inputs plus a reduction, we would either need to save another `max_location` from the forward or find the max_score but only apply the grad to the first occurrence if there are multiple equal scores (need to check if that's what we define for the vanilla max op in torch).

For now no grad, we can re-visit if needed.
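
For reference, a plain-Python sketch of the "one-hot on the inputs" idea. Routing the gradient to the first occurrence on ties is an assumption made for illustration; as noted above, what the vanilla max op in torch does still needs checking. The helper `max_with_grad` is hypothetical.

```python
def max_with_grad(scores, upstream=1.0):
    """Return (max, grad), sending `upstream` to the first maximal entry.

    `scores` must be a non-empty list of numbers.
    """
    best = 0
    for i, s in enumerate(scores):
        if s > scores[best]:  # strict '>' keeps the first occurrence on ties
            best = i
    grad = [0.0] * len(scores)
    grad[best] = upstream
    return scores[best], grad


print(max_with_grad([1.0, 3.0, 3.0, 2.0]))
```

Here `best` plays the role of the `max_location` that would have to be saved from the forward.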

## Perf
I am going to disable this for flex_decode, since at least initially the motivation is training. It is also harder than it should be to have ops return Nones or optional tensors. If return max is false, we should probably just create a tensor of size zero so that we don't slow down the hot path.

```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,     ┆ 249.514658    ┆ 243.078974   ┆ 6.435684  ┆ 2.647569  │
│                ┆                ┆ 2048, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 57.971274     ┆ 56.633641    ┆ 1.337633  ┆ 2.361905  │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 280.71254     ┆ 275.686991   ┆ 5.025549  ┆ 1.822918  │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,    ┆ 152.970031    ┆ 150.489109   ┆ 2.480923  ┆ 1.648573  │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘

🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)  ┆ TFlops (base) ┆ TFlops (max) ┆ delta    ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                    ┆ ---           ┆ ---          ┆ ---      ┆ ---       │
│ str            ┆ str            ┆ str                    ┆ f64           ┆ f64          ┆ f64      ┆ f64       │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,      ┆ 249.514658    ┆ 243.078974   ┆ 6.435684 ┆ 2.647569  │
│                ┆                ┆ 2048, 64)              ┆               ┆              ┆          ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 57.971274     ┆ 56.633641    ┆ 1.337633 ┆ 2.361905  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 280.71254     ┆ 275.686991   ┆ 5.025549 ┆ 1.822918  │
│                ┆                ┆ 1024, 128)             ┆               ┆              ┆          ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,     ┆ 152.970031    ┆ 150.489109   ┆ 2.480923 ┆ 1.648573  │
│                ┆                ┆ 16384, 64)             ┆               ┆              ┆          ┆           │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,      ┆ 161.031318    ┆ 158.597808   ┆ 2.43351  ┆ 1.534391  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘

🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4,      ┆ 175.546923    ┆ 177.81205    ┆ -2.265127 ┆ -1.273888 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4,     ┆ 156.282597    ┆ 158.209134   ┆ -1.926537 ┆ -1.217715 │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16,     ┆ 232.542929    ┆ 235.140136   ┆ -2.597207 ┆ -1.104536 │
│                ┆                ┆ 2048, 128)            ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 169.652791    ┆ 171.475986   ┆ -1.823195 ┆ -1.063236 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2025-09-08 22:44:48 +00:00
711c8c821e shape guards (#161178)
Summary: This PR introduces shape guards to export. Previously only value ranges, equalities, and specializations were tracked for symbolic expressions, and we had a forward hook to check them. Now we instead create a function to check shape guards and call it in the exported program.

Test Plan:
updated several tests

Rollback Plan:

Differential Revision: D80713603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161178
Approved by: https://github.com/tugsbayasgalan
2025-09-08 22:44:09 +00:00
2c538c9acf rewrite __maybe_broadcast should_expand check for unbacked (#162109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162109
Approved by: https://github.com/aorenste
ghstack dependencies: #162084, #162099
2025-09-08 22:41:18 +00:00
85fe94e933 make should_swap more dde friendly (#162099)
Unblock customers for common cases with DDE until @pianpwk lands the change to should_swap in https://github.com/pytorch/pytorch/pull/160473.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162099
Approved by: https://github.com/aorenste
ghstack dependencies: #162084
2025-09-08 22:41:18 +00:00
fecd9686f5 Graph split event tracker (#159795)
Summary:
A tool to track events in graph split, specifically how nodes end up in acc or cpu subgraphs.

Usage: use env var to specify a mode and necessary arguments.

FX_NET_ACC_SPLITTER_TRACKER_MODE: Tracker mode.
```
Different modes of the event tracker:
"0": Tracker not enabled (by default)
"1": Tracker enabled but no dumps. Information available by setting breakpoints and visually inspecting in pdb.
"2": Tracker enabled and dumps all events to DUMP_PREFIX_all.txt
"3": In addition to events dump, track nodes specified by ENV_FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES recursively and dump to DUMP_PREFIX_nodex.txt
"4": In addition to events dump, track all nodes with more than 1 event recursively and dump to DUMP_PREFIX_nodex.txt
```
FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH: overriding dump path. Leave empty for `~`.
FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES: Nodes to track for mode "3".
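
A small sketch of how these variables might be passed to a child process. The helper `tracker_env` is hypothetical and only builds an environment dict. The expected format of the tracked-nodes value is not stated here, so it is passed through unchanged.

```python
import os

VALID_MODES = {"0", "1", "2", "3", "4"}


def tracker_env(mode, dump_path="", tracked_nodes=""):
    """Return a copy of os.environ with the tracker variables set."""
    if mode not in VALID_MODES:
        raise ValueError("unknown tracker mode: %r" % (mode,))
    env = dict(os.environ)
    env["FX_NET_ACC_SPLITTER_TRACKER_MODE"] = mode
    if dump_path:
        env["FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH"] = dump_path
    if tracked_nodes:
        env["FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES"] = tracked_nodes
    return env


env = tracker_env("2", dump_path="/tmp/split_dump")
```

The result can be handed to `subprocess.run(..., env=env)` when launching the script that performs the split.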

Test Plan: New unit test

Reviewed By: georgiaphillips

Differential Revision: D79203595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159795
Approved by: https://github.com/ezyang
2025-09-08 21:30:17 +00:00
dd44faa9d9 Revert "Modify ROCm MI2xx-based workflows to run on cron schedule (#162103)"
This reverts commit 0af70e2353e1dcda83175fd4834ecb7b63e009e0.

Reverted https://github.com/pytorch/pytorch/pull/162103 on behalf of https://github.com/jithunnair-amd due to Cirrascale network outage resolved. Reverting back to running per commit to aid in triage and CI health ([comment](https://github.com/pytorch/pytorch/pull/162103#issuecomment-3267977825))
2025-09-08 20:53:05 +00:00
5d819f3faf Revert "[associative_scan] Autograd separated (#139939)"
This reverts commit 103f725afa8dbf0204a1be6a042ab93aa16d85d8.

Reverted https://github.com/pytorch/pytorch/pull/139939 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing a weird failure after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/139939#issuecomment-3267945657))
2025-09-08 20:42:47 +00:00
015423bef8 Add fp16-overflow regression test (#162401)
Discovered while debugging https://github.com/pytorch/pytorch/issues/160841, where sdpa returned NaNs because intermediate values were cast back to fp16 before normalization during the computation (fixed by https://github.com/pytorch/pytorch/pull/161999).
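
A standard-library sketch of this failure class; it is not a reproduction of the actual kernel. Python floats stand in for the higher-precision accumulator, and `to_fp16` and `attend` are made-up helpers. The unnormalized sum overflows to inf when it is cast to fp16 before the division, and such an inf can turn into NaN in later arithmetic (for example inf - inf).

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round `x` to IEEE half precision; out-of-range values become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def attend(scores, values, cast_before_normalizing):
    """Softmax-weighted sum of `values`, accumulated in Python floats."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    numerator = sum(w * v for w, v in zip(weights, values))
    denominator = sum(weights)
    if cast_before_normalizing:
        numerator = to_fp16(numerator)  # the unnormalized sum may exceed 65504
    return to_fp16(numerator / denominator)


scores = [0.0, 0.0, 0.0, 0.0]
values = [60000.0] * 4  # each value fits in fp16, but their sum does not
early = attend(scores, values, cast_before_normalizing=True)
late = attend(scores, values, cast_before_normalizing=False)
print(early, late)
```
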
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162401
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-09-08 20:33:23 +00:00
26a1b9cce2 [dynamo] fix resume_execution.py KeyError in Python 3.11+ (#162318)
Fixes https://github.com/pytorch/pytorch/issues/162313

Differential Revision: [D81938289](https://our.internmc.facebook.com/intern/diff/D81938289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162318
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/anijain2305
2025-09-08 20:26:24 +00:00
8f114650eb Add std::any_of to ConvParams struct (#162334)
Removes some for-loops that didn't short-circuit in favor of std::any_of.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162334
Approved by: https://github.com/Skylion007
2025-09-08 20:12:20 +00:00
ec2c1371af [BE]: Update cudnn frontend submodule to 1.14.1 (#162347)
Fixes a few bugs introduced in CUDNN 1.11 which affect all our CUDA13 builds. Also adds support for new CUDNN features whenever we choose to update. @eqy pretty sure this addresses the concern you had over the previous upgrade since that bugfix is now merged. This is a simple header only update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162347
Approved by: https://github.com/eqy, https://github.com/atalman
2025-09-08 20:03:23 +00:00
8ec01f34e9 [bucketing] custom_ops mode to hide inductor copies overhead (#161499)
Adding "_custom_ops" bucketing to temporarily fall back to eager execution of for_each,
to work around too many generated kernels on the inductor side.

This PR also reverts parts of bucketing changes for cycles detection that resulted in accuracy problems.

Differential Revision: [D81152293](https://our.internmc.facebook.com/intern/diff/D81152293)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161499
Approved by: https://github.com/eellison
2025-09-08 20:03:08 +00:00
9c991b63ff [CD] [aarch64] Add CUDA 12.6 and 12.8 to build matrix, remove 12.9 build (#162364)
https://github.com/pytorch/pytorch/issues/159779

Add the full CUDA support matrix to sbsa build (12.6, 12.8)
Same arch support as x86 build
Remove 12.9 sbsa build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162364
Approved by: https://github.com/atalman
2025-09-08 20:00:25 +00:00
4e50651c5f [DTensor] fix F.one_hot (#162307)
F.one_hot(dtensor) used to run into a mixed DTensor-Tensor operation due
to an arange call creating a new Tensor (not DTensor). This PR fixes it
by allowing implicit replication of Tensors for the arange call and the
one consumer of the arange call (the at::eq call).
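
A plain-Python sketch of the arange-plus-eq formulation described above, with lists standing in for tensors. It is not the ATen implementation. It shows where the freshly created `classes` value comes from: it is built inside the op rather than derived from the input, which is why it ends up as a plain Tensor next to a DTensor input.

```python
def arange(n):
    """Stand-in for an arange call: a brand-new value created inside the op."""
    return list(range(n))


def one_hot(indices, num_classes):
    """Compare every index against arange(num_classes), as the eq call does."""
    classes = arange(num_classes)
    return [[int(c == idx) for c in classes] for idx in indices]


print(one_hot([0, 2], 3))
```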

Test Plan:
- new test. Also, F.one_hot(num_classes=-1) is broken so we skip that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162307
Approved by: https://github.com/ezyang
ghstack dependencies: #162117
2025-09-08 19:37:08 +00:00
a0d026688c Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-08 19:10:36 +00:00
d80297a684 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-08 19:10:36 +00:00
fbcabb4fbd Handle f([]) vs. f() in fake tensor caching (#162284)
Fixes https://github.com/pytorch/pytorch/issues/162279
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162284
Approved by: https://github.com/manuelcandales, https://github.com/aorenste
2025-09-08 18:28:05 +00:00
314d47a210 [audio hash update] update the pinned audio hash (#162315)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162315
Approved by: https://github.com/pytorchbot
2025-09-08 18:26:33 +00:00
bc4176c92a CD Windows CUDA 13.0 build - fix packaging of cuda dlls (#162383)
Trying to fix https://github.com/pytorch/pytorch/issues/162333

The CUDA 13.0 file structure changed. Instead of keeping most of the dlls in the bin folder, they are now in ``bin\x64``, except for the cudnn dll. See the attached pictures:
<img width="511" height="361" alt="Screenshot 2025-09-08 at 9 46 26 AM" src="https://github.com/user-attachments/assets/d2e630ee-930f-4da6-9b81-f9ef48fde7ce" />
<img width="490" height="333" alt="Screenshot 2025-09-08 at 9 46 34 AM" src="https://github.com/user-attachments/assets/194cbf43-b6ef-4218-b516-db37b91302be" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162383
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/malfet
2025-09-08 17:57:22 +00:00
eqy
de5dc1f038 [cuDNN][SDPA][Nested Tensor] add forward/backward caching support for cuDNN SDPA Nested tensor/varlen (#161434)
Don't recompile every time

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161434
Approved by: https://github.com/drisspg
2025-09-08 17:51:13 +00:00
72e6717d00 Avoid crash with release_available_cached_blocks (#162269)
updated release behavior for cached blocks
Fixes #159567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162269
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-09-08 17:46:43 +00:00
ebd29a13fe [inductor] fuse for scalar shared data (#162311)
LOAF previously could skip these fusion opportunities and cause some tests to fail.

Test:
- TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162311
Approved by: https://github.com/jansel
2025-09-08 17:20:46 +00:00
5793dd7875 [Intel GPU] Integrate OneDNN SDPA training forward and backward (#161058)
This PR is the first split PR of https://github.com/pytorch/pytorch/pull/156272, only contains the OneDNN code. Please help review.

Pending on OneDNN v3.9 commit update. Don't merge.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161058
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-09-08 17:07:31 +00:00
49c446c617 Add C++ function for torch.distributed.tensor._op_schema.is_view_op (#161595)
This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better.

Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161595
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
ghstack dependencies: #161466, #161586, #161590, #161591
2025-09-08 16:28:08 +00:00
8e076d889c Don't call check_has_torch_dispatch in THPVariable_NewWithVar if we already know (#161591)
We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap.

Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161591
Approved by: https://github.com/ezyang
ghstack dependencies: #161466, #161586, #161590
2025-09-08 16:28:08 +00:00
f044fa2902 [AsyncTP] Use assertEqual instead of allClose for bf16 tests (#162041)
The async tp result and regular MM result are very close. If we adjust the allclose threshold, the test succeeds. This seems to indicate that the error is from numerical error of low precision.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162041
Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel
ghstack dependencies: #162040
2025-09-08 16:12:52 +00:00
a92773eeb1 Revert "Use vectorized stores for all dtypes in cat (#161649)"
This reverts commit 377033757ae5ca524ea842f1b0a5f446ed3d8fe0.

Reverted https://github.com/pytorch/pytorch/pull/161649 on behalf of https://github.com/ngimel due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/161649#issuecomment-3266963044))
2025-09-08 15:58:58 +00:00
53297f6ad0 Revert "[audio hash update] update the pinned audio hash (#162315)"
This reverts commit c9ac8c25ef9ad020542898ab569910a9d0cd1f7e.

Reverted https://github.com/pytorch/pytorch/pull/162315 on behalf of https://github.com/jeanschmidt due to Reverting in order to see if this introduced the failure https://github.com/pytorch/pytorch/actions/runs/17539536914/job/49810513700 ([comment](https://github.com/pytorch/pytorch/pull/162315#issuecomment-3266932718))
2025-09-08 15:52:30 +00:00
25c170b72e [inductor] Runtime estimations: use nccl estimator; mm only benchmark mode (#161405)
During comms reordering and sink wait iterative, we observed that the previous runtime estimations were pretty off for collectives and mms.

Adding optional usage of:
- c10d.time_estimator for collectives, which is based on NCCL estimator

- Benchmark mode only for matmuls, as they are highly dependent on the mm backend

- The logic mostly copied from Ruisi's PRs for inductor simple_fsdp https://github.com/pytorch/pytorch/pull/157572

These estimation corrections are in the default `BaseSchedulerNode.estimate_runtime()`

Differential Revision: [D81152294](https://our.internmc.facebook.com/intern/diff/D81152294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161405
Approved by: https://github.com/eellison
2025-09-08 14:33:19 +00:00
3f5993316e [upstream triton] update triton pin to triton 3.5 (#162278)
Update PyTorch to the latest Triton release candidate branch (release/3.5.x in triton-lang/triton)

Notably:
* this does *not* include the version number bump from 3.4 -> 3.5 (we'll do that in a follow-up PR)
* sam_fast is still failing, so we've disabled it temporarily https://github.com/pytorch/pytorch/issues/162282 and we are committed to fixing it, ideally before the branch cut but possibly as a cherry-pick into the release branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162278
Approved by: https://github.com/atalman
ghstack dependencies: #162244, #162309
2025-09-08 14:29:24 +00:00
e101411b9f Update slow tests (#161395)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161395
Approved by: https://github.com/pytorchbot
2025-09-08 13:33:32 +00:00
32911ff541 [xla hash update] update the pinned xla hash (#162372)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162372
Approved by: https://github.com/pytorchbot
2025-09-08 11:31:16 +00:00
5b90e85112 [AsyncTP] Fixes AsyncMM (#162040)
The original implementation set beta to 1, which caused the out (C) to be added to the output. Thus if the output is not initialized to zero beforehand, the result can be incorrect.

Removing the alpha and beta fixes the issue.
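
To illustrate with plain lists (a sketch of the usual GEMM convention `alpha * (A @ B) + beta * C`, not the AsyncMM kernel): with beta = 1 whatever already sits in the output buffer leaks into the result. The helper `gemm` and the `garbage` buffer are made up for this example.

```python
def gemm(a, b, c, alpha=1.0, beta=0.0):
    """Return alpha * (a @ b) + beta * c for matrices given as lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [
            alpha * sum(a[i][k] * b[k][j] for k in range(inner)) + beta * c[i][j]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so a @ b == a
garbage = [[7.0, 7.0], [7.0, 7.0]]  # stands in for an uninitialized output

with_beta_one = gemm(a, b, garbage, beta=1.0)
with_beta_zero = gemm(a, b, garbage, beta=0.0)
print(with_beta_one, with_beta_zero)
```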

Thanks to @ngimel for figuring out the root cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162040
Approved by: https://github.com/danielvegamyhre
2025-09-08 10:53:59 +00:00
31d5c67539 [inductor][triton] support static cuda launcher after triton # 7866 (#162309)
Fixes static cuda launcher after https://github.com/triton-lang/triton/pull/7866.

Static cuda launcher checks to make sure that no hook knobs are set (and if they are, it throws an error). But Triton has changed the semantics of hooks so that "empty hooks" are now represented by empty `HookChain`s instead of being represented by `None`. This PR changes the way we define "empty hooks" to account for HookChains.
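
A sketch of the kind of check this implies. The `HookChain` class below is a hypothetical stand-in with a `calls` list, not Triton's actual class, and `hook_is_empty` is a made-up helper name.

```python
class HookChain:
    """Hypothetical stand-in for a hook container; not Triton's class."""

    def __init__(self, calls=None):
        self.calls = list(calls or [])


def hook_is_empty(hook):
    """Treat both None and an empty chain as 'no hook set'."""
    if hook is None:
        return True
    if isinstance(hook, HookChain):
        return len(hook.calls) == 0
    return False
```

A check that only tests `hook is None` would report an empty chain as a configured hook and raise needlessly.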

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162309
Approved by: https://github.com/aakhundov
ghstack dependencies: #162244
2025-09-08 07:57:48 +00:00
fb0afa853e [inductor][triton] more JITCallable._hash_lock support (#162244)
Follow-up to #161768.

Context: ProcessPool pickles the outputs before sending them back to the main process. Triton kernels have some un-pickleable fields, so `prepare_for_pickle()` is used to strip out those fields. Previously, in the standard case (without triton_bundler.py), `prepare_for_pickle()` would strip out the un-pickleable fields and they would never be added back after unpickling, because the un-pickleable fields were not actually needed after compilation finished.

#161768 updated `prepare_for_pickle` to also strip out the `fn._hash_lock` field, a newly added field in JITCallable instances that holds a `threading.RLock()`, which is not pickleable.

It turns out that we do need to restore the `fn._hash_lock` field, even in the non-triton_bundler case - the MultiKernel case uses the hash lock.

To do this, we add `restore_after_unpickle()` which will restore fields (or if the old fields are not provided, initialize just the hash_lock)
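
A simplified standard-library sketch of the strip-and-restore pattern. The signatures below are illustrative and not necessarily the ones in this PR, and a `SimpleNamespace` stands in for the kernel function object.

```python
import pickle
import threading
from types import SimpleNamespace


def prepare_for_pickle(fn):
    """Strip the lock so `fn` can be pickled; return the stripped lock."""
    old = fn._hash_lock
    fn._hash_lock = None
    return old


def restore_after_unpickle(fn, old=None):
    """Put the old lock back, or create a fresh one if none is provided."""
    fn._hash_lock = old if old is not None else threading.RLock()


# Stand-in for a kernel function object that carries an RLock.
fn = SimpleNamespace(name="kernel", _hash_lock=threading.RLock())

old_lock = prepare_for_pickle(fn)
clone = pickle.loads(pickle.dumps(fn))  # works only because the lock is gone
restore_after_unpickle(fn, old_lock)    # original side: reuse the old lock
restore_after_unpickle(clone)           # receiving side: initialize a new lock
```

Without the restore step on the receiving side, `clone._hash_lock` would stay `None` and any code that later acquires the lock would fail.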

Compile time benchmarks look good, maybe a very minor regression (see the comment below on the PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162244
Approved by: https://github.com/atalman
2025-09-08 07:57:48 +00:00
1e0656f063 Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit de893e96c775023aa3be895060848fac3296772c.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002))
2025-09-08 07:04:36 +00:00
29e09a6545 Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit 01edcd4df8bf0c7b4cc2d3ec868bd2059eeea83b.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002))
2025-09-08 07:04:36 +00:00
c9ac8c25ef [audio hash update] update the pinned audio hash (#162315)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162315
Approved by: https://github.com/pytorchbot
2025-09-08 04:17:23 +00:00
103f725afa [associative_scan] Autograd separated (#139939)
This PR implements the Autograd feature of the associative_scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139939
Approved by: https://github.com/ydwu4
2025-09-08 03:21:17 +00:00
5babb4d5c0 Add BundledAOTAutogradSerializableCallable (#162170)
This PR hooks up the python wrapper inductor backend to aot_compile. This is *not* the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now.

In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162170
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162169
2025-09-07 23:37:31 +00:00
eb9073a6b7 [easy] [precompile] Convert CompileArtifacts to callable (#162169)
The goal of this PR stack is to be able to implement `aot_compile_module`, which AOT precompiles a torch.nn.Module.
Step 1 is a simple refactor to make CompileArtifacts itself the callable, which makes it easier to use directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162169
Approved by: https://github.com/zhxchen17
2025-09-07 23:37:31 +00:00
ec2e3687c7 [while_loop][autograd] support autograd_key of while_loop (#160483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160483
Approved by: https://github.com/zou3519
2025-09-07 21:55:29 +00:00
ff2de5d522 Revert "[2/N]Port several test files under test/distributed to Intel GPU (#159473)"
This reverts commit 040d00af048967dde7938d358d7f5988cbd18388.

Reverted https://github.com/pytorch/pytorch/pull/159473 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal signals, @d4l3k please help the author to have this change landed. [D81718444](https://www.internalfb.com/diff/D81718444) ([comment](https://github.com/pytorch/pytorch/pull/159473#issuecomment-3264046983))
2025-09-07 21:06:38 +00:00
8235c4f65d Revert "[ROCm] Enabling several UTs (#161715)"
This reverts commit b9ba612f7a968f7b27e121ca8f4d0a4d954f5354.

Reverted https://github.com/pytorch/pytorch/pull/161715 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/159473, feel free to merge it back once conflicts are cleared ([comment](https://github.com/pytorch/pytorch/pull/161715#issuecomment-3264040604))
2025-09-07 21:03:17 +00:00
e246a85b76 Revert "[1/N] Port 5 _composable/fsdp distributed test cases to Intel GPU (#159118)"
This reverts commit 5c473e9f5ee0ef0fc38e6cf34a95b547f8cdc8d5.

Reverted https://github.com/pytorch/pytorch/pull/159118 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/159473 ([comment](https://github.com/pytorch/pytorch/pull/159118#issuecomment-3264037799))
2025-09-07 21:00:29 +00:00
df59c21768 Revert "[BE] Cleanup stale comments/copy from gemm (#162001)"
This reverts commit 6087ef41e54c2494b117ffd923faf20f515a6806.

Reverted https://github.com/pytorch/pytorch/pull/162001 on behalf of https://github.com/jeanschmidt due to breaks internal ads signal, see [D81845017](https://www.internalfb.com/diff/D81845017) ([comment](https://github.com/pytorch/pytorch/pull/162001#issuecomment-3264034312))
2025-09-07 20:53:16 +00:00
093ab5f477 Revert "[inductor] add kernel template choice (ktc) (#161347)"
This reverts commit 9a8d454c464c0b811fc4586ff104424bccf1da0c.

Reverted https://github.com/pytorch/pytorch/pull/161347 on behalf of https://github.com/jeanschmidt due to Seems to have broken internal builds, see [D81520569](https://www.internalfb.com/diff/D81520569) ([comment](https://github.com/pytorch/pytorch/pull/161347#issuecomment-3264027436))
2025-09-07 20:39:39 +00:00
4348db0b92 Revert "[inductor][ez] V.choices.get_mm_configs returns list of ChoiceCallers (#161348)"
This reverts commit c32111149921b48bfef909293f1049e21619ed76.

Reverted https://github.com/pytorch/pytorch/pull/161348 on behalf of https://github.com/jeanschmidt due to Seems to have broken internal builds, see [D81520569](https://www.internalfb.com/diff/D81520569) ([comment](https://github.com/pytorch/pytorch/pull/161347#issuecomment-3264027436))
2025-09-07 20:39:39 +00:00
9ad5e8edb1 Improve typing of ONNX decorators with ParamSpec (#162332)
## Summary
This PR improves typing in ONNX-related modules by replacing TypeVar bound to Callable[..., Any] with ParamSpec to preserve parameter types and avoid type erasure in decorator functions.

## Changes
- `torch/onnx/_internal/exporter/_flags.py`: Replace TCallable TypeVar with ParamSpec
- `torch/onnx/ops/_impl.py`: Replace _T TypeVar with ParamSpec for _onnx_op decorator
- `torch/onnx/_internal/exporter/_torchlib/_torchlib_registry.py`: Replace _T TypeVar with ParamSpec

## Motivation
The previous implementation used TypeVar bound to Callable which erased parameter type information to Any. ParamSpec preserves the exact parameter types and return types, providing better type safety and IDE support.

## Testing
- Verified all changes compile and import correctly
- Created comprehensive test suite to validate ParamSpec functionality
- No linting errors introduced
- Maintains backward compatibility

Fixes #142306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162332
Approved by: https://github.com/Skylion007
2025-09-07 18:06:03 +00:00
7a83cf430e Revert " [while_loop][autograd] support autograd_key of while_loop (#160483)"
This reverts commit 2b8a83901c58a0858ea9e4ce00055f48e6ed164c.

Reverted https://github.com/pytorch/pytorch/pull/160483 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but some trunk tests are failing either from this PR or the previous one in the stack ([comment](https://github.com/pytorch/pytorch/pull/160483#issuecomment-3263597325))
2025-09-07 08:50:49 +00:00
ada43ed39c Revert "[inductor] pdl inductor option (disabled by default) (#160928)"
This reverts commit 9458d1ac3bd70c2af316a8ba95d2c6c9c1199c9c.

Reverted https://github.com/pytorch/pytorch/pull/160928 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160928#issuecomment-3263560378))
2025-09-07 07:37:37 +00:00
93fb23d6fa Build vLLM nightly wheels (#162000)
This uses the same approach as building triton wheel where we publish a nightly wheel for vLLM whenever its pinned commit is updated.  The key change is to use `pytorch/manylinux2_28-builder` as the base image to build vLLM, so there are a couple of changes on the vLLM Dockerfile used by lumen_cli

1. `pytorch/manylinux2_28-builder` is RedHat instead of Debian-based, so no apt-get
2. Fix a bug in `.github/actions/build-external-packages/action.yml` where `CUDA_VERSION` is not set correctly, preventing CUDA 12.9 build
3. Fix a bug in `.github/actions/build-external-packages/action.yml` where `TORCH_WHEELS_PATH` is not set correctly and always defaulted to `dist`
4. In vLLM Dockerfile, use the correct index for the selected CUDA version, i.e. https://download.pytorch.org/whl/nightly/cu12[89] for CUDA 12.[89]
5. Install torch, vision, audio in one command. Unlike the CI image `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm`, `pytorch/manylinux2_28-builder` doesn't have any torch dependencies preinstalled
6. Bump xformers version to 0.0.32.post2 now that PyTorch 2.8.0 has been landed on vLLM

We need to prepare three wheels: vLLM, xformers, and flashinfer-python. I rename them following the PyTorch nightly convention `MAJOR.MINOR.PATCH.devYYYYMMDD` so that vLLM nightlies work with torch nightlies from the same date.

### Usage

* Install latest nightlies
```
pip install --pre torch torchvision torchaudio vllm xformers flashinfer_python \
  --index-url https://download.pytorch.org/whl/nightly/cu129
```

* Install a specific version
```
pip install --pre torch==2.9.0.dev20250903 torchvision torchaudio \
  vllm==1.0.0.dev20250903 \
  xformers==0.0.33.dev20250903 \
  flashinfer_python==0.2.14.dev20250903 \
  --index-url https://download.pytorch.org/whl/nightly/cu129
```
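
The `MAJOR.MINOR.PATCH.devYYYYMMDD` naming convention can be sketched in a few lines of Python. This is only an illustration of how the pinned requirement strings line up on one date; `nightly_version` and `pinned_requirements` are hypothetical helpers, not part of the build scripts.

```python
from datetime import date


def nightly_version(base: str, day: date) -> str:
    """Append the nightly suffix, e.g. 2.9.0 -> 2.9.0.dev20250903."""
    return f"{base}.dev{day:%Y%m%d}"


def pinned_requirements(bases: dict[str, str], day: date) -> list[str]:
    """Build pip requirement strings that all share the same nightly date."""
    return [f"{name}=={nightly_version(base, day)}" for name, base in bases.items()]


# Base versions copied from the usage example; the date is the shared key.
requirements = pinned_requirements(
    {"torch": "2.9.0", "vllm": "1.0.0", "xformers": "0.0.33"},
    date(2025, 9, 3),
)
```

Because every package carries the same date suffix, pinning one date selects a mutually compatible set of wheels.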
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162000
Approved by: https://github.com/atalman
2025-09-07 06:09:17 +00:00
104f2680e0 Revert "Add return-max-scores to flex-attention (#161667)"
This reverts commit 486b20b73cfcf32a773a4301b1b97f91c157ce76.

Reverted https://github.com/pytorch/pytorch/pull/161667 on behalf of https://github.com/huydhn due to Sorry for reverting your change but reverting https://github.com/pytorch/pytorch/pull/161730 does not seem to fix all trunk failures ([comment](https://github.com/pytorch/pytorch/pull/161667#issuecomment-3263512642))
2025-09-07 06:00:55 +00:00
eac3d6f04c Revert "[inductor] fuse for scalar shared data (#162311)"
This reverts commit 2a45837e98c63cae9d1a2e2133a727b829e549d5.

Reverted https://github.com/pytorch/pytorch/pull/162311 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is breaking lint ([comment](https://github.com/pytorch/pytorch/pull/162311#issuecomment-3263511162))
2025-09-07 05:57:43 +00:00
fea20775ad [vllm hash update] update the pinned vllm hash (#162314)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162314
Approved by: https://github.com/pytorchbot
2025-09-07 04:29:23 +00:00
2a45837e98 [inductor] fuse for scalar shared data (#162311)
LOAF could previously skip these fusion opportunities, causing some tests to fail.

Test:
- TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162311
Approved by: https://github.com/jansel
ghstack dependencies: #162028, #162221, #162303
2025-09-07 01:48:45 +00:00
b919560c4a [nativert] AOTI lowering and packaging as NativeRT delegate (#162285)
Summary:
A demo for creating AOTI delegate for NativeRT in OSS.

- It supports full graph lowering only.
- It leverages `executorch_call_delegate` HOP but doesn't rely on `executorch`.
- The delegate graph is obtained by tracing a `LoweredBackendModule` whose forward function calls `executorch_call_delegate`.
- The main difference between `executorch_call_delegate` and `aoti_call_delegate` is that the delegate graph from `executorch_call_delegate` doesn't have weights lifted as inputs.
- original_ep and delegate_ep are treated as flat EP dictionary and there is no nested structure.
- The naming contract is enforced by `model_name` and `backend_id`

Test Plan:
CI

Rollback Plan:

Differential Revision: D81641157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162285
Approved by: https://github.com/dolpm
2025-09-07 01:29:54 +00:00
e3068cdb44 [dynamo] Use relaxed CLOSURE_MATCH guard then ID_MATCH (#162247)
I am unable to write a test that would fail here. The reason is that when we do _dynamo.disable(fn) in the compiled frame, the id of disabled function changes but currently we guard on the original function - `fn` whose id is not changing. This PR still guards on the `fn.__code__` just to be more precise.
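
A plain-Python illustration of why a code-object guard is more relaxed than an identity guard: two function objects can have different ids while sharing the same `__code__`. This is not Dynamo's guard implementation; `same_identity` and `same_code` are hypothetical stand-ins for an ID_MATCH-style and a CLOSURE_MATCH-style check.

```python
def make_adder(n):
    """Return a new closure each call; all closures share one code object."""

    def add(x):
        return x + n

    return add


add_one = make_adder(1)
add_two = make_adder(2)


def same_identity(a, b) -> bool:
    # Identity-style check: passes only for the very same function object.
    return a is b


def same_code(a, b) -> bool:
    # Code-style check: passes for any function built from the same code object.
    return a.__code__ is b.__code__
```

Here `same_identity(add_one, add_two)` is false while `same_code(add_one, add_two)` is true, so a guard on `__code__` survives the function object being recreated.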

Thanks to @thenumberouscode for pointing this out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162247
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-09-07 01:25:52 +00:00
5211f1f908 [export] Move example inputs in move_to_device_pass (#162301)
Summary:
If I have an EP that was exported on CPU and want to AOTI-compile it for CUDA, I need to use `move_to_device_pass`.

But in `torch._inductor.aoti_compile_and_package()`, it directly uses the `example_inputs` attached to the EP, so we should move the example inputs as well if applicable.

Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_move_device_example_inputs

Rollback Plan:

Differential Revision: D81812366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162301
Approved by: https://github.com/angelayi
2025-09-06 23:54:54 +00:00
2b8a83901c [while_loop][autograd] support autograd_key of while_loop (#160483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160483
Approved by: https://github.com/zou3519
ghstack dependencies: #160548, #160467
2025-09-06 21:26:33 +00:00
48e3be3ab6 [while_loop][autograd] add hop while_loop_stack_output (#160467)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160467
Approved by: https://github.com/zou3519
ghstack dependencies: #160548
2025-09-06 21:26:33 +00:00
5927a70934 NLLLoss: validate target is 0D when input is 1D (#161412)
Add a shape check in nll_loss_forward to error out when both input and target are 1D. Added a unit test to cover the incompatible 1D/1D case.

Fixes #157420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161412
Approved by: https://github.com/ngimel
2025-09-06 20:58:42 +00:00
1a588ace46 [inductor] rename deps during refreshing (#162303)
Skipping renaming causes wrong dependencies when mutations are involved.

Test:

CUDA_VISIBLE_DEVICES=4,5,6 TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/distributed/test_compute_comm_reordering.py TestComputeCommReorderingMultiProc.test_reorder_compute_for_overlap

Both the all-reduce and wait-tensor IR nodes contain a MutationBuffer for this test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162303
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162028, #162221
2025-09-06 20:38:28 +00:00
541aa23de5 [inductor] fix TemplateBuffer.extract_read_writes (#162221)
Make sure TemplateBuffer & ComputedBuffer have the same dependencies prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162221
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162028
2025-09-06 20:38:28 +00:00
047603d35b New export implementation with flat inp/out (#162167)
This is my first attempt at building a new export API. The main thing it addresses is correctly capturing input and output relations. Subsequent diffs will add functionality for dynamic shapes, nn_module_stack, etc.

Differential Revision: [D81793205](https://our.internmc.facebook.com/intern/diff/D81793205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162167
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
2025-09-06 20:03:52 +00:00
ae0edc133e [3/N] Enable 6 fsdp test on Intel GPU (#161601)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR is based on PR https://github.com/pytorch/pytorch/pull/158533 and https://github.com/pytorch/pytorch/pull/159473 and covers some test files under test/distributed/fsdp. We enable Intel GPU with the following methods and try our best to keep the original code style:

1. Add allow_xpu=True in instantiate_device_type_tests() if needed.
2. Use "torch.accelerator.current_accelerator()" to determine the accelerator backend.
3. Enable XPU for some test paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161601
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-06 16:47:13 +00:00
b6d0a9ea90 MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump (#162209)
## Summary
- We just landed 2d-2d support for mxfp8 grouped gemm in FBGEMM: https://github.com/pytorch/FBGEMM/pull/4816
- This is needed for backward pass of mxfp8 MoE training with grouped gemms
- Changes:
    - Add dispatching + input validation for mxfp8 grouped gemm in `torch._scaled_grouped_mm`
    - Add meta registration input validation for mxfp8 grouped gemm, for composability with compile
    - Add unit tests exercising torch._scaled_grouped_mm with mxfp8 inputs
    - Bump FBGEMM third party submodule to include:
          - https://github.com/pytorch/FBGEMM/pull/4816
          - https://github.com/pytorch/FBGEMM/pull/4820
          - https://github.com/pytorch/FBGEMM/pull/4821
          - https://github.com/pytorch/FBGEMM/pull/4823

#### How fbgemm dependency was bumped
Documenting this since I haven't found it documented elsewhere:
- `cd ~/pytorch/third_party/fbgemm`
- `git fetch`
- `git checkout <hash>`
- `cd ~/pytorch`
- `git add third_party/fbgemm`

## Test plan

#### Test build
```
USE_FBGEMM_GENAI=1 python -m pip install --no-build-isolation -v -e .
...
Successfully installed torch-2.9.0a0+gitf5070f3
```
[full build log](https://www.internalfb.com/phabricator/paste/view/P1933787581)

#### Unit tests
```
pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm_
...

test/test_matmul_cuda.py .........                                                                                                                        [100%]

============================================================== 9 passed, 1668 deselected in 5.34s ===============================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162209
Approved by: https://github.com/ngimel
2025-09-06 15:25:30 +00:00
eqy
5985e28912 [CUDA 13][cuDNN][Windows] Roll back cuDNN upgrade from 9.13 to 9.12 on Windows (#162322)
Forward fix for #162268

CC @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162322
Approved by: https://github.com/atalman, https://github.com/nWEIdia
2025-09-06 13:32:07 +00:00
9aedb3cd87 [AOTI-FX] Support registering custom FX backends (#162317)
# Feature
Currently, `torch._inductor.compile_aot` always uses the `WrapperFxCodegen` class. In contrast, Python and C++ codegen allow users to register custom backends. This PR brings that feature to FX codegen.

# Test plan
Added a CI test registering a custom FX backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162317
Approved by: https://github.com/jansel
2025-09-06 07:32:03 +00:00
0ff8eabf13 Revert "[dynamo] Graph break on on user-defined class in compiled region (#161670)"
This reverts commit 146371483318e17929daefd37c8e459d9d6d47bb.

Reverted https://github.com/pytorch/pytorch/pull/161670 on behalf of https://github.com/jeanschmidt due to seems to have introduced https://github.com/pytorch/pytorch/actions/runs/17507127561/job/49733379267 and https://github.com/pytorch/pytorch/actions/runs/17507127561/job/49733379271 ([comment](https://github.com/pytorch/pytorch/pull/161670#issuecomment-3261241229))
2025-09-06 06:18:57 +00:00
28f4ab0737 Add -Wno-ctad-maybe-unsupported compiler flag (#162223)
When running a Bazel build, we (Google) run into the following error.
In certain cases the `-Wctad-maybe-unsupported` warning is promoted to an error and breaks the build.
We therefore propose suppressing the warning so that Bazel builds go more smoothly.

This is the error message we got:
```
c10/util/IntrusiveList.h:166:12: error: 'std::reverse_iterator' may not intend to support class template argument deduction [-Werror,-Wctad-maybe-unsupported]
  166 |     return std::reverse_iterator{end()};
      |            ^
c10/test/util/IntrusiveList_test.cpp:24:18: note: in instantiation of member function 'c10::IntrusiveList<(anonymous namespace)::ListItem>::rbegin' requested here
   24 |     auto it = c1.rbegin();
      |                  ^
c10/test/util/IntrusiveList_test.cpp:43:5: note: in instantiation of function template specialization '(anonymous namespace)::check_containers_equal<(anonymous namespace)::ListItem>' requested here
   43 |     check_containers_equal(l, v);
      |     ^
libcxx/include/__iterator/reverse_iterator.h:51:7: note: add a deduction guide to suppress this warning
   51 | class reverse_iterator
      |       ^
1 error generated.

```

@haifeng-jin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162223
Approved by: https://github.com/ezyang
2025-09-06 06:11:37 +00:00
c98ddaca6d Fixed comment to match logic in distributed_c10d.py (#162158)
The comment was inconsistent with the logic introduced in #162157 and modified in #142216. This update ensures the documentation matches the actual behavior of the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162158
Approved by: https://github.com/wconstab
2025-09-06 05:37:49 +00:00
bc505977fb torch.zeros bound checks for symint (#161976)
Fixes #161490

I added a bounds check for negative symints to create a better error message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161976
Approved by: https://github.com/ezyang
2025-09-06 05:37:42 +00:00
aac1a50a19 Add api info for torch._C._nn.pyi (#162148)
Fix part of #148404

The APIs involved are as follows:

- cross_entropy_loss
- hardsigmoid_
- hardswish
- hardswish_
- huber_loss
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162148
Approved by: https://github.com/FFFrog, https://github.com/ezyang
2025-09-06 05:21:40 +00:00
20b47acef8 [fx] fix qualified name for methods of torch.Tensor (#162224)
Fixes #160077, #154721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162224
Approved by: https://github.com/ezyang
2025-09-06 05:16:19 +00:00
da4db4b33d Fix DeviceMesh._flatten docstring example (#162277)
Fix the `DeviceMesh._flatten` docstring example of use. Alternative fix would be to replace `mesh_3d["dp", "cp"]` with `mesh_3d["cp", "tp"]`.

(I verified the fix using the `gloo` backend)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162277
Approved by: https://github.com/ezyang
2025-09-06 05:00:00 +00:00
a3e5466002 Revert "Resize to 0 if not going to be used (#161730)"
This reverts commit 081cab045472ce045634548cc6c14a4870641e23.

Reverted https://github.com/pytorch/pytorch/pull/161730 on behalf of https://github.com/davidberard98 due to functorch/test_aotdispatch.py::TestAOTModuleSimplified::test_flex_attn_noncontiguous_tangents [GH job link](https://github.com/pytorch/pytorch/actions/runs/17506617662/job/49731934012) [HUD commit link](081cab0454) ([comment](https://github.com/pytorch/pytorch/pull/161730#issuecomment-3260492575))
2025-09-06 04:17:08 +00:00
c0983e6cc0 [Graph Partition] interface for custom cg wrapper (#162207)
This PR adds an interface to allow users to specify custom cudagraph wrapper. User example: [vllm](https://github.com/vllm-project/vllm/pull/24281)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162207
Approved by: https://github.com/zou3519, https://github.com/eellison, https://github.com/ProExpertProg
2025-09-06 03:13:01 +00:00
b2b4add0e7 Docs on export joint with descriptors (#159006)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159006
Approved by: https://github.com/SherlockNoMad
2025-09-06 03:02:58 +00:00
20629b1619 Add contiguous subgraph transformation threshold (#162192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162192
Approved by: https://github.com/coconutruben
2025-09-06 02:48:00 +00:00
c3ceca2995 codebase structure documentation to include torchgen (#162261)
📚 The doc update

adding description about torchgen folder in code structure guide

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162261
Approved by: https://github.com/ezyang
2025-09-06 02:10:57 +00:00
145a3a7bda [CUDA 13][cuDNN] Bump CUDA 13 to cuDNN 9.13.0 (#162268)
Fixes some `d_qk` != `d_v` cases on Hopper that are broken by cuDNN 9.11-9.12

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162268
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-09-06 01:59:03 +00:00
291cd11f2d [inductor] estimate peak memory in codegen only when buffer reuse (#162300)
As titled, this PR ensures peak memory is estimated only when buffer reuse is enabled. Without this config, some nodes' successor nodes are eliminated from memory estimation after inductor bucketing, which can cause errors.

The original codegen peak memory estimation code is from this PR: https://github.com/pytorch/pytorch/pull/159530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162300
Approved by: https://github.com/eellison, https://github.com/v0i0
2025-09-06 01:30:38 +00:00
7f4ff79210 remove deprecated vllm test (#162306)
Fixes https://github.com/pytorch/pytorch/issues/162274

the test is removed from vllm side

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162306
Approved by: https://github.com/malfet
2025-09-06 01:27:13 +00:00
0f45aaf441 Disable autocast when running joint graph passes (#162304)
Fixes #159469. See https://github.com/pytorch/pytorch/issues/159469#issuecomment-3221474027 for root-cause analysis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162304
Approved by: https://github.com/bdhirsh, https://github.com/zou3519, https://github.com/eellison
2025-09-06 00:57:58 +00:00
4f72d932fe re-land triton runtime implementation" (#162217)
Summary: original pr - https://github.com/pytorch/pytorch/pull/161798

Test Plan:
ci

Rollback Plan:

Differential Revision: D81724234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162217
Approved by: https://github.com/SherlockNoMad
2025-09-06 00:52:29 +00:00
1463714833 [dynamo] Graph break on on user-defined class in compiled region (#161670)
Currently, user-defined classes inside of a compiled frame will cause the whole
frame to be skipped by dynamo.  This change defers the Unsupported exception
until the __build_class__ builtin is actually called, which allows a graph break
to be inserted.  Fixes #161562
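
For context, a `class` statement inside a function compiles to a call of the `__build_class__` builtin, which is the point at which the exception is now raised. The sketch below uses the standard `dis` module to show this on CPython; `defines_class` and `uses_build_class` are hypothetical helpers and are unrelated to Dynamo's own code.

```python
import dis


def defines_class():
    # The class statement below is executed through builtins.__build_class__.
    class Point:
        x = 0

    return Point


def uses_build_class(func) -> bool:
    """Report whether the function's bytecode loads __build_class__."""
    return any(
        ins.opname == "LOAD_BUILD_CLASS" for ins in dis.get_instructions(func)
    )
```

Bytecode before and after the `LOAD_BUILD_CLASS` call is ordinary traceable code, which is why deferring the failure to that call leaves room for a graph break rather than skipping the whole frame.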

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161670
Approved by: https://github.com/williamwen42, https://github.com/guilhermeleobas
2025-09-06 00:04:57 +00:00
081cab0454 Resize to 0 if not going to be used (#161730)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #161730
*  #161667

```Py
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((2, 32, 1024), (32768, 1024, 1), torch.float32)
            buf1 = empty_strided_cuda((2, 32, 1024), (32768, 1024, 1), torch.float32)
            buf2 = empty_strided_cuda((2, 32, 1024, 64), (2097152, 65536, 64, 1), torch.float32)
            # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
            stream0 = get_raw_stream(0)
            triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, buf1, arg4_1, arg3_1, arg5_1, arg6_1, buf2, 8, 2, 32, stream=stream0)
            del arg0_1
            del arg1_1
            del arg2_1
            del arg3_1
            del arg4_1
            del arg5_1
            del arg6_1
            del buf0
            del buf1
        return (buf2, )
```

Vs

```Py
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((2, 32, 1024), (32768, 1024, 1), torch.float32)
            buf1 = empty_strided_cuda((0, ), (1, ), torch.float32)
            buf2 = empty_strided_cuda((2, 32, 1024, 64), (2097152, 65536, 64, 1), torch.float32)
            # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
            stream0 = get_raw_stream(0)
            triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, buf1, arg4_1, arg3_1, arg5_1, arg6_1, buf2, 8, 2, 32, stream=stream0)
            del arg0_1
            del arg1_1
            del arg2_1
            del arg3_1
            del arg4_1
            del arg5_1
            del arg6_1
            del buf0
            del buf1
        return (buf2, )
```
<img width="428" height="145" alt="Screenshot 2025-08-28 at 12 37 11 PM" src="https://github.com/user-attachments/assets/240a7bca-97e1-40c4-bf93-f075fdc1a40d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161730
Approved by: https://github.com/Skylion007, https://github.com/BoyuanFeng
ghstack dependencies: #161667
2025-09-05 23:21:46 +00:00
486b20b73c Add return-max-scores to flex-attention (#161667)
# Summary

### Update

API

```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False

class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None

# Usage: request only the auxiliary outputs that are needed.
out_only = flex_attention(query, key, value, score_mod)
out_max, aux_max = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(max_scores=True),
)
out_both, aux_both = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(lse=True, max_scores=True),
)
```

Returns the max post mod scores from flex attention.

Not being able to break BC is kind of annoying here, since we end up with a combinatorial problem: if we need to add any more return values, we need new kwargs that gate whether they get returned by the function, and we need to support the 2**N possible return groups.

Ideally there isn't much more we need to return, but we might want to think about how best to set this up for future expansion. For now I made the new argument keyword-only.

Maybe we make an `ExtraReturns`-type kwarg that can grow, so we don't need to keep adding new top-level args.

We could also return a struct that holds all the extra tensors and start a deprecation cycle for logsumexp, eventually returning just one `ExtraReturns`-like struct with the tensors.

### Req Grad
I currently don't return a max_scores that supports backpropagating gradients. I think this might be feasible, but since max is essentially one-hot on the inputs plus a reduction, we would either need to save another `max_location` from the forward or find the max score and apply the gradient only to the first occurrence if there are multiple equal scores (need to check whether that's how we define it for the vanilla max op in torch).
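
To make the tie-breaking question concrete, here is a pure-Python sketch of a max whose gradient is one-hot at the first occurrence of the maximum. This is only one possible convention, chosen for illustration; it is not a statement about how `torch.max` or flex attention behave, and `max_with_first_argmax` and `one_hot_grad` are hypothetical names.

```python
def max_with_first_argmax(scores):
    """Return (max value, index of its first occurrence)."""
    best_idx = 0
    for i, s in enumerate(scores):
        # Strict comparison keeps the earliest index when scores tie.
        if s > scores[best_idx]:
            best_idx = i
    return scores[best_idx], best_idx


def one_hot_grad(scores, upstream=1.0):
    """Route the upstream gradient to the saved max location only."""
    _, idx = max_with_first_argmax(scores)
    return [upstream if i == idx else 0.0 for i in range(len(scores))]
```

Saving the index in the forward pass plays the role of the `max_location` mentioned above; without it, the backward pass would have to recompute the location and apply the same tie rule.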

For now no grad, we can re-visit if needed.

## Perf
I am going to disable this for flex_decode, since at least initially the motivation is training. It is also harder than it should be to have ops return `None`s or optional tensors. If return max is false, we should probably just create a tensor of size zero so that we don't slow down the hot path.

```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,     ┆ 249.514658    ┆ 243.078974   ┆ 6.435684  ┆ 2.647569  │
│                ┆                ┆ 2048, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 57.971274     ┆ 56.633641    ┆ 1.337633  ┆ 2.361905  │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 280.71254     ┆ 275.686991   ┆ 5.025549  ┆ 1.822918  │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,    ┆ 152.970031    ┆ 150.489109   ┆ 2.480923  ┆ 1.648573  │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘

🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)  ┆ TFlops (base) ┆ TFlops (max) ┆ delta    ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                    ┆ ---           ┆ ---          ┆ ---      ┆ ---       │
│ str            ┆ str            ┆ str                    ┆ f64           ┆ f64          ┆ f64      ┆ f64       │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,      ┆ 249.514658    ┆ 243.078974   ┆ 6.435684 ┆ 2.647569  │
│                ┆                ┆ 2048, 64)              ┆               ┆              ┆          ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 57.971274     ┆ 56.633641    ┆ 1.337633 ┆ 2.361905  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 280.71254     ┆ 275.686991   ┆ 5.025549 ┆ 1.822918  │
│                ┆                ┆ 1024, 128)             ┆               ┆              ┆          ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,     ┆ 152.970031    ┆ 150.489109   ┆ 2.480923 ┆ 1.648573  │
│                ┆                ┆ 16384, 64)             ┆               ┆              ┆          ┆           │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,      ┆ 161.031318    ┆ 158.597808   ┆ 2.43351  ┆ 1.534391  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘

🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4,      ┆ 175.546923    ┆ 177.81205    ┆ -2.265127 ┆ -1.273888 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4,     ┆ 156.282597    ┆ 158.209134   ┆ -1.926537 ┆ -1.217715 │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16,     ┆ 232.542929    ┆ 235.140136   ┆ -2.597207 ┆ -1.104536 │
│                ┆                ┆ 2048, 128)            ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 169.652791    ┆ 171.475986   ┆ -1.823195 ┆ -1.063236 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2025-09-05 23:21:46 +00:00
4d4abec80f allow user to pass in custom partitioner function (#157580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157580
Approved by: https://github.com/bdhirsh
2025-09-05 22:49:39 +00:00
9c03d6be87 [CD][BE] Delete Python-3.9 case (#162265)
And raise error when building for an unsupported version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162265
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
ghstack dependencies: #162297
2025-09-05 22:46:36 +00:00
8d50355d97 [CD][EZ] Update libtorch python version to 3.10 (#162297)
Not sure why it was at 3.9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162297
Approved by: https://github.com/clee2000, https://github.com/atalman
2025-09-05 22:46:36 +00:00
e0a62b266c [aot-precompile] default-filter global guards (#162090)
if the user doesn't provide their own guard filter fn, we should by default filter global guards.

pytest test/dynamo/test_aot_compile.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162090
Approved by: https://github.com/zhxchen17
2025-09-05 22:44:55 +00:00
01ab325cc2 [DCP][Quantization] Fix the issue when scale vector is in a different SafeTensors file (#162214)
Summary: The current dequantization implementation assumes that the weight and scale tensors are in the same SafeTensors file. This diff fixes the issue to support the case where they are in different files.
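
A minimal sketch of the lookup idea, not the DCP implementation: assuming a `weight_map` dict that maps each tensor name to the file holding it, the weight and its scale are resolved independently. The function name `files_to_open` and the tensor and file names are hypothetical.

```python
def files_to_open(weight_name: str, scale_name: str, weight_map: dict[str, str]) -> list[str]:
    # Resolve each tensor's file on its own instead of assuming they share one.
    files = [weight_map[weight_name], weight_map[scale_name]]
    # Drop duplicates while keeping order (same-file case collapses to one entry).
    return list(dict.fromkeys(files))


# Hypothetical index: the scale lives in a different shard than the weight.
weight_map = {
    "layer.0.weight": "model-00001.safetensors",
    "layer.0.weight_scale": "model-00002.safetensors",
    "layer.1.weight": "model-00002.safetensors",
    "layer.1.weight_scale": "model-00002.safetensors",
}

split = files_to_open("layer.0.weight", "layer.0.weight_scale", weight_map)
same = files_to_open("layer.1.weight", "layer.1.weight_scale", weight_map)
```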

Test Plan:
buck test fbcode//caffe2/test/distributed/checkpoint\:test_quantized_hf_storage

Buck UI: https://www.internalfb.com/buck2/532bf151-bb40-41fd-b080-ff898675afe2
Test UI: https://www.internalfb.com/intern/testinfra/testrun/15199648851011082

Rollback Plan:

Differential Revision: D81718598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162214
Approved by: https://github.com/wwwjn
2025-09-05 22:43:58 +00:00
79fcd5247a symbolic cpp channels_last_contiguous (#160402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160402
Approved by: https://github.com/aorenste
2025-09-05 21:40:32 +00:00
70d36e047d Making batching rule for F.embedding DTensor-aware (#162117)
`vmap(F.embedding)(DTensor, DTensor)` was failing because F.embedding's
batching rule generates a new tensor via at::arange, at::arange
generates a regular tensor, and DTensor rightfully errors on mixed
DTensor-regular Tensor operations.

This PR fixes the problem by activating DTensor implicit replication on
just the at::arange and the subsequent add operation.

In order to accomplish this I move the DTensor implicit replication flag
to C++ (most batching rules are in C++).

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162117
Approved by: https://github.com/bdhirsh
2025-09-05 21:40:14 +00:00
a00cdc1e41 [CD][BE] Get rid of SETUPTOOLS and PYYAML extra pins (#162266)
As those weren't really pins to begin with, and requirements.txt
already has those
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162266
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
ghstack dependencies: #162263, #162264
2025-09-05 21:32:52 +00:00
c10195e723 [C10d][Gloo] Enable complex datatype support in ProcessGroupGloo (#156633)
- Enable communication of tensors with Complex datatype in ProcessGroupGloo, similar to how ProcessGroupNCCL handles it.
- Move a function, which checks if Complex datatype is supported by a reduce operation, from ProcessGroupNCCL.cpp into a new file to be shared with ProcessGroupGloo.

Fixes #156632

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156633
Approved by: https://github.com/d4l3k
2025-09-05 21:24:36 +00:00
771f369448 [Inductor] Improve RoPE (#161420)
This PR fuses ROPE from 2 kernels into 1 kernel.

Shape:
```
q: [B, Hq, S, D]
k: [B, Hkv, S, D]
```

`Hq=32, Hkv=8, D=128` following Llama3 setting.

<img width="980" height="624" alt="image" src="https://github.com/user-attachments/assets/652a8227-6f1d-465c-97fd-2b0af41f8ed9" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161420
Approved by: https://github.com/shunting314
2025-09-05 20:55:20 +00:00
92a43025e0 [cutlass backend] Add FP8 tests for multiple linears (#160782)
Adding a test that is closer to a real use case. Thanks @mlazos for fixing a few issues so this test works for most cases.

We still have to skip the AOTI and dynamic case due to accuracy issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160782
Approved by: https://github.com/mlazos
2025-09-05 20:23:25 +00:00
2fa0520a64 [BE][pytree] cleanup parameterized pytree tests (#160842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160842
Approved by: https://github.com/Skylion007
2025-09-05 20:15:29 +00:00
01edcd4df8 Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-05 20:15:11 +00:00
de893e96c7 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-05 20:15:11 +00:00
6087ef41e5 [BE] Cleanup stale comments/copy from gemm (#162001)
Followup after https://github.com/pytorch/pytorch/pull/154012

Since the introduction of `gemm_no_downcast_stub` it's no longer necessary to allocate a temporary array and then manually implement the `beta` logic in the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162001
Approved by: https://github.com/drisspg
ghstack dependencies: #161999
2025-09-05 19:59:51 +00:00
a3c7f77e50 [EZ][CD] Update MacOS deployment platform to 11.0 (#162264)
Fixes the following warning
```
MACOSX_DEPLOYMENT_TARGET is set to a lower value (10.15) than the version on which the Python interpreter was compiled (11.0)
```
Update deployment platform in `README.MD` as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162264
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
ghstack dependencies: #162263
2025-09-05 19:58:04 +00:00
3771380f83 [ONNX] Hide draft export under a flag (#162225)
Use `TORCH_ONNX_ENABLE_DRAFT_EXPORT` to control whether draft_export should be used as a strategy in onnx export.
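
A rough sketch of gating a strategy behind the environment variable. Only the variable name comes from this commit; the parsing rule (enabled when the value is exactly `"1"`), the helper names, and the strategy names are assumptions made for illustration.

```python
import os


def draft_export_enabled(environ=os.environ) -> bool:
    # Assumption: the flag counts as set only when the value is exactly "1".
    return environ.get("TORCH_ONNX_ENABLE_DRAFT_EXPORT") == "1"


def export_strategies(environ=os.environ) -> list[str]:
    # Illustrative strategy names, not the exporter's real list.
    strategies = ["strict", "non_strict"]
    if draft_export_enabled(environ):
        strategies.append("draft_export")
    return strategies


default_strategies = export_strategies({})
opted_in = export_strategies({"TORCH_ONNX_ENABLE_DRAFT_EXPORT": "1"})
```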

Follow up of https://github.com/pytorch/pytorch/pull/161454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162225
Approved by: https://github.com/xadupre, https://github.com/titaiwangms
2025-09-05 19:54:50 +00:00
adae7f66aa Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit c37103234afc832dcad307e9016230810957c9d5.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011))
2025-09-05 18:58:47 +00:00
70f865ac9b Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit ef3be6726f7ff4b77c22db10cec5b686f9107ea9.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011))
2025-09-05 18:58:47 +00:00
88d94d17e8 Add torch.Tensor._make_dtensor to accelerate DTensor.__new__ further (#161590)
This seems to be a (very very roughly) ~8% improvement on DTensor benchmark very similar to the benchmark from #160580 (120ish usec -> 110ish usec)

Differential Revision: [D81530105](https://our.internmc.facebook.com/intern/diff/D81530105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161590
Approved by: https://github.com/albanD
ghstack dependencies: #161466, #161586
2025-09-05 18:43:41 +00:00
c321111499 [inductor][ez] V.choices.get_mm_configs returns list of ChoiceCallers (#161348)
# why

- every callsite just executes the generator on the spot
- previous pr adds the ability to add an override before expensive
  generators are executed, so we don't need this generator anymore

# what

- rather than yielding the ChoiceCaller, just return the list of all
  valid ChoiceCallers
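
A minimal sketch of the shape of this change, using hypothetical stand-ins (`make_choice`, `get_configs_gen`, and `get_configs_list` are not inductor APIs): when every caller drains the generator immediately, returning a list gives the same result.

```python
from typing import Iterator


def make_choice(template: str, cfg: dict) -> str:
    # Hypothetical stand-in for building a ChoiceCaller.
    return f"{template}({cfg['block']})"


def get_configs_gen(template: str, cfgs: list[dict]) -> Iterator[str]:
    # Before: lazily yield each choice.
    for cfg in cfgs:
        yield make_choice(template, cfg)


def get_configs_list(template: str, cfgs: list[dict]) -> list[str]:
    # After: build and return all valid choices at once.
    return [make_choice(template, cfg) for cfg in cfgs]


cfgs = [{"block": 64}, {"block": 128}]
before = list(get_configs_gen("mm", cfgs))  # callsites drained the generator on the spot
after = get_configs_list("mm", cfgs)
```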

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520574](https://our.internmc.facebook.com/intern/diff/D81520574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161348
Approved by: https://github.com/eellison
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345, #161346, #161347
2025-09-05 18:02:53 +00:00
9a8d454c46 [inductor] add kernel template choice (ktc) (#161347)
# why

- gather everything up to make choices, without running
  potentially expensive generators
- enables overrides where we toss the entire list of configs
  from inductor, without having to enumerate it (expensive)

# what

- add a holding class that just gets all the components necessary
  to generate a ChoiceCaller
- use that class to generate ChoiceCallers
- this does not (yet) add the override function, but just prepares
  the scene

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161347
Approved by: https://github.com/eellison
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345, #161346
2025-09-05 18:02:53 +00:00
e02e9edb55 [inductor] V.choice.get_mm_configs takes a stack of templates (#161346)
# why

- enables us to just gather relevant templates and get all
  choices at once
- that in turn allows us to make op-wide override decisions

# what

- V.choice.get_mm_configs takes a stack of templates
- all callsites just provide a stack of size 1 right now
  but do not merge everything yet (other features pending)

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520583](https://our.internmc.facebook.com/intern/diff/D81520583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161346
Approved by: https://github.com/eellison
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345
2025-09-05 18:02:46 +00:00
d63ad53a99 [inductor][ez] return choicecallers directly (#161345)
# why

- remove repeat patterns
- we have everything to make the choicecallers
  - templates
  - input_nodes
  - layouts
  - all the kwargs

# what

- yield a choicecaller directly from V.choices.get_mm_configs

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520577](https://our.internmc.facebook.com/intern/diff/D81520577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161345
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344
2025-09-05 18:02:38 +00:00
031d79cb51 [inductor] move max-autotune logic inside V.choices.get_mm_configs (#161344)
# why

- heuristics providers now decide whether to add choices (and which ones)
  in the max-autotune case
- enables an eventual override point to gracefully fallback to the
  standard behavior

# what

- max-autotune is determined inside V.choices.get_mm_configs
  because it's mm only right now, we can just do
  `config.max_autotune or config.max_autotune_gemm`
  a TODO indicates that this can change in the future when this
  expands to more templates
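
A simplified sketch of deciding max-autotune inside the choice function rather than at each callsite. Only the expression `config.max_autotune or config.max_autotune_gemm` comes from the commit; the `Config` dataclass, the string templates, and the rule that `aten::` choices are always kept are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Toy stand-in for the inductor config object.
    max_autotune: bool = False
    max_autotune_gemm: bool = False


def get_mm_configs(templates: list[str], config: Config) -> list[str]:
    # mm-only for now, so either flag turns on the max-autotune choices.
    max_autotune = config.max_autotune or config.max_autotune_gemm
    choices = []
    for template in templates:
        # Assumption: extern (aten) choices are always kept.
        if template.startswith("aten::") or max_autotune:
            choices.append(template)
    return choices


templates = ["aten::mm", "triton::mm"]
default_choices = get_mm_configs(templates, Config())
tuned_choices = get_mm_configs(templates, Config(max_autotune_gemm=True))
```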

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520573](https://our.internmc.facebook.com/intern/diff/D81520573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161344
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341, #161342, #161343
2025-09-05 18:02:30 +00:00
a301dc3b60 [inductor][ez] pass template rather than template.uid (#161343)
# why

- simpler interface
- enables future of extracting more things out of the template e.g. a
  hash

# what

V.choices.get_mm_configs now takes the whole template rather than just
the template.uid

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520576](https://our.internmc.facebook.com/intern/diff/D81520576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161343
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341, #161342
2025-09-05 18:02:22 +00:00
af590cb729 [inductor][aten] treat like a template in GEMMs (#161342)
# why

- central point to analyze and override all generated choices

# what

- add a pseudo heuristic for aten that just yields a single, empty
  kwargs
- add a pseudo heuristic with the bias_addmm logic for it
- add an addmm specific heuristic that yields a single choice, but
  also expands it with alpha and beta kwargs

- replace all the aten.bind calls with V.choices.get_mm_configs
  using the now matching API for aten

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520580](https://our.internmc.facebook.com/intern/diff/D81520580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161342
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341
2025-09-05 18:02:10 +00:00
4902c76c65 [inductor][ez] add template/externchoice uid (#161341)
# why

- to have a central registry of templates/externkernelchoice
  to match them to heuristics etc, they need unique names
- mm is both the triton template name and the aten_mm name

# what

- add a uid() to KernelTemplate/ExternKernelChoice that returns name
- override in ExternKernel to prepend "aten::"
- override in TritonTemplate to prepend "triton::"

This id is just used to find template heuristics, so it has no other
impact
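
A small sketch of the naming scheme described above. The classes are simplified stand-ins that share only their names with inductor's, and the `heuristics` registry is hypothetical; the point is that two kernels both named `mm` get distinct keys.

```python
class KernelTemplate:
    def __init__(self, name: str) -> None:
        self.name = name

    def uid(self) -> str:
        # Base behaviour: the uid is just the name.
        return self.name


class TritonTemplate(KernelTemplate):
    def uid(self) -> str:
        return "triton::" + self.name


class ExternKernelChoice:
    def __init__(self, name: str) -> None:
        self.name = name

    def uid(self) -> str:
        return "aten::" + self.name


# Both kernels are called "mm", but their uids no longer collide.
kernels = [TritonTemplate("mm"), ExternKernelChoice("mm")]
heuristics = {k.uid(): f"heuristic for {k.uid()}" for k in kernels}
```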

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520579](https://our.internmc.facebook.com/intern/diff/D81520579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161341
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162075, #161340
2025-09-05 18:01:58 +00:00
9602590b15 [inductor] move scaled_mm input nodes logic (#161340)
# why

- a step towards a unified interface for all choices, where any
  adjustment to nodes (e.g. unsqueezing) happens as part of
  choice specific preprocessing, behind a common point

# what

- move the unsqueeze logic for triton nodes for scaled_mm inside
  the new hookup for adjusting the kernel inputs for template
  heuristics

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "scale"
```

Differential Revision: [D81520582](https://our.internmc.facebook.com/intern/diff/D81520582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161340
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162075
2025-09-05 18:01:44 +00:00
2ef665ae19 [inductor][contigous mm] mild refactor (#162075)
# why

- use the new heuristics logic better to handle kwargs

# what

- move all checks into the heuristics to yield a single choice or no
  choices if the decomposition should not be used
- fix `hip` device type, which should be `cuda`
- let heuristics handle the kwarg passing

# testing

in ci

Differential Revision: [D81706776](https://our.internmc.facebook.com/intern/diff/D81706776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162075
Approved by: https://github.com/exclamaforte, https://github.com/jansel
2025-09-05 18:01:07 +00:00
b18bb6796f Add const to stable amax (#162082)
Fixes https://github.com/pytorch/pytorch/issues/161826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162082
Approved by: https://github.com/soulitzer
2025-09-05 17:37:49 +00:00
d711f27845 Revert "[ROCm] [CK] Composable Kernel integration for inductor backend (#158747)"
This reverts commit 019fed39aa6b2dd8c69347378d53423e5efae8d4.

Reverted https://github.com/pytorch/pytorch/pull/158747 on behalf of https://github.com/jithunnair-amd due to Broke linux-binary-manywheel-rocm / manywheel-py3_9-rocm6_4-test: 019fed39aa/1 ... PR didn't have this job run successfully due to CI outage ([comment](https://github.com/pytorch/pytorch/pull/158747#issuecomment-3259212343))
2025-09-05 17:27:45 +00:00
261a84a176 [CD][BE] Remove unnecessary checks for XCode version (#162263)
None of them have worked for a while; PyTorch for Mac is built with
XCode-15.4
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162263
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-09-05 17:02:36 +00:00
98374612fc [Intel GPU] Update Intel triton commit pin to Triton 3.5.x (#161777)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161777
Approved by: https://github.com/EikanWang
2025-09-05 16:55:47 +00:00
c2a3024617 [cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)
Following #157905 I think the macro around
```
  TORCH_INTERNAL_ASSERT(use_rowwise == false, "rowwise scaled_gemm not supported with blaslt");
```
was never updated and this would cause `float8` tests to fail. Also it appears the `Lt` accepts two inputs with `e4m3` and `e5m2` dtypes simultaneously, so removing that check here as well...

CC @lw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161305
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-05 16:55:09 +00:00
b2c7b9ad2d [Intel GPU][FlexAttention] Enable TMA path on Intel GPU (#162138)
The existing `can_use_tma` has some conditions that are unnecessary for Intel GPUs.
We have removed these useless conditions on the Intel GPU path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162138
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/jansel, https://github.com/etaf
2025-09-05 16:54:51 +00:00
f3cebec39e Revert "Rename propagate_tensor_meta to make private again (#161744)"
This reverts commit 734ce8eba9c69381f187359bf0fef1d71d84cd20.

Reverted https://github.com/pytorch/pytorch/pull/161744 on behalf of https://github.com/jeanschmidt due to seems to break internal tests, see D81657000 for more details ([comment](https://github.com/pytorch/pytorch/pull/161744#issuecomment-3258934519))
2025-09-05 16:20:29 +00:00
06da7c0730 [DCP][Quantization] Fix for FP8 multiplication during dequantization (#162202)
Summary:
The weight vector needs to be upcast since some FP8 formats (like Float8_e4m3fn) don't have CPU implementations in PyTorch. Reference: https://docs.pytorch.org/docs/stable/tensors.html#id13

We will use FP32 for the scale vector multiplication and convert to the target dtype.

Upcasting helps with the following:

1.  **Full CPU support**: `float32` has complete CPU kernel implementations for all operations
2.  **Numerical stability**: `float32` provides more precision during intermediate calculations
3.  **Compatibility**: Works across all devices (CPU/GPU) and PyTorch versions

Test Plan:
UTs

Rollback Plan:

Differential Revision: D81711093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162202
Approved by: https://github.com/wwwjn
2025-09-05 16:06:21 +00:00
2dd529df00 A basic CLAUDE.md based on bad things I see claude code doing (#162163)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162163
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-05 14:52:36 +00:00
a714437093 [ez][inductor] add a few outer dimension reduction cases for LOAF (#162028)
For the failure-to-fuse issue reported in https://github.com/pytorch/pytorch/issues/93718, LOAF can fuse the outer-dimension softmax into a single kernel, which brings a 1.87x speedup for the example shape mentioned in the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162028
Approved by: https://github.com/jansel, https://github.com/eellison
2025-09-05 09:30:13 +00:00
bffc7dd1f3 [CD] Add cuda 13.0 libtorch builds, remove CUDA 12.9 builds (#161916)
Related to https://github.com/pytorch/pytorch/issues/159779

Adding CUDA 13.0 libtorch builds, followup after https://github.com/pytorch/pytorch/pull/160956
Removing CUDA 12.9 builds, See https://github.com/pytorch/pytorch/issues/159980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161916
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007

Co-authored-by: Ting Lu <tingl@nvidia.com>
2025-09-05 07:47:54 +00:00
5c473e9f5e [1/N] Port 5 _composable/fsdp distributed test cases to Intel GPU (#159118)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU.
We can enable Intel GPU with the following methods while trying our best to keep the original code style:

- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- enable XPU for some test paths
- skip some test cases which Intel GPU does not support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159118
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-05 05:52:15 +00:00
5da573c42c [PGO] handle PGO profile merges (#162097)
Avoid merges from an extra PGO key if the same source has a different rank. This is unlikely to happen (it needs a code hash match and a change in the source variable type), but we are being safe.

Differential Revision: D81299840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162097
Approved by: https://github.com/bobrenjc93
2025-09-05 04:58:15 +00:00
494878a11b [audio hash update] update the pinned audio hash (#162114)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162114
Approved by: https://github.com/pytorchbot
2025-09-05 04:32:16 +00:00
3bbc2e3e4f [vllm hash update] update the pinned vllm hash (#162226)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162226
Approved by: https://github.com/pytorchbot
2025-09-05 04:32:08 +00:00
b67c410398 [BE] [Inductor] Add Kernel name to all coor-desc tuning (#161409)
Summary: When running coordinate descent tuning the logging is difficult to parse if the results are parallelized at all. This includes the kernel name in each step so post-processing can unify the results, even if run in parallel.
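
A rough sketch of why the kernel name helps post-processing. The log format, helper names, and kernel names are made up for illustration and are not inductor's actual output: once each line carries its kernel, interleaved output from parallel runs can be regrouped.

```python
from collections import defaultdict


def log_line(kernel_name: str, message: str) -> str:
    # Prefix every tuning step with the kernel it belongs to.
    return f"[{kernel_name}] {message}"


def group_by_kernel(lines: list[str]) -> dict[str, list[str]]:
    # Split "[name] message" back apart and collect messages per kernel.
    grouped = defaultdict(list)
    for line in lines:
        name, _, message = line.partition("] ")
        grouped[name.lstrip("[")].append(message)
    return dict(grouped)


# Lines from two kernels tuned in parallel arrive interleaved.
interleaved = [
    log_line("triton_poi_0", "try XBLOCK=64"),
    log_line("triton_red_1", "try RBLOCK=32"),
    log_line("triton_poi_0", "try XBLOCK=128"),
]
grouped = group_by_kernel(interleaved)
```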

Test Plan:
NFC. Just a logging change.

Rollback Plan:

Differential Revision: D80942794

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161409
Approved by: https://github.com/PaulZhang12
2025-09-05 02:53:13 +00:00
be5b03dde9 Allow for using a dedicated binary for the torch subproc pool. (#162093)
Summary:
The binary torch is running inside of can be larger than needed and in certain
situations, this can cause a loss of memory.

Test Plan:
We've manually run tests via
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_WORKER_SUPPRESS_LOGGING=0
make mc8-train-publish-cint-datafm-toy -C
minimal_viable_ai/models/ifr_mtml/main_v1/ 2>&1 | tee ~/run_out
```
and overriding the binary used to be the built fbpkg in /packages.

We've also kicked off manual runs at
```
fire-feid-20250903-1051-ae8c6827
```

Which do show the binary running -  https://fburl.com/scuba/procprint/e6lwv32m

Rollback Plan:
steps:
  - jk.update:
      jk: pytorch/compiler:subproc_worker_binary
      constant_bool: null
      consistent_pass_rate: null
      fractional_host_rollout: null
      sampling_rate: null
  - manual.note:
      content: ''

Differential Revision: D81616624

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162093
Approved by: https://github.com/masnesral
2025-09-05 01:43:46 +00:00
73eb4511fb [B200][NVFP4] Fix argument passing in test_blockwise_mxfp8_nvfp4_mxfp4_numerics_ (#162185)
to unblock https://github.com/pytorch/pytorch/pull/159494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162185
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-09-05 01:24:59 +00:00
29280864d9 Add new parameter for gen_pyi.py to make it more configureable. (#161772)
This is a reposting of PR #128519.
This change is important to how we maintain PyTorch at Google.

From the previous PR:
"
This will make the script more flexible for the directory where it is executed.
...
We plan to use the deprecated_yaml from a blaze genrule that invokes pyi.py. As the input to the pyi.py, genrule requires the input file to be explicitly listed out. When we feed the value of tools/autograd/deprecated.yaml to genrule, it failed to resolve since tools/autograd is a package from blaze's perspective. Any file under a blaze package will need a proper blaze target to be accessed.
"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161772
Approved by: https://github.com/albanD

Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com>
2025-09-05 00:48:15 +00:00
5c67426d68 [dynamo] Add support for const prop on .item (#162204)
Fixes some of the errors in https://fb.workplace.com/groups/1028545332188949/permalink/1303030824740397/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162204
Approved by: https://github.com/williamwen42
2025-09-05 00:28:49 +00:00
d2d4c8e9b2 [BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999)
Followup after https://github.com/pytorch/pytorch/pull/154012

Fixes CPU part of https://github.com/pytorch/pytorch/issues/160841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161999
Approved by: https://github.com/drisspg
2025-09-04 23:35:27 +00:00
c7e41071a0 [B200][MXFP8] Fix regex in test_blockwise_mxfp8_nvfp4_error_messages_recipe_mxfp8_cuda (#162180)
to unblock https://github.com/pytorch/pytorch/pull/159494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162180
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/nWEIdia
2025-09-04 23:29:10 +00:00
9499c8761c [Inductor][Intel GPU] Register triton template heuristic for addmm tma. (#162132)
Fixes #162048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162132
Approved by: https://github.com/jansel
2025-09-04 23:01:57 +00:00
3a207816cc Forward fix for user defined triton kernel grid calc (#162162)
Summary:

This change fixes the test: inductor:fxir_backend - test_custom_triton_autotune_dynamic which was broken by https://github.com/pytorch/pytorch/pull/160997

Test Plan:
inductor:fxir_backend - test_custom_triton_autotune_dynamic

Rollback Plan:

Differential Revision: D81679217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162162
Approved by: https://github.com/eellison, https://github.com/jansel
2025-09-04 22:51:23 +00:00
09be1890d7 [export] Fix torch.export.load with storage offset (#162172)
Summary: As titled

Test Plan:
CI

Rollback Plan:

Differential Revision: D81687701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162172
Approved by: https://github.com/angelayi
2025-09-04 22:50:33 +00:00
0d84ff3b78 [PGO] log add_extra_remote PGO to tlparse (#161751)
Summary: log when additional PGO profile is merged in, from added read key

Test Plan:
test_pgo

Rollback Plan:

Differential Revision: D81284190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161751
Approved by: https://github.com/bobrenjc93
2025-09-04 22:47:03 +00:00
1ec2c15914 Revert "Fix Arm64 OSS pytorch build with FBGEMM (#161527)"
This reverts commit dbec08729fb9848bebed6048c63831b87170d061.

Reverted https://github.com/pytorch/pytorch/pull/161527 on behalf of https://github.com/malfet due to This breaks all Mac builds, see b04e922712/1 ([comment](https://github.com/pytorch/pytorch/pull/161527#issuecomment-3256034443))
2025-09-04 22:29:38 +00:00
b04e922712 Fix memory leak in AOTI when calling aoti_torch_as_strided (#162118)
Summary:
Fix memory leak in AOTI when calling `aoti_torch_as_strided`

If you have something like `AtenTensorHandle buf_handle`; and you allocated memory to it, you have to make it a `RAIIAtenTensorHandle` to release the ownership. Otherwise you have leaked the memory because even when the program ends, there's still a pointer pointing to the underlying storage of `buf_handle_restrided`, and the storage is never freed.

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_pad_non_zero_memory_leak
```

Also verified by looking at `print(f"Allocated memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")`

Differential Revision: D81640339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162118
Approved by: https://github.com/angelayi
2025-09-04 22:17:06 +00:00
0d71a9dd5b fix incorrect interaction between DDPOptimizer and donated buffers (#160745)
This should fix https://x.com/wightmanr/status/1953147089518772254?t=ng_R4t0-tRhO_qQE8NqOhw&s=19. Still working on adding a reasonable test.

You can see more of a description of the problem in the code comments. But the TLDR is that:

* When using DDPOptimizer, we partition the graph and compile several subgraphs. So 1 dynamo graph becomes N AOT/inductor artifacts
* We have some existing logic to stash graph metadata (`fw_metadata`) in dynamo's TracingContext. When using DDPOptimizer, we generate one `fw_metadata` per **AOT** graph, and we stash it on the 1 TracingContext from dynamo. So we end up clobbering the `fw_metadata` for graph i-1 when AOT and inductor start compiling graph i
* This is normally ok, but it becomes a problem if inductor ever wants to read from this `fw_metadata` during **backward compilation**. Why? We (by default) compile the backwards lazily. So when using DDPOptimizer, we will compile backward graph N, then bw graph N-1, etc. But... at the time that we have started compiling bw graph N-1, its corresponding fw_metadata has already been clobbered! So we end up reusing graph N's metadata for all of our backward graph compilations. With donated buffer metadata, that means we end up donating and writing into incorrect input buffers

The fix that I added was to add more dedicated DDPOptimizer metadata into the TracingContext, so we can properly switch between these N different `fw_metadata` objects in the backward.
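
The clobbering described above, and the general shape of the fix, can be sketched in plain Python. This is only a toy model: `ToyTracingContext`, `compile_forward`, and the `graph_id` key are hypothetical names chosen for illustration, not the real dynamo/AOT APIs.

```python
# Toy model of the bug: one shared context, N forward graphs, lazy backward.
# All names here are hypothetical; this is not the real TracingContext API.

class ToyTracingContext:
    def __init__(self):
        self.fw_metadata = None          # single slot: overwritten by each graph
        self.fw_metadata_by_graph = {}   # per-graph slots: the shape of the fix


def compile_forward(ctx, graph_id):
    metadata = {"graph": graph_id, "donated_buffers": [f"buf_{graph_id}"]}
    ctx.fw_metadata = metadata                     # clobbers the previous graph
    ctx.fw_metadata_by_graph[graph_id] = metadata  # kept separately per graph


def backward_metadata_buggy(ctx, graph_id):
    # Reads whatever the last forward compilation left behind.
    return ctx.fw_metadata


def backward_metadata_fixed(ctx, graph_id):
    # Switches to the metadata that belongs to this backward graph.
    return ctx.fw_metadata_by_graph[graph_id]


ctx = ToyTracingContext()
num_graphs = 3
for g in range(num_graphs):
    compile_forward(ctx, g)

# Backward graphs are compiled lazily, in reverse order.
order = list(reversed(range(num_graphs)))
buggy = [backward_metadata_buggy(ctx, g)["graph"] for g in order]
fixed = [backward_metadata_fixed(ctx, g)["graph"] for g in order]
print(buggy)  # every backward sees the last forward graph's metadata
print(fixed)  # each backward sees its own metadata
```

In the buggy variant every backward compilation reads the last forward graph's donated-buffer list, which is the failure mode described in the bullets.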

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160745
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-04 21:57:27 +00:00
89d41d3f61 [SymmMem] Feed tensor.data_ptr instead of handle.buffer_ptr into kernels (#162193)
After MemPool support, `get_buffer_ptrs` points to base address of allocation segment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162193
Approved by: https://github.com/ngimel
2025-09-04 21:26:05 +00:00
9bdcee01f8 [SymmMem] Add root argument to broadcast op (#161090)
It was missing earlier. Also added range check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161090
Approved by: https://github.com/fegin
2025-09-04 21:09:54 +00:00
b9ba612f7a [ROCm] Enabling several UTs (#161715)
All these UTs are working as is, just removing the skip
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-09-04 20:43:03 +00:00
d5b38410b5 Revert "[SymmMem] Add root argument to broadcast op (#161090)"
This reverts commit 3c0ff1b569c45cfa6935ad8031a9d4cf1551aa3f.

Reverted https://github.com/pytorch/pytorch/pull/161090 on behalf of https://github.com/jeanschmidt due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/161090#issuecomment-3255574093))
2025-09-04 20:42:31 +00:00
48bedd753d Revert "Fix usage of forwarding references (#161094)"
This reverts commit 1ebd70d0c0d562d3be9abdee2a21906584af7d99.

Reverted https://github.com/pytorch/pytorch/pull/161094 on behalf of https://github.com/jeanschmidt due to checking if revert will fix https://github.com/pytorch/pytorch/actions/runs/17470601839/job/49621447581 ([comment](https://github.com/pytorch/pytorch/pull/161094#issuecomment-3255541480))
2025-09-04 20:35:41 +00:00
a3d72b09ae Apply Triton tensor descriptor for flex-decoding for performance (#161643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161643
Approved by: https://github.com/drisspg
2025-09-04 20:10:41 +00:00
ef3be6726f Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-04 20:05:50 +00:00
95ee0bfea9 Revert "[nativert] triton runtime implementation (#161798)"
This reverts commit 3dde5d7f9bf80dd6623a712bc429e9e4302464b5.

Reverted https://github.com/pytorch/pytorch/pull/161798 on behalf of https://github.com/jeanschmidt due to introducing linting failures ([comment](https://github.com/pytorch/pytorch/pull/161798#issuecomment-3255412085))
2025-09-04 20:05:24 +00:00
dbec08729f Fix Arm64 OSS pytorch build with FBGEMM (#161527)
Summary:
X-link: https://github.com/pytorch/FBGEMM/pull/4775

Without this change, Arm64 OSS pytorch build with FBGEMM failed with the following error.
Undefined symbols for architecture arm64:
  "fbgemm::FindMinMax(float const*, float*, float*, long long)", referenced from:
      at::native::fbgemm_linear_int8_weight_fp32_activation(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&, at::Tensor const&) in QuantizedLinear.cpp.o
      at::native::fbgemm_linear_quantize_weight(at::Tensor const&) in QuantizedLinear.cpp.o
      PackedConvWeight<2>::apply_dynamic(at::Tensor const&, bool) in qconv_dynamic.cpp.o
      PackedConvWeight<3>::apply_dynamic(at::Tensor const&, bool) in qconv_dynamic.cpp.o
      at::Tensor PackedLinearWeight::apply_dynamic_impl<false>(at::Tensor, bool) in qlinear_dynamic.cpp.o
      at::Tensor PackedLinearWeight::apply_dynamic_impl<true>(at::Tensor, bool) in qlinear_dynamic.cpp.o
ld: symbol(s) not found for architecture arm64

This change fixed the issue by moving FindMinMax's implementation from QuantUtilsAvx2.cc to QuantUtils.cc. FindMinMax is a platform-agnostic function with AVX2-specific optimizations so conceptually it can be put in QuantUtils.cc.

Test Plan:
With this change, Arm64 OSS pytorch built successfully with FBGEMM enabled.

Rollback Plan:

Reviewed By: q10

Differential Revision: D81052327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161527
Approved by: https://github.com/q10
2025-09-04 20:01:13 +00:00
c3d54dea9f Revert "[BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999)"
This reverts commit 02c83f13348631d80aa23f57aaff6b7d1223bbdd.

Reverted https://github.com/pytorch/pytorch/pull/161999 on behalf of https://github.com/jeanschmidt due to break a few internal tests ([comment](https://github.com/pytorch/pytorch/pull/161999#issuecomment-3255381925))
2025-09-04 19:56:48 +00:00
afa6e5604d Revert "[BE] Cleanup stale comments/copy from gemm (#162001)"
This reverts commit b40d9432be44a6b5974ee62e7d19c3c61c5ece37.

Reverted https://github.com/pytorch/pytorch/pull/162001 on behalf of https://github.com/jeanschmidt due to break a few internal tests ([comment](https://github.com/pytorch/pytorch/pull/161999#issuecomment-3255381925))
2025-09-04 19:56:48 +00:00
9e5247f51d Revert "[MPS] enable cat op for sparse (#162007)"
This reverts commit 2c03f0acc53ed13fe8ebfe809129f25996e009a0.

Reverted https://github.com/pytorch/pytorch/pull/162007 on behalf of https://github.com/jeanschmidt due to Breaks internal builds see [D81588372](https://www.internalfb.com/diff/D81588372), @malfet may you help the author? ([comment](https://github.com/pytorch/pytorch/pull/162007#issuecomment-3255357336))
2025-09-04 19:49:44 +00:00
c37103234a Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-04 19:43:17 +00:00
3dde5d7f9b [nativert] triton runtime implementation (#161798)
Summary:
att
Test Plan:
ci
Rollback Plan:

Reviewed By: minjang

Differential Revision: D80828148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161798
Approved by: https://github.com/minjang, https://github.com/SherlockNoMad
2025-09-04 19:00:15 +00:00
1f51056bd6 [BE]: Update cpp-httplib submodule to 0.26.0 (#162181)
Update cpp-httplib with better error handling, bugfixes, and performance. Header only library update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162181
Approved by: https://github.com/jansel
2025-09-04 18:59:32 +00:00
6b1900c22f [dynamo][hops] Remove const outputs from the speculated subgraph (#161355)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161355
Approved by: https://github.com/zou3519
2025-09-04 18:52:01 +00:00
9480cdc0b6 Modified the docs to add example for torch.is_floating_point and torc… (#161951)
…h.is_complex.

The PR proposes adding a simple, self-explanatory example to the documentation page. The example demonstrates the function's output for tensors with various data types, showing both True and False return values.

Fixes #161859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161951
Approved by: https://github.com/zou3519
2025-09-04 18:50:19 +00:00
eqy
6f7608d603 [cuDNN][SDPA] Enable cuDNN SDPA by default for SM 9.0, SM 10.0 (#162073)
for 2.9
🙏

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162073
Approved by: https://github.com/drisspg
2025-09-04 18:46:28 +00:00
d1a15abfdc export: add explicit decomposition for aten.expand_copy and unit test (#161688)
Fixes #161080
torch.export.export fails with TypeError: expand() got an unexpected keyword argument 'implicit' when calling torch.expand_copy(..., implicit=True). This happened because expand_copy = _make_copy_from_view(aten.expand) registers aten.expand as the decomposition path for aten.expand_copy, which doesn’t accept the implicit argument.

I have added an explicit decomposition for aten.expand_copy in torch/_decomp/decompositions.py to ignore the implicit argument, and a simple unit test to demonstrate the bug being fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161688
Approved by: https://github.com/angelayi, https://github.com/can-gaa-hou
2025-09-04 18:16:56 +00:00
33028597bf [dynamo] Make the MRO walk more narrow (#162105)
I don't have a failing test case but just saw an extra guard somewhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162105
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel
2025-09-04 17:54:33 +00:00
9eadb37cdd enable float32 and float16 in torch._grouped_mm fallback (#162059)
Summary:

Enables `torch.float32` and `torch.float16` options in
`torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`,
`mat_b`, and `out_dtype` are `torch.bfloat16`.

Saving for future PRs:
1. enabling testing on more platforms
2. supporting out_dtype != mat_a.dtype
3. opinfo
4. better compile support

Test Plan:

```bash
# on A100 and H100
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
# on H100
pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162059
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #161407, #161717
2025-09-04 17:48:52 +00:00
61fb632cfb move _grouped_mm fallback to composite explicit autograd (#161717)
Summary:

Moves the `torch._grouped_mm` fallback from cuda-only code to a place
where it can be used by multiple backends. Specifically:
1. make the fallback path and util functions reusable and move them to
   `ATen/native/GroupedMMUtils.h`
2. register a backend-agnostic kernel to composite explicit autograd key
3. refactor the grouped_mm tests to their own test case and enable CPU

At the end of this PR, here is the support matrix:
* CUDA SM90+: fast path with test coverage (no change)
* CUDA SM80+: fallback with test coverage (no change)
* CPU: fallback works, but without test coverage (new in this PR)
* other SM versions and other backends: will probably already work, but
  let's leave this to future PRs
* float32/float16: will probably already work, but let's leave this to
  future PRs

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161717
Approved by: https://github.com/ngimel, https://github.com/drisspg
ghstack dependencies: #161407
2025-09-04 17:48:52 +00:00
8a736fa1ea create torch._grouped_mm fallback path with for loops / bmm (#161407)
Summary:

Creates a fallback path for `torch._grouped_mm`, using the naive for
loop implementation (or bmm).

For the sake of keeping the PR small, this PR only enables SM80+ (CUDA
capability 8.0 and up), since I am testing this on an A100 machine. In
future PRs, we can increase the coverage of the fallback to:
1. float32 and float16, which will extend the GPU coverage
2. cpu
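
As a rough illustration of what a naive for-loop fallback looks like, here is a pure-Python sketch of the 2d x 3d case. It assumes that the groups are contiguous row ranges of `mat_a` and that `offs` holds the cumulative end row of each group; `matmul` and `grouped_mm_2d_3d` are hypothetical helper names, not the ATen implementation.

```python
# Pure-Python sketch of a for-loop grouped matmul for the 2d x 3d case.
# Assumption: `offs` holds the cumulative end row of each group in `mat_a`.

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def grouped_mm_2d_3d(mat_a, mat_b, offs):
    """mat_a: (M_total, K) rows; mat_b: list of G matrices of shape (K, N)."""
    out = []
    start = 0
    for group, end in enumerate(offs):
        # Rows [start, end) of mat_a are multiplied by that group's weights.
        out.extend(matmul(mat_a[start:end], mat_b[group]))
        start = end
    return out


mat_a = [[1, 0], [0, 1], [2, 2]]   # 3 rows, K = 2
mat_b = [
    [[1, 2], [3, 4]],              # weights for group 0
    [[10, 0], [0, 10]],            # weights for group 1
]
offs = [2, 3]                      # group 0: rows 0-1, group 1: row 2
result = grouped_mm_2d_3d(mat_a, mat_b, offs)
print(result)
```

Each group is an independent matmul, which is why the fallback can be written as a loop (or as a single bmm when all groups have the same size).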

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_3d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_3d -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161407
Approved by: https://github.com/drisspg, https://github.com/eqy
2025-09-04 17:48:44 +00:00
8bb213b6d5 [SymmMem] Increase signal pad size for NVL72 (#162026)
so that the signal calls do not step on each other's toes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162026
Approved by: https://github.com/ngimel
2025-09-04 17:41:38 +00:00
869cbcc16e [SymmMem] Add a helper API to distinguish intra- and inter- node (#161984)
Added a helper API to tell whether the world is entirely within a P2P domain or crosses the network.
This is mainly for nblocks tuning purposes (in later PRs).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161984
Approved by: https://github.com/ngimel
ghstack dependencies: #161983
2025-09-04 17:37:59 +00:00
0c0e056a9e [CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)
## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG (we call it the **capturing graph**) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe:

This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.

In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.
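
The per-stream rule amounts to a reachability check on the capturing graph. The following is a minimal Python sketch on a toy DAG, not the allocator's C++ code: `edges`, `is_predecessor`, and `reusable_on_stream` are hypothetical names, and a marker that is itself a terminal is treated as satisfying the rule (an assumption, since new work on that stream attaches after it).

```python
# Sketch of the per-stream rule on a toy capture graph.
# `edges` maps a node to its direct successors; all names are hypothetical.

def is_predecessor(edges, src, dst):
    """Return True if `dst` is reachable from `src` (or they are the same node)."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False


def reusable_on_stream(edges, free_markers, terminals):
    """Per-stream rule: every free marker precedes every terminal of the stream."""
    return all(
        is_predecessor(edges, marker, terminal)
        for marker in free_markers
        for terminal in terminals
    )


# Stream 1: a1 -> free1 -> join.  Stream 2: b1 -> free2.
# Stream 1 then waits on stream 2, which adds the edge free2 -> join.
edges = {"a1": ["free1"], "free1": ["join"], "b1": ["free2"], "free2": ["join"]}
markers = ["free1", "free2"]

on_stream_1 = reusable_on_stream(edges, markers, terminals=["join"])
on_stream_2 = reusable_on_stream(edges, markers, terminals=["free2"])
print(on_stream_1, on_stream_2)
```

Stream 1's terminal comes after both markers, so the block could be reused there; stream 2's terminal is not ordered after `free1`, so the check fails for stream 2.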

## Implementation

* On `free(block)` during capture
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
    * If yes, hand the block to S for immediate reuse within the same capture.
    * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join; see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198).

In case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. As a result, we must wait for the subsequent join before the block can be reused.

## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel, https://github.com/eqy

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-04 17:21:26 +00:00
f36f285953 [dynamo] change error_on_graph_break/fullgraph semantics (#161747)
This PR implements the semantics change to `torch._dynamo.error_on_graph_break`:
- ~`torch.compile` now has a new `error_on_graph_break` kwarg that serves as a lower-priority toggle for erroring/continuing on graph breaks~
- `error_on_graph_break` is a new internal `torch.compile` setting that is lower-priority than `fullgraph`. It allows the user to toggle erroring/continuing on graph breaks.
- `error_on_graph_break` does nothing when `fullgraph=True`
- `error_on_graph_break` does NOT guarantee a single graph

Followup [DONE]: need to change the programming model docs to reflect the 3 graph break modes for compilation:
- `fullgraph=True`: enforce one graph, no graph breaks, cannot be toggled
- `fullgraph=False, error_on_graph_break=True`: errors on graph breaks, latter can be toggled during compile time
- `fullgraph=False, error_on_graph_break=False`: resumes tracing on graph breaks, latter can be toggled during compile time
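
The priority between the two settings can be summarized as a small decision table. The sketch below is a plain-Python illustration of the three modes listed above, not dynamo's implementation; `graph_break_behavior` and the returned strings are hypothetical.

```python
def graph_break_behavior(fullgraph: bool, error_on_graph_break: bool) -> str:
    """Toy decision table for the three graph break modes."""
    if fullgraph:
        # fullgraph takes priority; error_on_graph_break does nothing here.
        return "error: one graph enforced"
    if error_on_graph_break:
        # Errors on graph breaks, but a single graph is not guaranteed.
        return "error on graph break"
    return "resume tracing after graph break"


for fullgraph in (True, False):
    for error_on_graph_break in (True, False):
        print(fullgraph, error_on_graph_break,
              graph_break_behavior(fullgraph, error_on_graph_break))
```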

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161747
Approved by: https://github.com/mlazos
ghstack dependencies: #161739
2025-09-04 17:10:17 +00:00
ba7f546ccc Update torch-xpu-ops commit pin (#162062)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@83c5a5](83c5a5a551), includes:

- Revert "Disable xccl timer avoid drlm hang" because XPU time event issue has been fixed
- Fallback lu_factor kernel to CPU for single batch
- Enable aten::linalg_inv and aten::linalg_inv_ex on XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162062
Approved by: https://github.com/EikanWang
2025-09-04 17:05:33 +00:00
43b7c86a2c Add dependency-groups.dev to pyproject.toml (#161216)
[PEP 735](https://peps.python.org/pep-0735) introduces the
[dependency-groups] table for a number of use-cases one of
which includes specifying development dependencies for projects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161216
Approved by: https://github.com/seemethere
2025-09-04 16:51:36 +00:00
019fed39aa [ROCm] [CK] Composable Kernel integration for inductor backend (#158747)
This is part of our effort to integrate the Composable Kernel library into the Inductor backend. Currently we have a submodule, but would prefer to have commit pin control over the library as with Triton. We intentionally avoid putting all installation logic in CI scripts to allow locally built versions to have this functionality.

The idea is to have CK as a pytorch dependency in pytorch 2.9 release to allow people to use it with inductor and AOT inductor and then gradually step away from submodule usage. Right now CK usage in SDPA/Gemm is tied to submodule files.

This PR is a remake of https://github.com/pytorch/pytorch/pull/156192, needed because of a branch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158747
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-04 16:51:06 +00:00
81aeefa657 Add torch.compile support for triton.constexpr_function (#162106)
Fixes #161868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162106
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-09-04 16:46:55 +00:00
248355faf5 Don't require FakeStore to be passed into fake backend (#162164)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162164
Approved by: https://github.com/bdhirsh, https://github.com/albanD, https://github.com/wconstab
2025-09-04 16:43:49 +00:00
1ebd70d0c0 Fix usage of forwarding references (#161094)
I found a number of places that seem to want forwarding
references but the type signature does not reflect that

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161094
Approved by: https://github.com/malfet
2025-09-04 16:34:39 +00:00
cc5bdd1240 Keep default CMAKE_PREFIX_PATH in test_aot_inductor_package (#161907)
`CMAKE_PREFIX_PATH` is a list of paths used to find dependencies. The test overwrites that with a single path, so dependencies such as protobuf or Abseil are not found.

Instead prepend the path to the existing value.
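
A minimal sketch of the "prepend instead of overwrite" idea, assuming the value is passed as an environment variable with a platform path separator. The helper name `prepend_prefix_path` and the example paths are hypothetical; the test itself may build the value differently.

```python
import os


def prepend_prefix_path(env, new_path, sep=os.pathsep):
    """Return a copy of `env` with `new_path` placed before any existing value."""
    updated = dict(env)
    existing = updated.get("CMAKE_PREFIX_PATH", "")
    # Skip empty parts so an unset variable does not leave a dangling separator.
    updated["CMAKE_PREFIX_PATH"] = sep.join(p for p in (new_path, existing) if p)
    return updated


# Hypothetical paths, for illustration only.
env = {"CMAKE_PREFIX_PATH": "/opt/protobuf:/opt/abseil"}
patched = prepend_prefix_path(env, "/path/to/torch/share/cmake", sep=":")
print(patched["CMAKE_PREFIX_PATH"])
```

The existing entries stay in place, so dependencies that were found before are still found.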

This fixes a test failure:
> pytorch-v2.7.1/test/inductor/test_aot_inductor_package.py", line 242, in test_compile_after_package
>    self.assertTrue(so_path.exists())
> AssertionError: False is not true

Caused by:
```
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::utility: No such file or directory
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::variant: No such file or directory
collect2: error: ld returned 1 exit status
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161907
Approved by: https://github.com/Skylion007
2025-09-04 16:27:57 +00:00
3a20a20e70 Fix largeTensorTest malfunction on XPU (#161988)
# Motivation
https://github.com/pytorch/pytorch/pull/143553/files#diff-6492991193449e118ff0c8d42ca544cc38a73604e505ff246a3c711aeab91748R1345 makes `largeTensorTest` malfunction on XPU. This PR aims to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161988
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-09-04 16:10:03 +00:00
6b8b3ac440 Revert "[ROCm] Use MI325 (gfx942) runners for binary smoke testing (#162044)"
This reverts commit cd529b686d54bbaa443f5b310140de48422d96c7.

Reverted https://github.com/pytorch/pytorch/pull/162044 on behalf of https://github.com/jeffdaily due to mi200 backlog is purged, and mi300 runners are failing in GHA download ([comment](https://github.com/pytorch/pytorch/pull/162044#issuecomment-3254427869))
2025-09-04 16:06:30 +00:00
601ae8e483 [CUDAGraph] add config to error on skipping cudagraph (#161862)
Many users want a config that forces all CUDA ops to be captured by cudagraph. When that is not possible, pt2 should error.

This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default: False). It also adds an environment variable, `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR`, to control it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161862
Approved by: https://github.com/ezyang, https://github.com/mlazos
2025-09-04 15:52:39 +00:00
b7dad7dd49 Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit 90b08643c3a6eb1f3265b7d1388bd76660759f46.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Already discussed with @ezyang about the internal quirks and errors ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3254219358))
2025-09-04 15:25:07 +00:00
e532c9d4f1 Relax tolerance for test_quick_baddbmm_cpu_complex64 (#152424)
On Zen 2 (AMD EPYC) and Intel Sapphire Rapids this fails with small differences when compiled with native targeted optimizations. I.e. it fails with `-march=znver2` but succeeds with `-march=znver1`.

I assume some operator fusing is being used by GCC. Small differences like using `vmovdqa` can be seen in the minimized code of the baddbmm kernel: https://godbolt.org/z/jsxMa91Wb

The greatest differences are consistent and the same on both CPU architectures:
```
Greatest absolute difference: 3.43852152582258e-05 at index (1, 2, 1) (up to 1e-05 allowed)
Greatest relative difference: 3.6034286949870875e-06 at index (1, 2, 1) (up to 1.3e-06 allowed)
```

Hence I assume this is within the expected tolerances, especially as `complex128` and all other types pass.
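
To see how far outside the old tolerance the reported mismatch is, the numbers from the message above can be plugged into the usual closeness criterion. This sketch assumes the check is `abs_diff <= atol + rtol * |expected|` and that the reported relative difference is `abs_diff / |expected|`; both differences are reported at the same index, so the expected magnitude can be recovered from them.

```python
# Numbers copied from the failure message above.
abs_diff = 3.43852152582258e-05
rel_diff = 3.6034286949870875e-06
atol, rtol = 1e-05, 1.3e-06

# Assumption: rel_diff = abs_diff / |expected|, so |expected| can be recovered.
expected_magnitude = abs_diff / rel_diff

# Assumption: the comparison passes when abs_diff <= atol + rtol * |expected|.
allowed = atol + rtol * expected_magnitude

print(f"|expected| ~ {expected_magnitude:.4f}")
print(f"allowed    ~ {allowed:.3e}")
print(f"passes     = {abs_diff <= allowed}")
```

Under these assumptions the observed difference is roughly 1.5 times the allowed bound, which is consistent with a small relaxation being enough.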
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152424
Approved by: https://github.com/malfet
2025-09-04 13:26:42 +00:00
34aa78274d Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit 4ae57d448c0a7d37e4cfd5c27d977fad2cef4051.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Failing internal tests, probably typechecks. See D81588399 ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3253651785))
2025-09-04 13:13:52 +00:00
040d00af04 [2/N]Port several test files under test/distributed to Intel GPU (#159473)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR works on some test files under test/distributed. We enable Intel GPU with the following methods while trying our best to keep the original code style:

- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use requires_accelerator_dist_backend to allow both nccl and xccl test
- enabled XPU for some test path
- Change the hardcoded world_size according to device_count.
- Unify some common code under torch/testing/_internal for multiple backend, for example:
  Added xpu for Backend.backend_capability and dist.Backend.register_backend()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-04 12:53:17 +00:00
9c957723a0 Replace setup.py develop with pip install -e (#156710)
#156027 already replaced most uses of `python setup.py develop`. This PR only adds a few more occurrences.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156710
Approved by: https://github.com/atalman
2025-09-04 11:07:44 +00:00
acece97c3a [Intel GPU] Upgrade OneDNN XPU Tag to v3.9.1 (#161932)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161932
Approved by: https://github.com/EikanWang, https://github.com/Skylion007, https://github.com/guangyey
2025-09-04 11:05:10 +00:00
ea1883dfd3 Fixes #154982: add missing to_result_dtype in vector_norm (#155111)
Fixes #154982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155111
Approved by: https://github.com/isuruf, https://github.com/eellison
2025-09-04 10:49:08 +00:00
d67c29ad22 [inductor] Fix int64 from MutationOutput Buffer (#162020)
Summary:
When we have a user defined triton kernel, it marks the mutated outputs as `MutationOutput` with a NoneLayout. This MutationOutput may later be used as input to another inductor-generated triton kernel.

When we determine whether to use int32 or int64 for the inductor generated triton kernel, we need to look at the number of elements for all buffers involved. If one of the buffers is a MutationOutput, we should still consider its number of elements, instead of skipping it.

To get a hint on the MutationOutput size, we look at the buffers corresponding to `mutation_names` in MutationOutput.
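
The selection logic can be sketched as follows. This is a simplified plain-Python model, not inductor code: buffers are dictionaries, `buffer_numel` and `index_dtype` are hypothetical helpers, and the int32 limit of `2**31 - 1` elements is an assumption made for illustration.

```python
INT32_MAX = 2**31 - 1  # assumed threshold for illustration


def buffer_numel(buf, buffers_by_name):
    """Element count of a buffer; a mutation output borrows it from the buffers it mutates."""
    if buf.get("kind") == "mutation_output":
        return max(buffers_by_name[name]["numel"] for name in buf["mutation_names"])
    return buf["numel"]


def index_dtype(buffers, buffers_by_name):
    """Pick int64 as soon as any involved buffer is too large for int32 indexing."""
    largest = max(buffer_numel(b, buffers_by_name) for b in buffers)
    return "int64" if largest > INT32_MAX else "int32"


buffers_by_name = {
    "big": {"numel": 2**31 + 8},
    "small": {"numel": 1024},
}
mutation = {"kind": "mutation_output", "mutation_names": ["big"]}

with_mutation = index_dtype([buffers_by_name["small"], mutation], buffers_by_name)
without_mutation = index_dtype([buffers_by_name["small"]], buffers_by_name)
print(with_mutation, without_mutation)
```

Skipping the mutation output would leave only the small buffer in view and select int32, which matches the bug described in the summary.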

Test Plan:
```
buck run mode/opt  fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_autotune_int64_user_defined_triton_kernel
```

Differential Revision: D81530083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162020
Approved by: https://github.com/davidberard98, https://github.com/eellison
2025-09-04 09:47:57 +00:00
09587daf8c Adding missing example of torch.full_like Issue#161899 (#162051)
Fixes #161899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162051
Approved by: https://github.com/zou3519
2025-09-04 08:45:49 +00:00
c024b1f5a1 [AMD] [Reland] Fix AMD User Defined Kernel Autotune (#161521)
Summary: This is a reland of D80285441, fixed the unit test.

Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR

```
will succeed after this diff.

Rollback Plan:

Differential Revision: D80971224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161521
Approved by: https://github.com/frank-wei
2025-09-04 08:41:18 +00:00
8fd3c9ce91 Optimize AMP custom_backend_name error message (#162037)
Print the AMP target dtype in the warning, making it easier for custom backends to work out the expected dtype during integration.

## Test Result

### Before
```python
In [1]: import torch
   ...: import torch_openreg
   ...:
   ...: a = torch.randn(3, 4)
   ...: b = torch.randn(4, 2)
   ...: with torch.autocast("openreg", dtype=torch.float16):
   ...:     torch.mm(a, b)
   ...:
/home/coder/code/pytorch/torch/amp/autocast_mode.py:332: UserWarning: In openreg autocast, but the target dtype is not supported. Disabling autocast.
 openreg Autocast only supports dtypes of torch.float32 currently.
  warnings.warn(error_message
```

### After
```python
In [1]: import torch
   ...: import torch_openreg
   ...:
   ...: a = torch.randn(3, 4)
   ...: b = torch.randn(4, 2)
   ...: with torch.autocast("openreg", dtype=torch.float16):
   ...:     torch.mm(a, b)
   ...:

/home/coder/code/pytorch/torch/amp/autocast_mode.py:332: UserWarning: In openreg autocast, but the target dtype torch.float16 is not supported. Disabling autocast.
 openreg Autocast only supports dtypes of torch.float32 currently.
  warnings.warn(error_message)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162037
Approved by: https://github.com/zou3519
2025-09-04 08:27:56 +00:00
e19e02c84c port distributed tensor test files for Intel GPU (#161604)
In this PR, we port the test/distributed/tensor test files for Intel GPU.
We enable Intel GPU with the following methods while keeping the original code style as much as possible:

- Use torch.accelerator for general GPU
- Skip cases with known issues when running on XPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161604
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-04 07:49:25 +00:00
69a25f6888 [ROCm] Enable USE_FBGEMM_GENAI (#160676)
Summary:
X-link: https://github.com/pytorch/FBGEMM/pull/4703

X-link: https://github.com/facebookresearch/FBGEMM/pull/1728

In this diff we enable support for the new FBGEMM-backed FP8 _scaled_grouped_mm on ROCm. For now we only enable support for `gfx942`, as that is the architecture on which we have thoroughly tested performance and correctness.

Rollback Plan:

Differential Revision: D79564024

Test Plan:

Ensure builds with:
- `USE_FBGEMM_GENAI=1` and without gfx942
- `USE_FBGEMM_GENAI=1` and with gfx942
- `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](9491d289b3/.ci/docker/libtorch/build.sh (L48))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160676
Approved by: https://github.com/drisspg
2025-09-04 07:13:17 +00:00
890626632d [DLPACK] Optimize toDLPack Conversion Speed (#162111)
Previously, in gh-83069, the toDLPack converter introduced a normalization step that changes the strides to 1 when shape[i] == 1.

This step, however, calls as_strided during toDLPack and can slow the conversion down by about 3x, raising the overhead of PyTorch's DLPack conversion from under 0.2 us to around 0.6 us per call.

This PR updates the logic by adding a need_normalize_strides check that first confirms whether stride normalization is necessary. In most common cases, when the tensor is contiguous, such normalization is not necessary.

We confirmed that this additional check brings toDLPack back below 0.2 us, which can significantly speed up eager-mode integration of DLPack with PyTorch.

If we detect that normalization is needed, the older path is invoked.
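
The check-then-normalize idea can be sketched in plain Python on shape and stride tuples. This is a sketch under assumptions: the predicate below is inferred from the description ("strides to 1 when shape[i] == 1"), `normalize_strides` is a hypothetical helper, and the real converter is C++ operating on tensors.

```python
def need_normalize_strides(shape, strides):
    """True if any size-1 dimension carries a stride other than 1 (assumed rule)."""
    return any(size == 1 and stride != 1 for size, stride in zip(shape, strides))


def normalize_strides(shape, strides):
    """Return strides with size-1 dimensions forced to stride 1."""
    if not need_normalize_strides(shape, strides):
        # Fast path: nothing to rewrite, so no extra view has to be created.
        return tuple(strides)
    # Slow path: corresponds to the older behaviour that always rewrote strides.
    return tuple(1 if size == 1 else stride for size, stride in zip(shape, strides))


print(normalize_strides((2, 3), (3, 1)))
print(normalize_strides((1, 3), (7, 1)))
```

The cheap predicate runs on every call, while the expensive rewrite only runs for the inputs that actually need it.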

Fixes #162113
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162111
Approved by: https://github.com/msaroufim
2025-09-04 05:27:05 +00:00
480c739112 Capture TypeError in CONTAINS_OP (#161069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161069
Approved by: https://github.com/anijain2305
2025-09-04 04:49:09 +00:00
66f3b4a682 Contiguous subgraph decomposition (#161241)
## Summary

Adds a subgraph decomposition for addmm and mm that performs well when `K` is large compared to `M` and `N`, and that works well as an alternative to `split-k` on AMD (transposed only), since `split-k` does not currently support AMD.

## Background

On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower.
For example:
```
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176]))
  ))
```
is a lot slower than:
```
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1]))
  ))
```
This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels.
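
To make the difference between the two layouts above concrete, here is a small pure-Python sketch that checks row-major contiguity from `size` and `stride` alone. The helper names are hypothetical and this is not Inductor code.

```python
def is_row_major_contiguous(size, stride):
    """Check whether strides describe a dense row-major layout."""
    expected = 1
    for dim, st in zip(reversed(size), reversed(stride)):
        # Size-1 dimensions never affect addressing, so their stride is ignored.
        if dim != 1 and st != expected:
            return False
        expected *= dim
    return True


def contiguous_stride(size):
    """Strides a contiguous copy of a tensor with this size would have."""
    stride, running = [], 1
    for dim in reversed(size):
        stride.append(running)
        running *= dim
    return list(reversed(stride))


size_b = [178176, 6144]
print(is_row_major_contiguous(size_b, [1, 178176]))  # transposed B from the slow case
print(is_row_major_contiguous(size_b, [6144, 1]))    # contiguous B from the fast case
print(contiguous_stride(size_b))
```

The decomposition effectively trades one extra copy of B (into the layout returned by `contiguous_stride`) for a faster matmul, which is why it has to be benchmarked against the normal kernels rather than applied unconditionally.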

## Data

I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups:
```
Parsed 420 unique shapes from benchmark output
addmm improvements when best:
  addmm_16448x512x2048: +0.14%
  addmm_128x2048x2048: +0.01%
  addmm_128x768x1000: +0.75%
  addmm_12672x3072x768: +1.08%
  addmm_512x768x32000: +0.62%
  addmm_12608x384x384: +0.00%
  addmm_4160x1024x4096: +0.90%
  addmm_16x768x2: +0.56%
  addmm_12608x3072x768: +0.09%
  addmm_64x4096x1000: +2.77%
  addmm_256x1024x512: +1.99%
  addmm_30x256x256: +1.12%
  addmm_100480x128x384: +0.91%
  addmm_6400x2048x512: +0.25%
  addmm_61568x1024x256: +0.08%
  addmm_1x768x768: +0.93%
  addmm_12544x384x384: +0.19%
  addmm_128x512x1000: +0.77%
  addmm_2048x128x128: +1.32%
  addmm_128x3072x1000: +0.24%
  addmm_7936x512x2048: +0.07%
  addmm_8192x512x2048: +0.33%
  addmm_64x1024x1000: +1.43%
  addmm_128x2304x1000: +0.01%
  addmm_32768x256x2: +0.75%
  addmm_64x384x1152: +0.79%
  addmm_64x640x1000: +0.01%
  addmm_100480x128x128: +0.87%
  addmm_1152x3072x768: +1.13%
  addmm_8192x256x2048: +1.40%
  addmm_4096x128x768: +0.01%
  addmm_128x2560x1000: +0.01%
  addmm_12544x2048x512: +0.43%
  addmm_200704x24x96: +0.14%
  addmm_8448x512x2048: +0.96%
  addmm_50176x256x1024: +0.62%
  addmm_4160x4096x1024: +0.22%
  addmm_4096x768x768: +0.32%
  addmm_220x2048x512: +0.56%
  addmm_8x2048x1000: +1.12%
  addmm_256x197951x512: +26.99%
  addmm_401536x64x192: +0.60%
  addmm_2040x2048x512: +0.47%
  addmm_512x1024x256: +1.32%
  addmm_128x4096x1000: +1.67%
  addmm_12672x768x768: +0.34%
  addmm_128x368x1000: +0.77%
  addmm_96x1280x1000: +0.01%
  addmm_12544x512x2048: +0.41%
  addmm_6272x320x1280: +0.76%
  addmm_12544x3072x768: +0.09%
  addmm_64x384x1000: +0.39%
mm improvements when best:
  mm_200704x128x512: +1.29%
  mm_663552x16x16: +0.80%
  mm_4096x768x768: +0.51%
  mm_131072x64x31: +0.24%
  mm_12544x1152x384: +0.11%
  mm_128x2048x2: +0.46%
  mm_262144x16x23: +0.62%
  mm_50176x576x192: +0.37%
  mm_131072x16x31: +0.26%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================

Operation: addmm
----------------------------------------
Total shapes analyzed: 247
Average Subgraph placement: 3.38
Median Subgraph placement: 2.0
Subgraph is best choice: 52/247 shapes (21.1%)
Average improvement when best: 1.15%
Median improvement when best: 0.58%
Largest improvement when best: +26.99%

Operation: bmm
----------------------------------------
Total shapes analyzed: 85
Average Subgraph placement: 24.00
Median Subgraph placement: 21.0
Subgraph is best choice: 0/85 shapes (0.0%)
Average improvement when best: N/A (never best)
Median improvement when best: N/A (never best)
Largest improvement when best: N/A (never best)

Operation: mm
----------------------------------------
Total shapes analyzed: 88
Average Subgraph placement: 15.08
Median Subgraph placement: 4.0
Subgraph is best choice: 9/88 shapes (10.2%)
Average improvement when best: 0.52%
Median improvement when best: 0.46%
Largest improvement when best: +1.29%

```

## Results

The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune:
```
addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436
addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702
addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834
addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105
...
```
Compared to the non-transposed autotune:
```
addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421
addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246
addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547
addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895
addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916
```
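
The +26.99% listed for `addmm_256x197951x512` is consistent with comparing the subgraph time against the next-best choice in the listing above. The formula below is an inference from those two numbers, not something the benchmark script is confirmed to use:

```python
# Timings copied from the autotune listing above (units as reported there).
subgraph_time = 0.5024129748344421
extern_time = 0.6881489753723145

# Assumed definition: relative time saved versus the best non-subgraph choice.
improvement = (extern_time - subgraph_time) / extern_time * 100
print(f"{improvement:.2f}%")
```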

It seems to perform really well for high values of `K` vs `N` and `M`.
Testing this hypothesis with some custom shapes:
```
Parsed 64 unique shapes from benchmark output
addmm improvements when best:
  addmm_128x16384x128: +0.18%
  addmm_128x262144x256: +38.24%
  addmm_128x200000x512: +14.76%
  addmm_256x800000x128: +0.06%
  addmm_131072x128x256: +0.27%
  addmm_128x256x131072: +0.25%
  addmm_2048x200000x64: +12.45%
mm improvements when best:
  mm_128x16384x128: +0.18%
  mm_128x262144x256: +38.05%
  mm_128x200000x512: +9.47%
  mm_256x800000x128: +0.99%
  mm_512x6400000x256: +3.17%
  mm_524288x64x64: +0.29%
  mm_2048x200000x64: +11.19%
  mm_8192x1000000x256: +34.14%
  mm_128x4096x100000: +0.40%
  mm_128x3072x150000: +0.27%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================

Operation: addmm
----------------------------------------
Total shapes analyzed: 33
Average Subgraph placement: 4.39
Median Subgraph placement: 2.0
Subgraph is best choice: 7/33 shapes (21.2%)
Average improvement when best: 9.46%
Median improvement when best: 0.27%
Largest improvement when best: +38.24%

Operation: mm
----------------------------------------
Total shapes analyzed: 30
Average Subgraph placement: 7.63
Median Subgraph placement: 2.0
Subgraph is best choice: 10/30 shapes (33.3%)
Average improvement when best: 9.81%
Median improvement when best: 2.08%
Largest improvement when best: +38.05%

```
## Conclusion
Contiguous Subgraph Decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and gives a very large improvement on low `M`, low `N`, and high `K` shapes.

Data gathering scripts:
https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866

## Test Plan:
New unit tests.

Differential Revision: D80771648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161241
Approved by: https://github.com/eellison
2025-09-04 04:43:58 +00:00
302df2ac5d [vllm hash update] update the pinned vllm hash (#162115)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162115
Approved by: https://github.com/pytorchbot
2025-09-04 04:26:34 +00:00
dec72ea4b0 [reland] Add inductor provenance mapping for cpp extern kernel (#161656) (#162069)
Summary:

Add inductor provenance mapping for cpp extern kernel

Test Plan:
```
buck run fbcode//caffe2/test/inductor:provenance_tracing --  -r test_cpu_extern_kernel
```

Differential Revision: D81598857

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162069
Approved by: https://github.com/angelayi
2025-09-04 04:18:43 +00:00
8975cda252 [pt] strip error messages in profile builds (#162076)
Summary: Profile builds should match production builds, and error messages result in large static initializers running. Omit them for profile builds too.

Test Plan:
Before:
```
$ buck build //xplat/caffe2:aten_native_cpuApple -c user.sandcastle_build_mode=profile --show-output
$ llvm-nm buck-out/v2/gen/fbsource/31fc3668aa0b4012/xplat/caffe2/__aten_native_cpuApple__/libaten_native_cpuApple.pic.a | grep ZN3c106detail12_str_wrapperIJPKcRKiS3_RKxS3_RKS3_S3_EE4callES9_S5_S9_S7_S9_S9_S9
0000000000003234 T __ZN3c106detail12_str_wrapperIJPKcRKiS3_RKxS3_RKS3_S3_EE4callES9_S5_S9_S7_S9_S9_S9_
```

After:
```
$ buck build //xplat/caffe2:aten_native_cpuApple -c user.sandcastle_build_mode=profile --show-output
$ llvm-nm buck-out/v2/gen/fbsource/31fc3668aa0b4012/xplat/caffe2/__aten_native_cpuApple__/libaten_native_cpuApple.pic.a | grep ZN3c106detail12_str_wrapperIJPKcRKiS3_RKxS3_RKS3_S3_EE4callES9_S5_S9_S7_S9_S9_S9
```

Rollback Plan:

Reviewed By: yury-dymov, abashyam

Differential Revision: D81599582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162076
Approved by: https://github.com/swolchok
2025-09-04 04:18:27 +00:00
d636c181f9 Fix range.__getitem__() (#161804)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161804
Approved by: https://github.com/anijain2305
ghstack dependencies: #161801, #161802, #161803
2025-09-04 02:33:03 +00:00
c8255c67cd redirect iter(range) to range.__iter__() (#161803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161803
Approved by: https://github.com/anijain2305
ghstack dependencies: #161801, #161802
2025-09-04 02:33:03 +00:00
485a7bd82e Add range_count and range.__contains__ (#161802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161802
Approved by: https://github.com/anijain2305
ghstack dependencies: #161801
2025-09-04 02:33:03 +00:00
1ef7efa592 Add range_equals (#161801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161801
Approved by: https://github.com/anijain2305
2025-09-04 02:33:03 +00:00
57278d45f0 [Quant][Inductor][CPU] add qconv int8-mixed-bf16 patterns (#161487)
Summary:
Expand the patterns supported by qconv weight prepack. Specifically, expand the conv patterns of the int8-mixed-bf16 datatype to support the following two cases:
Case 1:
the `out_dtype` of `dequantize_per_tensor` is `torch.float32`

```
    dq_per_tensor  dq_per_channel
         |               |
    to_bf16           to_bf16
            \          /
             Conv2d
```

Case 2:
the `out_dtype` of `dequantize_per_tensor` is `torch.bfloat16`

```
    dq_per_tensor  dq_per_channel
         \               |
                      to_bf16
                       /
             Conv2d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161487
Approved by: https://github.com/Xia-Weiwen, https://github.com/CaoE, https://github.com/jansel
ghstack dependencies: #161486
2025-09-04 02:01:34 +00:00
cec0ff1228 [Quant][Inductor][CPU] add qlinear int8-mixed-bf16 patterns (#161486)
Summary:
Expand the patterns supported by qlinear weight prepack. Specifically, expand the linear patterns of the int8-mixed-bf16 datatype to support the following two cases:
Case 1:
the `out_dtype` of `dequantize_per_tensor` is `torch.float32`

    dq_per_tensor  dq_per_channel
         |               |
    to_bf16           to_bf16
         |               |
     OPT(reshape)     permute
            \          /
             addmm/mm
                    |
           OPT(reshape)

or

    dq_per_tensor  dq_per_channel
         |               |
    to_bf16           to_bf16
         |               |
       expand         permute
          \              |
                      expand
                       /
               bmm
                |
            OPT(add)

Case 2:
the `out_dtype` of `dequantize_per_tensor` is `torch.bfloat16`

    dq_per_tensor  dq_per_channel
         |               |
                       to_bf16
                         |
     OPT(reshape)   permute
            \          /
             addmm/mm
                    |
           OPT(reshape)

or

    dq_per_tensor  dq_per_channel
         |                |
                        to_bf16
                          |
       expand          permute
          \               |
                        expand
                        /
               bmm
                |
            OPT(add)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161486
Approved by: https://github.com/Xia-Weiwen, https://github.com/jansel
2025-09-04 01:53:02 +00:00
65985937d9 expose number of outputs in native runtime for unified runtime (#161723)
This counts only user outputs, which is what we want. I spoke to @zhxchen17, though, and it seems nativeRT might have some bugs in propagating updates for things like input mutation or buffer mutation. Something to take a look at in a follow-up.

Also, I have no idea where the nativeRT tests are. Any pointers, @zhxchen17 @SherlockNoMad?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161723
Approved by: https://github.com/zhxchen17
2025-09-04 01:20:31 +00:00
fbf3d2027d use sym_or instead of any to avoid dde in calc_conv_nd_return_shape (#162084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162084
Approved by: https://github.com/aorenste

Co-authored-by: Aaron Orenstein <aorenste@fb.com>
2025-09-04 01:20:22 +00:00
8678d831c4 [dynamo] rename set_fullgraph to error_on_graph_break (#161739)
Renaming `set_fullgraph` to `error_on_graph_break` for now. There are no semantic differences yet. In a followup PR, we will introduce a new `torch.compile` option `error_on_graph_break` that has lower priority than `fullgraph` so that `fullgraph` really returns 1 graph.

I could keep `set_fullgraph` as a deprecated alias for `error_on_graph_break` for now, but I'm hoping that won't be necessary since it's still private API (there are no internal callsites yet, and there are no significant OSS callsites yet).

 cc @albanD @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @Lucaskabela @mlazos @guilhermeleobas @xmfan as primary users for `set_fullgraph`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161739
Approved by: https://github.com/xmfan, https://github.com/Lucaskabela, https://github.com/anijain2305, https://github.com/mlazos
2025-09-04 01:15:06 +00:00
1281470155 [DCP][HuggingFace] Add Support for dequantization of SafeTensors checkpoints (#160682)
This PR introduces the QuantizedHuggingFaceReader component, which enables reading and dequantizing the quantized tensors in a SafeTensors checkpoint. The following capabilities are introduced:
- Configuration of the target DType and the block size.
- Multi-threaded dequantization for efficiency.
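
As a rough illustration of block-wise, multi-threaded dequantization, here is a pure-Python sketch. It is not the QuantizedHuggingFaceReader API: `dequantize`, `dequantize_block`, the flat value list, and the one-scale-per-block scheme are all assumptions made for this example, and target dtype handling is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def dequantize_block(values, scale):
    """Dequantize one block by multiplying with its scale."""
    return [v * scale for v in values]


def dequantize(values, scales, block_size, max_workers=4):
    """Split values into blocks of block_size and dequantize them in parallel."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    if len(blocks) != len(scales):
        raise ValueError("expected exactly one scale per block")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so blocks are reassembled correctly.
        parts = pool.map(dequantize_block, blocks, scales)
        return [x for part in parts for x in part]


print(dequantize([1, 2, 3, 4, 5], [0.5, 2.0, 10.0], block_size=2))
```

A real reader would operate on tensors rather than Python lists; the sketch only shows how the block size determines which scale applies to which values and how blocks can be processed independently.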

Test Plan:
buck test //caffe2/test/distributed/checkpoint\:test_quantized_hf_storage
```
Time elapsed: 2:34.1s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D80174674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160682
Approved by: https://github.com/ankitageorge
2025-09-04 01:09:53 +00:00
9458d1ac3b [inductor] pdl inductor option (disabled by default) (#160928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160928
Approved by: https://github.com/eellison
2025-09-04 00:35:23 +00:00
3c45af079a kill allow_complex_guards_as_runtime_asserts (#161794)
Summary:
[reland]
Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept).

Test Plan:
updated tests

Rollback Plan:

Differential Revision: D81334984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161794
Approved by: https://github.com/zhxchen17
2025-09-04 00:17:01 +00:00
aad96a2022 Revert "Contiguous subgraph decomposition (#161241)"
This reverts commit d64718503728001a1e78168fd7f2d4ff23e57285.

Reverted https://github.com/pytorch/pytorch/pull/161241 on behalf of https://github.com/jeffdaily due to breaks rocm mi300 tests ([comment](https://github.com/pytorch/pytorch/pull/161241#issuecomment-3251185098))
2025-09-04 00:14:22 +00:00
5f3cbc9442 fixed typo error (#162055)
Fixes #162054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162055
Approved by: https://github.com/RajeshvShiyal, https://github.com/malfet
2025-09-04 00:06:58 +00:00
a918bbad6a [inductor] fix test output path 2 (#162085)
Fix test_output_path_2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162085
Approved by: https://github.com/angelayi, https://github.com/jansel
2025-09-04 00:03:47 +00:00
8ec551bb35 [aot-compile] strip internal tracebacks for non-verbose graph breaks + include user file/lineno (#162005)
pytest test/dynamo/test_aot_compile.py -k test_aot_compile_graph_break_error_fmt

before
```
Traceback (most recent call last):
  File "/data/users/$USER/vllm-tests/graph-break.py", line 15, in <module>
    aot_compiled_fn = compiled.aot_compile((example_inputs, {}))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 717, in aot_compile
    return aot_compile_fullgraph(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/aot_compile.py", line 132, in aot_compile_fullgraph
    capture_output = convert_frame.fullgraph_capture(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 947, in fullgraph_capture
    dynamo_output = compile_frame(
                    ^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 1020, in compile_frame
    bytecode, tracer_output = transform_code_object(code, transform)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/bytecode_transformation.py", line 1592, in transform_code_object
    tracer_output = transformations(instructions, code_options)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 992, in transform
    tracer_output = trace_frame(
                    ^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 312, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 821, in trace_frame
    run_tracer()
  File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 803, in run_tracer
    tracer.run()
  File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1472, in run
    while self.step():
          ^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1342, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 902, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 3364, in CALL
    self._call(inst)
  File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 3358, in _call
    self.call_function(fn, args, kwargs)
  File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1260, in call_function
    self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward
    return getattr(self.realize(), name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/variables/functions.py", line 1513, in call_function
    unimplemented_v2(
  File "/data/users/$USER/pytorch/torch/_dynamo/exc.py", line 596, in unimplemented_v2
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
  Explanation: User-inserted graph break. Message: None
  Hint: Remove the `torch._dynamo.graph_break()` call.

  Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html
```
after
```
Traceback (most recent call last):
  File "/data/users/$USER/vllm-tests/graph-break.py", line 15, in <module>
    aot_compiled_fn = compiled.aot_compile((example_inputs, {}))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 737, in aot_compile
    raise e.with_traceback(None) from e.__cause__  # User compiler error
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
  Explanation: User-inserted graph break. Message: None
  Hint: Remove the `torch._dynamo.graph_break()` call.

  Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html

from user code:
   File "/data/users/$USER/vllm-tests/graph-break.py", line 5, in foo
    torch._dynamo.graph_break()

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
consistent w/ std torch.compile
```
Traceback (most recent call last):
  File "/data/users/$USER/vllm-tests/graph-break.py", line 16, in <module>
    res = compiled(*example_inputs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 850, in compile_wrapper
    raise e.with_traceback(None) from e.__cause__  # User compiler error
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
  Explanation: User-inserted graph break. Message: None
  Hint: Remove the `torch._dynamo.graph_break()` call.

  Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html

from user code:
   File "/data/users/$USER/vllm-tests/graph-break.py", line 5, in foo
    torch._dynamo.graph_break()

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
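
The shortened tracebacks come from the `raise e.with_traceback(None) from e.__cause__` line visible in the output above. A minimal sketch of that idiom in plain Python follows; `Unsupported`, the `_internal_*` functions, `compile_entry`, and the `verbose` flag are hypothetical stand-ins here, not Dynamo code.

```python
import traceback


class Unsupported(Exception):
    """Stand-in for a compiler error caused by user code."""


def _internal_step():
    raise Unsupported("graph break")


def _internal_driver():
    _internal_step()


def compile_entry(verbose=False):
    try:
        _internal_driver()
    except Unsupported as e:
        if verbose:
            raise  # keep the full internal stack
        # Drop the frames collected so far; the traceback restarts at this raise.
        raise e.with_traceback(None) from e.__cause__


def frames_seen(verbose):
    """Names of the functions that appear in the traceback the caller gets."""
    try:
        compile_entry(verbose)
    except Unsupported as e:
        return [frame.name for frame in traceback.extract_tb(e.__traceback__)]


print(frames_seen(verbose=True))
print(frames_seen(verbose=False))
```

With `verbose=False` the internal frames no longer appear, which mirrors how the `after` output ends at the `aot_compile` entry point instead of walking through `symbolic_convert.py`.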

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162005
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2025-09-03 23:19:47 +00:00
36d207fcaa [CI] viable strict upgrade: Explicitly name which linux binary wheels should block (#162100)
Reason:
rocm binary builds should not block viable strict upgrade.  It is queuing/canceled so viable strict is 1.2 days old

Tested by mangling the workflow file to get to the actual call of the python script `python ../test-infra/tools/scripts/fetch_latest_green_commit.py --required-checks '["pull", "trunk", "lint", "^linux-binary-manywheel$", "^linux-binary-libtorch-release$", "linux-aarch64"]' --viable-strict-branch viable/strict --main-branch master`, which I then ran locally where I have credentials.  It returned d64718503728001a1e78168fd7f2d4ff23e57285 which is green.  Without this change, it returns 5e5870e858f60ff4bf87d03f3592097e934a9580, which is pretty old
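
The effect of anchoring the check names with `^` and `$` can be sketched as follows. This assumes the script treats each required check as a regular expression searched against the workflow name, which is not confirmed here, and `linux-binary-manywheel-rocm` is a hypothetical workflow name used only for illustration.

```python
import re

# Patterns copied from the --required-checks argument above.
REQUIRED_CHECKS = [
    "pull",
    "trunk",
    "lint",
    "^linux-binary-manywheel$",
    "^linux-binary-libtorch-release$",
    "linux-aarch64",
]


def is_required(workflow_name, patterns=REQUIRED_CHECKS):
    """True if any required-check pattern matches the workflow name (assumed re.search)."""
    return any(re.search(pattern, workflow_name) for pattern in patterns)


print(is_required("linux-binary-manywheel"))       # exact name: blocks the upgrade
print(is_required("linux-binary-manywheel-rocm"))  # hypothetical ROCm variant: does not
```

Under that assumption, an unanchored `linux-binary-manywheel` pattern would also match the ROCm variant, which is the blocking behavior this change removes.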

The other solution would have been to mark it as unstable I think

Side note, why is it master and how is it working like that

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162100
Approved by: https://github.com/huydhn
2025-09-03 22:38:32 +00:00
99f356fa58 [ROCm] revamp miopen integration (#161687)
Update sources under ATen/miopen and ATen/native/miopen to align with best practices. Avoid reshape_ calls inside backward operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161687
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-03 22:28:09 +00:00
0af70e2353 Modify ROCm MI2xx-based workflows to run on cron schedule (#162103)
Mitigates queueing on MI2xx runners while the Cirrascale runners are offline. The cron schedule matches that of periodic.yml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162103
Approved by: https://github.com/jeffdaily, https://github.com/seemethere
2025-09-03 21:51:03 +00:00
b1bb98ddeb [ROCm] TunableOp should use HIP version, not ROCm version (#162067)
Fixes #160874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162067
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-03 21:42:23 +00:00
abc447174c [PP] Add profiling to schedule execution (#160753)
Profiling title will be `str(action)`

<img width="1545" height="694" alt="image" src="https://github.com/user-attachments/assets/60b3506b-b8d6-4ae0-8b32-0d51d45fa2f0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160753
Approved by: https://github.com/wconstab
2025-09-03 21:31:50 +00:00
734ce8eba9 Rename propagate_tensor_meta to make private again (#161744)
Rename the wrapper `propagate_tensor_meta` added in #161334 to make it clearly private, and rename the existing LRU function to accommodate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161744
Approved by: https://github.com/bdhirsh
2025-09-03 21:11:45 +00:00
98efc9e93d [ROCm] Bump AOTriton to 0.11b (#161754)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b:

* Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements
  - AITER ASM kernels deliver over 500TFLOPS training performance. See
    [AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more
    details.
* Now returns a natural-base `logsumexp` tensor, matching CUDA's behavior
  - PR #156903 is reverted in this PR as well since it is not needed anymore.
* Enables `CausalVariant.LOWER_RIGHT`

The build system changes drastically along with the new packaging scheme of
AOTriton 0.11:

* AOTriton 0.11 packs GPU images separately from AOTriton runtime
* `aotriton.cmake` now selectively downloads image packs according to
  `PYTORCH_ROCM_ARCH`
* `aotriton.cmake` now only uses the pre-compiled runtime library that exactly
  matches the ROCM in the build environment. For PyTorch builds with ROCm
  versions not listed in the file, the build process will build AOTriton
  runtime without GPU images from source
  - This avoids any further ABI breaks like ROCM 6.4 -> 7.0
  - recursive git clone is disabled since building AOTriton runtime does not
    require submodules.

Bug fixes:

* Fix a kernel bug introduced when implementing SWA

Known Problems:

* gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status
  due to accuracy issues. Triton compiler fixes are needed to restore the
  support status.
* Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCM 7.0.
  This issue is under investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161754
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-09-03 20:45:44 +00:00
994f2a5dbc [SymmMem][CI] Make sure group names are consistent (#162035)
Unblocking #161741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162035
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-09-03 20:40:24 +00:00
d1706d9128 [Symmetric memory] set handle type for ROCm (#161741)
Fixes #161722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161741
Approved by: https://github.com/kwen2501
2025-09-03 20:33:35 +00:00
1aa7476885 fix to segmentation fault when empty tensor is passed to choose_qpara… (#161966)
…ms_optimized

Fixes #153326

Minimal code to reproduce error:
```
import torch

tensor = torch.tensor([])

torch.choose_qparams_optimized(
    tensor,
    0,
    200,
    0.16,
    8
)
```

Previous Output:
`Segmentation fault`

Now Output:
```
Traceback (most recent call last):
  File "/home/amaitra/work/tests/issue_153326.py", line 5, in <module>
    torch.choose_qparams_optimized(
RuntimeError: input tensor is empty and has no data
```

Cause: `const float* input_row = input_tensor.const_data_ptr<float>();` returns a null pointer when the input tensor is empty.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161966
Approved by: https://github.com/Skylion007
2025-09-03 20:26:26 +00:00
8e23a1227b [ROCm/Windows] Fix build failures and support some BLAS calls (#161981)
* Support getrsBatched/geqrfBatched/gelsBatched on Windows ROCm (fixes https://github.com/ROCm/TheRock/issues/1367)
* Fix windows pytorch build with USE_DISTRIBUTED=ON by default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161981
Approved by: https://github.com/ScottTodd, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-03 20:26:14 +00:00
850e1382a9 [hipify] Replace cudaStreamCaptureStatusNone (#161992)
Replace additional CUDA symbols with their HIP equivalents.

Differential Revision: D81420086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161992
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007
2025-09-03 20:23:32 +00:00
3c0ff1b569 [SymmMem] Add root argument to broadcast op (#161090)
It was missing earlier. Also added range check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161090
Approved by: https://github.com/fegin
2025-09-03 20:17:45 +00:00
c465b3d52c [2/n][export] Refactor PT2 Archive weight saving and loading (#161520)
Summary:
The saving (serialization) part of PT2 archive weight refactoring.
The loading (deserialization part) has been landed in D80035490

Test Plan:
CI

Rollback Plan:

Differential Revision: D80970931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161520
Approved by: https://github.com/SherlockNoMad
2025-09-03 20:12:49 +00:00
f4c33cd44a [pt2e] Avoid getting model device once per node (#159901)
**Summary:** Previously, we called `assert_and_get_unique_device` once per node in both prepare and convert. This is expensive and unnecessary since the model device is the same across all nodes, so we now call it once at the beginning and reuse the same model device across all nodes.

**Test Plan:**
python test/test_quantization.py -k TestQuantizePT2E

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159901
Approved by: https://github.com/jerryzh168
2025-09-03 19:29:00 +00:00
92576a594b Prototype for building non-strict leak detector (#160456)
Summary:
Our strategy for detecting fake tensors that leak to an outside scope in non-strict export (side effects that happen outside of model.forward) is:
1. We do gc.collect() before export and get the alive fake tensors
2. We dump the proxy to fake tensor map from make_fx tracer
3. We query gc again to get alive fake tensors
4. We take the delta between (1) and (3)
5. Filter out fake tensors that are:
    1. Associated with `TrackedFake` (input tracking thing in symbolic_shapes)
    2. Associated with `gm.meta`
6. Do ID match with the proxies and emit their stacktraces.
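
The gc-delta idea in steps 1, 3, 4 and 5 can be sketched in plain Python. This is only an illustration under stated assumptions: `FakeTensorStandIn`, `find_leaked` and `is_expected` are hypothetical names, and the real detector works on fake tensors and tracer proxies rather than on a toy class.

```python
import gc


class FakeTensorStandIn:
    """Hypothetical stand-in for a fake tensor (not the real torch class)."""

    def __init__(self, name):
        self.name = name


def live_ids(cls):
    """Steps 1 and 3: force a collection, then record ids of live instances."""
    gc.collect()
    return {id(obj) for obj in gc.get_objects() if type(obj) is cls}


def find_leaked(fn, cls, is_expected=lambda obj: False):
    """Run fn and return instances of cls that outlive it and are not expected."""
    before = live_ids(cls)
    fn()
    gc.collect()
    survivors = [obj for obj in gc.get_objects() if type(obj) is cls]
    # Step 4: keep only objects that did not exist before fn ran.
    delta = [obj for obj in survivors if id(obj) not in before]
    # Step 5: drop objects that have a known, legitimate owner.
    return [obj for obj in delta if not is_expected(obj)]


_cache = []  # simulates state that lives outside model.forward


def forward_with_side_effect():
    temp = FakeTensorStandIn("temp")  # freed when the function returns
    _cache.append(FakeTensorStandIn("leaked"))  # outlives the call


leaks = find_leaked(forward_with_side_effect, FakeTensorStandIn)
print([obj.name for obj in leaks])
```

Walking every gc-tracked object on each snapshot is the slow part, which is consistent with gating the feature behind an env variable.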

We rely on (https://github.com/pytorch/pytorch/pull/159923) for other sources of leakages such as:
1. We failed to proxy an operator (like param.data)
2. We cache some tensor in model.forward (https://github.com/pytorch/pytorch/issues/155114)

In general, we notice that calling `gc.collect()` and querying gc for live objects is fairly slow, so this feature is enabled only under an env variable. We should note in the public-facing export documentation that users who run into unexplained fake tensor errors can turn on this env variable for further analysis.

Test Plan:
Test plan

Rollback Plan:

Differential Revision: D80003204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160456
Approved by: https://github.com/pianpwk
2025-09-03 19:21:27 +00:00
cd529b686d [ROCm] Use MI325 (gfx942) runners for binary smoke testing (#162044)
### Motivation

* MI250 Cirrascale runners are currently having network timeout leading to huge queueing of binary smoke test jobs:
<img width="483" height="133" alt="image" src="https://github.com/user-attachments/assets/17293002-78ad-4fc9-954f-ddd518bf0a43" />

* MI210 Hollywood runners (with runner names such as `pytorch-rocm-hw-*`) are not suitable for these jobs, because they seem to take much longer to download artifacts: https://github.com/pytorch/pytorch/pull/153287#issuecomment-2918420345 (this is why these jobs were specifically targeting Cirrascale runners). However, it doesn't seem like Cirrascale runners are necessarily doing much better either e.g. [this recent build](https://github.com/pytorch/pytorch/actions/runs/17332256791/job/49231006755).
* Moving to MI325 runners should address the stability part at least, while also reducing load on limited MI2xx runner capacity.
* However, I'm not sure if the MI325 runners will do any better on the artifact download part (this may need to be investigated more) cc @amdfaa

* Also removing `ciflow/binaries` and `ciflow/binaries_wheel` label/tag triggers for `generated-linux-binary-manywheel-rocm-main.yml` because we already trigger ROCm binary build/test jobs via these labels/tags in `generated-linux-binary-manywheel-nightly.yml`. And for developers who want to trigger ROCm binary build/test jobs on their PRs, they can use the `ciflow/rocm-mi300` label/tag as per this PR.

### TODOs (cc @amdfaa):
* Check that the workflow runs successfully on the MI325 runners in this PR. Note how long the test jobs take esp. the "Download Build Artifacts" step
* Once this PR is merged, clear the queue of jobs targeting `linux.rocm.gpu.mi250`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162044
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-03 18:34:07 +00:00
62c3f9a97f [inductor] Follow integer overflow rules in TypedExpr (#161922)
Fixes https://github.com/pytorch/pytorch/issues/161763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161922
Approved by: https://github.com/jansel
2025-09-03 18:33:18 +00:00
8076a185c8 Offload set method execution to CPython when possible (#160763)
Reduces CPython `test_set.py` runtime from 63.477s to 40.298s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160763
Approved by: https://github.com/anijain2305
2025-09-03 18:26:05 +00:00
f00445b43e [inductor][ez] add hook for heuristics to adjust kernel input nodes (#161339)
# why

- some templates e.g. scale_mm need to unsqueeze/squeeze the nodes
  for codegen and heuristics

- unified place where we can just adjust them for the template

# what

- inside get_mm_configs, return not the passed in kernel inputs,
  but allow the template heuristic to adjust them if necessary

- the default implementation right now just passes them back

this diff just adds the functionality, but does not exercise it
other than the default (passthrough)
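
A rough sketch of the passthrough hook described above, assuming hypothetical class and method names (`TemplateHeuristic`, `adjust_kernel_inputs`) and using plain shape tuples in place of Inductor's kernel input nodes:

```python
class TemplateHeuristic:
    """Hypothetical base class; the names are illustrative, not Inductor's API."""

    def adjust_kernel_inputs(self, kernel_inputs):
        # Default behaviour: hand the inputs back unchanged (passthrough).
        return kernel_inputs


class UnsqueezeHeuristic(TemplateHeuristic):
    """Hypothetical override that promotes 1-D shapes to 2-D for a template."""

    def adjust_kernel_inputs(self, kernel_inputs):
        return [s if len(s) > 1 else (1, *s) for s in kernel_inputs]


def get_mm_configs(kernel_inputs, heuristic):
    """Return the (possibly adjusted) inputs instead of the ones passed in."""
    return heuristic.adjust_kernel_inputs(kernel_inputs)


inputs = [(128, 64), (64,)]
print(get_mm_configs(inputs, TemplateHeuristic()))
print(get_mm_configs(inputs, UnsqueezeHeuristic()))
```

Keeping the adjustment on the heuristic gives callers a single place to pick up template-specific reshaping.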

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520572](https://our.internmc.facebook.com/intern/diff/D81520572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161339
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125, #161126, #161336, #161338
2025-09-03 18:23:22 +00:00
3559c354ce stop suggesting using guard_size_oblivious on data dependent errors (#160510)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160510
Approved by: https://github.com/ezyang
2025-09-03 18:07:59 +00:00
71992dd805 S390x: build nightly binaries for new pythons (#161920)
Enable python 3.13t, 3.14 and 3.14t on s390x for nightly binaries

Fixes #161515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161920
Approved by: https://github.com/malfet
2025-09-03 17:38:38 +00:00
d647185037 Contiguous subgraph decomposition (#161241)
## Summary

Adds a subgraph decomposition for addmm and mm that performs well when `K` is large compared to `M` and `N`. On AMD (transposed only), it works well as an alternative to `split-k`, which does not currently support AMD.

## Background

On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower.
For example:
```
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176]))
  ))
```
is a lot slower than:
```
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1]))
  ))
```
This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels.
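
The difference between the two layouts above comes down to whether the strides of `B` describe a dense row-major buffer. A minimal check in plain Python, with `is_row_major_contiguous` as a hypothetical helper (it is not an Inductor function):

```python
def is_row_major_contiguous(size, stride):
    """Return True if stride matches a dense row-major layout for size."""
    expected = 1
    for dim, st in zip(reversed(size), reversed(stride)):
        # Dimensions of length 1 can carry any stride without affecting layout.
        if dim != 1 and st != expected:
            return False
        expected *= dim
    return True


# The two layouts of args[1] from the examples above.
b_transposed = ([178176, 6144], [1, 178176])
b_contiguous = ([178176, 6144], [6144, 1])

print(is_row_major_contiguous(*b_transposed))
print(is_row_major_contiguous(*b_contiguous))
```

The decomposition is a candidate only for the first case, where copying `B` into a dense layout before the matmul may pay for itself.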

## Data

I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups:
```
Parsed 420 unique shapes from benchmark output
addmm improvements when best:
  addmm_16448x512x2048: +0.14%
  addmm_128x2048x2048: +0.01%
  addmm_128x768x1000: +0.75%
  addmm_12672x3072x768: +1.08%
  addmm_512x768x32000: +0.62%
  addmm_12608x384x384: +0.00%
  addmm_4160x1024x4096: +0.90%
  addmm_16x768x2: +0.56%
  addmm_12608x3072x768: +0.09%
  addmm_64x4096x1000: +2.77%
  addmm_256x1024x512: +1.99%
  addmm_30x256x256: +1.12%
  addmm_100480x128x384: +0.91%
  addmm_6400x2048x512: +0.25%
  addmm_61568x1024x256: +0.08%
  addmm_1x768x768: +0.93%
  addmm_12544x384x384: +0.19%
  addmm_128x512x1000: +0.77%
  addmm_2048x128x128: +1.32%
  addmm_128x3072x1000: +0.24%
  addmm_7936x512x2048: +0.07%
  addmm_8192x512x2048: +0.33%
  addmm_64x1024x1000: +1.43%
  addmm_128x2304x1000: +0.01%
  addmm_32768x256x2: +0.75%
  addmm_64x384x1152: +0.79%
  addmm_64x640x1000: +0.01%
  addmm_100480x128x128: +0.87%
  addmm_1152x3072x768: +1.13%
  addmm_8192x256x2048: +1.40%
  addmm_4096x128x768: +0.01%
  addmm_128x2560x1000: +0.01%
  addmm_12544x2048x512: +0.43%
  addmm_200704x24x96: +0.14%
  addmm_8448x512x2048: +0.96%
  addmm_50176x256x1024: +0.62%
  addmm_4160x4096x1024: +0.22%
  addmm_4096x768x768: +0.32%
  addmm_220x2048x512: +0.56%
  addmm_8x2048x1000: +1.12%
  addmm_256x197951x512: +26.99%
  addmm_401536x64x192: +0.60%
  addmm_2040x2048x512: +0.47%
  addmm_512x1024x256: +1.32%
  addmm_128x4096x1000: +1.67%
  addmm_12672x768x768: +0.34%
  addmm_128x368x1000: +0.77%
  addmm_96x1280x1000: +0.01%
  addmm_12544x512x2048: +0.41%
  addmm_6272x320x1280: +0.76%
  addmm_12544x3072x768: +0.09%
  addmm_64x384x1000: +0.39%
mm improvements when best:
  mm_200704x128x512: +1.29%
  mm_663552x16x16: +0.80%
  mm_4096x768x768: +0.51%
  mm_131072x64x31: +0.24%
  mm_12544x1152x384: +0.11%
  mm_128x2048x2: +0.46%
  mm_262144x16x23: +0.62%
  mm_50176x576x192: +0.37%
  mm_131072x16x31: +0.26%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================

Operation: addmm
----------------------------------------
Total shapes analyzed: 247
Average Subgraph placement: 3.38
Median Subgraph placement: 2.0
Subgraph is best choice: 52/247 shapes (21.1%)
Average improvement when best: 1.15%
Median improvement when best: 0.58%
Largest improvement when best: +26.99%

Operation: bmm
----------------------------------------
Total shapes analyzed: 85
Average Subgraph placement: 24.00
Median Subgraph placement: 21.0
Subgraph is best choice: 0/85 shapes (0.0%)
Average improvement when best: N/A (never best)
Median improvement when best: N/A (never best)
Largest improvement when best: N/A (never best)

Operation: mm
----------------------------------------
Total shapes analyzed: 88
Average Subgraph placement: 15.08
Median Subgraph placement: 4.0
Subgraph is best choice: 9/88 shapes (10.2%)
Average improvement when best: 0.52%
Median improvement when best: 0.46%
Largest improvement when best: +1.29%

```

## Results

The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune:
```
addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436
addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702
addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834
addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105
...
```
Compared to the non-transposed autotune:
```
addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421
addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246
addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547
addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895
addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916
```
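
As a sanity check on the numbers, the improvement can be recomputed from the two fastest entries of the non-transposed autotune. The formula below is an assumption (improvement measured relative to the runner-up's time), not something taken from the data gathering scripts:

```python
# Timings copied from the non-transposed autotune output above.
subgraph_time = 0.5024129748344421
extern_time = 0.6881489753723145


def improvement_pct(best, runner_up):
    """Relative time saved by the best choice over the runner-up, in percent."""
    return (runner_up - best) / runner_up * 100.0


pct = improvement_pct(subgraph_time, extern_time)
print(f"{pct:.2f}%")
```

Under that assumption the result rounds to 26.99%, the figure reported for `addmm_256x197951x512`.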

It seems to perform really well for high values of `K` vs `N` and `M`.
Testing this hypothesis with some custom shapes:
```
Parsed 64 unique shapes from benchmark output
addmm improvements when best:
  addmm_128x16384x128: +0.18%
  addmm_128x262144x256: +38.24%
  addmm_128x200000x512: +14.76%
  addmm_256x800000x128: +0.06%
  addmm_131072x128x256: +0.27%
  addmm_128x256x131072: +0.25%
  addmm_2048x200000x64: +12.45%
mm improvements when best:
  mm_128x16384x128: +0.18%
  mm_128x262144x256: +38.05%
  mm_128x200000x512: +9.47%
  mm_256x800000x128: +0.99%
  mm_512x6400000x256: +3.17%
  mm_524288x64x64: +0.29%
  mm_2048x200000x64: +11.19%
  mm_8192x1000000x256: +34.14%
  mm_128x4096x100000: +0.40%
  mm_128x3072x150000: +0.27%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================

Operation: addmm
----------------------------------------
Total shapes analyzed: 33
Average Subgraph placement: 4.39
Median Subgraph placement: 2.0
Subgraph is best choice: 7/33 shapes (21.2%)
Average improvement when best: 9.46%
Median improvement when best: 0.27%
Largest improvement when best: +38.24%

Operation: mm
----------------------------------------
Total shapes analyzed: 30
Average Subgraph placement: 7.63
Median Subgraph placement: 2.0
Subgraph is best choice: 10/30 shapes (33.3%)
Average improvement when best: 9.81%
Median improvement when best: 2.08%
Largest improvement when best: +38.05%

```
## Conclusion
Contiguous Subgraph Decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and gives a very large improvement on low `M`, low `N`, and high `K` shapes.

Data gathering scripts:
https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866

## Test Plan:
New unit tests.

Differential Revision: D80771648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161241
Approved by: https://github.com/eellison
2025-09-03 17:02:59 +00:00
eb18d32bda Add range_iterator (#161800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161800
Approved by: https://github.com/anijain2305
ghstack dependencies: #161799
2025-09-03 16:55:04 +00:00
889f01eb73 Add CPython test test_range (#161799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161799
Approved by: https://github.com/anijain2305
2025-09-03 16:55:04 +00:00
451ed93156 [inductor] fix split_aot_inductor_output_path on Windows. (#162058)
fix split_aot_inductor_output_path on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162058
Approved by: https://github.com/angelayi
2025-09-03 16:53:38 +00:00
9491d289b3 Support generic dynamic shape with padding (#160997)
Summary:
Inductor has the following configurations:

config.comprehensive_padding
config.padding_alignment_bytes
config.padding_stride_threshold

For static shapes, enabling these three options makes Inductor generate code for flexible-layout tensors that pads every stride above config.padding_stride_threshold up to a multiple of config.padding_alignment_bytes. When dynamic shapes are enabled, no padding is done today.
This PR introduces the following configuration, which lets the user request padded strides even for dynamic shape operations. It is opt-in so that we don't break the previous behaviour of not padding dynamic shape use cases. config.padding_stride_threshold does not apply, since the stride values are dynamic.

config.pad_dynamic_shapes
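
The static-shape padding rule can be sketched as follows. This is a simplified illustration, not Inductor's implementation: `pad_stride` is a hypothetical helper, and the default alignment and threshold values are placeholders rather than the real config defaults.

```python
def pad_stride(stride, itemsize, alignment_bytes=128, stride_threshold=1024):
    """Round a stride (in elements) up to the alignment if it is large enough."""
    # Convert the byte alignment into a number of elements of this dtype.
    align_elems = max(alignment_bytes // itemsize, 1)
    if stride <= stride_threshold:
        return stride  # small strides are left alone
    return ((stride + align_elems - 1) // align_elems) * align_elems


# float16 (2 bytes per element): the alignment is 64 elements.
print(pad_stride(1030, itemsize=2))
print(pad_stride(100, itemsize=2))
```

With dynamic shapes the stride is symbolic, so the threshold comparison cannot be evaluated at compile time, which fits the note that the threshold does not apply there.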

In addition, a new "python_slow" mode has been added to the launch grid calculation; it achieves the same ceildiv behaviour in a form that is generally applicable to integer division. This is done to prevent test regressions and make wrapper_fxir codegen more generic.
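
For reference, two common ways to write ceildiv with Python integer division are shown below. Which form each launch grid mode actually emits is not stated here, so the function names are hypothetical and the mapping to "python_slow" is left open.

```python
def ceildiv_negate(numel, block):
    """Ceiling division using floor division on a negated divisor."""
    return -(numel // -block)


def ceildiv_add(numel, block):
    """Ceiling division using the add-then-floor-divide form."""
    return (numel + block - 1) // block


def launch_grid(numel, block):
    """Number of blocks needed to cover numel elements."""
    return ceildiv_add(numel, block)


print(launch_grid(1000, 256))
```

Both forms agree for non-negative `numel` and positive `block`; the add-then-divide form also carries over unchanged to languages whose integer division truncates toward zero.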

Test Plan:
CI

Rollback Plan:

Differential Revision: D80468808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160997
Approved by: https://github.com/blaine-rister, https://github.com/jansel
2025-09-03 15:58:18 +00:00
c157cf6488 port distributed tensor parallel test files for Intel GPU (#161261)
In this PR, we port 4 test files under test/distributed/parallel and 1 test file under test/distributed/debug to Intel GPU.
We enable Intel GPU with the following methods while keeping the original code style as much as possible:

1. Use torch.accelerator for general gpu
2. Skip the case if running on xpu which has known issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161261
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-03 15:03:32 +00:00
bb950284c7 Revert "[inductor][ez] add hook for heuristics to adjust kernel input nodes (#161339)"
This reverts commit 90f50f7e68e120d9574e6e3189e37b4280010ad9.

Reverted https://github.com/pytorch/pytorch/pull/161339 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, check D81486248 for more details ([comment](https://github.com/pytorch/pytorch/pull/161339#issuecomment-3249600885))
2025-09-03 14:56:02 +00:00
f27985b7e7 Revert "[CUDAGraph] add config to error on skipping cudagraph (#161862)"
This reverts commit 204697f0e695d82894c5010fbec664c4391f90cc.

Reverted https://github.com/pytorch/pytorch/pull/161862 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D81522732 for more details ([comment](https://github.com/pytorch/pytorch/pull/161862#issuecomment-3249582583))
2025-09-03 14:50:44 +00:00
0cd6c56bdf Revert "test: ensure editable cached wrapper is respected (#160943)"
This reverts commit bbedc71fd3267c639c38b4ec25eaa22f973d9c4d.

Reverted https://github.com/pytorch/pytorch/pull/160943 on behalf of https://github.com/jeanschmidt due to See [D81486248](https://www.internalfb.com/diff/D81486248) for details on broken test ([comment](https://github.com/pytorch/pytorch/pull/160943#issuecomment-3249565671))
2025-09-03 14:46:35 +00:00
b40d9432be [BE] Cleanup stale comments/copy from gemm (#162001)
Followup after https://github.com/pytorch/pytorch/pull/154012

Since the introduction of `gemm_no_downcast_stub` it's no longer necessary to allocate temporary array and then manually implement the `beta` logic in the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162001
Approved by: https://github.com/drisspg
ghstack dependencies: #161999
2025-09-03 14:31:09 +00:00
02c83f1334 [BLAS] Avoid downcasts for fp16fp16->fp32 BLAS (#161999)
Followup after https://github.com/pytorch/pytorch/pull/154012

Fixes CPU part of https://github.com/pytorch/pytorch/issues/160841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161999
Approved by: https://github.com/drisspg
2025-09-03 14:31:08 +00:00
aed33a8fcb [Inductor][Tritonparse] Get Inductor kernel params (#161953)
Summary: Save the config args that Inductor burns into `inductor_metadata` so we can optionally pass them to any Jit Hooks that are set. This allows us to pass them to Tritonparse.

Reviewed By: davidberard98, FindHao

Differential Revision: D80994791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161953
Approved by: https://github.com/FindHao
2025-09-03 14:11:27 +00:00
b16d3f4c8c [AOTI] Fix a bug from load_constants (#161887)
Summary:
we have
```
std::vector<size_t> constants_internal_offset(
        num_constants - num_folded_constants);
```

but the for loop does not consider it
```
for (size_t i = 0; i < num_constants; i++) {
...
constants_internal_offset[i]
...
```
even in the for loop, it does
```
bool from_folded = this->constant_from_folded(i);
      if (from_folded) {
        continue;
      }
```
but `i` could still be wrong: it counts all constants, while `constants_internal_offset` only has room for the non-folded ones
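
The indexing problem can be modelled in Python (the real code is C++, where the out-of-range access is undefined behaviour rather than an exception). The function names are hypothetical, and the compact-index version is only one possible fix, not necessarily the one this PR applies.

```python
def fill_offsets_buggy(is_folded, sizes):
    """Index the compact offsets list with the full constant index i."""
    num_constants = len(sizes)
    offsets = [0] * (num_constants - sum(is_folded))
    running = 0
    for i in range(num_constants):
        if is_folded[i]:
            continue
        offsets[i] = running  # i can exceed len(offsets) - 1
        running += sizes[i]
    return offsets


def fill_offsets_compact(is_folded, sizes):
    """Use a separate counter that advances only for non-folded constants."""
    num_constants = len(sizes)
    offsets = [0] * (num_constants - sum(is_folded))
    running = 0
    slot = 0
    for i in range(num_constants):
        if is_folded[i]:
            continue
        offsets[slot] = running
        slot += 1
        running += sizes[i]
    return offsets


is_folded = [True, False, False]
sizes = [4, 8, 16]
print(fill_offsets_compact(is_folded, sizes))
```

With one folded constant out of three, the compact list has two slots, so the buggy version touches index 2 and fails.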

Rollback Plan:

Differential Revision: D81425007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161887
Approved by: https://github.com/angelayi
2025-09-03 07:45:16 +00:00
4ae57d448c Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-03 07:33:55 +00:00
90b08643c3 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-03 07:33:55 +00:00
b0a3e58dd7 Add inline fast paths for SymInt operators (#161586)
If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.

Time spent in SymInt functions in computeStorageNBytes was observed to
drop (without the cost shifting elsewhere in the function) after this
change, when profiling detach() with Linux perf using code similar to
the benchmark from #160580.

Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161586
Approved by: https://github.com/ezyang
ghstack dependencies: #161466
2025-09-03 06:54:47 +00:00
fa1514acf1 Outline SymInt::maybe_as_int_slow_path (#161466)
Keeps SymInt::maybe_as_int small enough to inline.

Differential Revision: [D81530097](https://our.internmc.facebook.com/intern/diff/D81530097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161466
Approved by: https://github.com/ezyang
2025-09-03 06:54:47 +00:00
827f0d4054 Using get_paths() to get correct installation path for PYTHONPATH (#161947)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161947
Approved by: https://github.com/albanD
ghstack dependencies: #161845, #161903
2025-09-03 06:38:03 +00:00
2c03f0acc5 [MPS] enable cat op for sparse (#162007)
Enable cat op for sparse on MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162007
Approved by: https://github.com/malfet
2025-09-03 06:31:35 +00:00
f8ffa9194e Perf nitpicks on python_arg_parser's is_int_or_symint_list (#161998)
This function has come up in DTensor perf work, and I had a nitpick on #160256 so here it is. I have neither compiled nor measured this, but am reasonably confident it's better nonetheless.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161998
Approved by: https://github.com/ezyang
2025-09-03 05:38:30 +00:00
50fc22dedf [Intel GPU] Fix XPU SDPA default priority_order UT fail (#161690)
Fixes #161483

When the whole `test/test_transformers.py` file is run, the case `test_default_priority_order` can pass because other xpu cases would call SDPA so that the priority order is set by eec876deb6/aten/src/ATen/native/mkldnn/xpu/Attention.cpp (L98-L112)

However, when the case `test_default_priority_order` is run separately, the priority order is unset, so the case fails. This PR fixes this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161690
Approved by: https://github.com/guangyey, https://github.com/drisspg
2025-09-03 04:43:27 +00:00
e381d4b020 [DTensor] forbid view ops to redistribute when local split is impossible (#161950)
This PR is a followup to https://github.com/pytorch/pytorch/pull/149764.

That PR only forbids illegal views caused by `Flatten`; this PR also forbids illegal views caused by `Split`.

This PR also updates the error message to be less about internal implementation details, which users may find confusing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161950
Approved by: https://github.com/ezyang
2025-09-03 04:40:11 +00:00
8875d6e394 [vllm hash update] update the pinned vllm hash (#161929)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161929
Approved by: https://github.com/pytorchbot
2025-09-03 04:26:38 +00:00
00636e0171 [Reland][Inductor] Prune configs that require more shared memory than the hardware limit. (#161996)
Summary:
This is a re-land of [PR161040](https://github.com/pytorch/pytorch/pull/161040), which had previously caused test failures on AMD GPUs. The tests are now configured to target only NVIDIA GPUs.

This diff prunes configurations that exceed the hardware shared memory limit; such configurations cause the following compilation error:
```
No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
```
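
A sketch of this kind of pruning, assuming a simple cost model in which each pipeline stage holds one tile of each matmul operand in shared memory. `MMConfig`, `shared_mem_bytes` and `prune_configs` are hypothetical names, and the actual estimate used by Inductor may differ.

```python
from collections import namedtuple

# Hypothetical config record; the fields mirror common Triton matmul parameters.
MMConfig = namedtuple("MMConfig", "block_m block_n block_k num_stages")


def shared_mem_bytes(cfg, dtype_size):
    """Estimate shared memory: one A tile and one B tile per pipeline stage."""
    tile_elems = cfg.block_m * cfg.block_k + cfg.block_k * cfg.block_n
    return tile_elems * cfg.num_stages * dtype_size


def prune_configs(configs, dtype_size, limit_bytes):
    """Keep only configs whose estimated shared memory fits the limit."""
    return [c for c in configs if shared_mem_bytes(c, dtype_size) <= limit_bytes]


configs = [MMConfig(128, 128, 128, 5), MMConfig(64, 64, 32, 3)]
limit = 232448  # hardware limit quoted in the error message above
kept = prune_configs(configs, dtype_size=2, limit_bytes=limit)
print(kept)
```

Under this model the first illustrative config needs 327680 bytes for a 2-byte dtype, so it is dropped, while the second needs 24576 bytes and is kept.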

Test Plan:
```
pytest test/inductor/test_max_autotune.py
pytest test/inductor/test_triton_heuristics.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161996
Approved by: https://github.com/coconutruben
2025-09-03 04:23:09 +00:00
09d2f1b631 [audio hash update] update the pinned audio hash (#161928)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161928
Approved by: https://github.com/pytorchbot
2025-09-03 04:22:55 +00:00
dac8a4b91c Using pip3 install instead of python setup.py develop/install (#161903)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161903
Approved by: https://github.com/ezyang
ghstack dependencies: #161845
2025-09-03 03:12:18 +00:00
d789451ff6 [OpenReg] Migrate Accelerator Document from source/notes into source/accelerator (#161845)
As the title states.

The document will keep growing. To keep it easy for users to read and for developers to maintain, we have split this file into several separate files and placed them in a dedicated directory called "accelerator".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161845
Approved by: https://github.com/albanD
2025-09-03 03:12:18 +00:00
0447f2d99b build: Add fallback commands to setup.py (#162009)
Adds fallback commands for the following:
* python setup.py install
* python setup.py develop

Ideally these should just work and should provide backwards compat.

The thought process here is that many people rely on these commands. Just because setuptools wants to drop support for them, I don't think many of our downstream users who build from source expect them to be gone.

This should provide some room for developers to move away from these commands until we have a unified frontend for doing all of these commands that should abstract most of these away.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162009
Approved by: https://github.com/clee2000, https://github.com/atalman
2025-09-03 02:56:10 +00:00
d5643e8f3a [dynamo, nested graph breaks] support nested graph breaks that cause skipped frames (#160470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160470
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329, #159678, #159817, #160138, #159786
2025-09-03 02:47:07 +00:00
9b81fe281d [c10d] Lessen density of barrier warning (#162015)
Warnings are great, but too dense when there are many ranks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162015
Approved by: https://github.com/d4l3k, https://github.com/H-Huang
2025-09-03 02:20:54 +00:00
90f50f7e68 [inductor][ez] add hook for heuristics to adjust kernel input nodes (#161339)
# why

- some templates e.g. scale_mm need to unsqueeze/squeeze the nodes
  for codegen and heuristics

- unified place where we can just adjust them for the template

# what

- inside get_mm_configs, return not the passed in kernel inputs,
  but allow the template heuristic to adjust them if necessary

- the default implementation right now just passes them back

this diff just adds the functionality, but does not exercise it
other than the default (passthrough)

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520572](https://our.internmc.facebook.com/intern/diff/D81520572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161339
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125, #161126, #161336, #161338
2025-09-03 01:03:57 +00:00
877062c9d3 [inductor][choices][ez] pass through layout and input_nodes (#161338)
# why

- params already available in get_mm_configs
- simplifies the code
- adds a possibility to edit the nodes/layout in
  a centralized place

# what

- add layout and input_nodes into extra_kwargs
- no other modifications

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520575](https://our.internmc.facebook.com/intern/diff/D81520575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161338
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #161123, #161124, #161125, #161126, #161336
2025-09-03 01:03:57 +00:00
c31dee6fa5 [inductor][ez] ExternChoice with maybe_append_choice (#161336)
# why

- make the API for ExternChoice the same as KernelTemplate
- make it possible to use the same retrieval point as templates

# what

- add a maybe_append_choice to ExternChoice that under the hood
  invokes self.bind

This pr does not actuate the new path, but just exposes it

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py
```

Differential Revision: [D81520578](https://our.internmc.facebook.com/intern/diff/D81520578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161336
Approved by: https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125, #161126
2025-09-03 01:03:57 +00:00
6cb13dd3cc [inductor] move scaled_mm template args into heuristics (#161126)
# why

- another step towards get_mm_configs providing
  all the kwargs needed to add a choice from
  a template. This in turn will allow us to send
  all templates through one single call, and handle modifications

# what

- use the infrastructure for template heuristics to provide extra kwargs
  that are fixed for a template/op pair to provide the suffix args
  and epilogue function/fn for scaled_mm

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D80670914](https://our.internmc.facebook.com/intern/diff/D80670914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161126
Approved by: https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125
2025-09-03 01:03:57 +00:00
cbf01c11ff [inductor] move addmm/baddbmm template args into heuristics (#161125)
# why

- another step towards get_mm_configs providing
  all the kwargs needed to add a choice from
  a template. This in turn will allow us to send
  all templates through one single call, and handle modifications

# what

- use the infrastructure for template heuristics to provide extra kwargs
  that are fixed for a template/op pair to provide the prefix args
  and epilogue function/fn for addmm/baddbmm

- expand kernelinputs to also be able to shuttle around non tensor
  inputs (scalars) as is needed for alpha and beta

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k addmm
```

Differential Revision: [D80670912](https://our.internmc.facebook.com/intern/diff/D80670912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161125
Approved by: https://github.com/jansel
ghstack dependencies: #161123, #161124
2025-09-03 01:03:57 +00:00
7cdfa520a6 [inductor] move tma workspace in heuristics (#161124)
# why

- another step towards get_mm_configs providing
  all the kwargs needed to add a choice from
  a template. This in turn will allow us to send
  all templates through one single call, and handle modifications

# what

use the infrastructure for template heuristics to provide extra kwargs
that are fixed for a template/op pair to provide the workspace_arg for
all the tma templates

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k tma
```

Differential Revision: [D80670915](https://our.internmc.facebook.com/intern/diff/D80670915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161124
Approved by: https://github.com/jansel
ghstack dependencies: #161123
2025-09-03 01:03:57 +00:00
1485ac3264 [inductor] add notion of extra_kwargs for mm_configs (#161123)
# why

- some kwargs are choice independent but rather
  always the same for a specific op or template
- this enables us to track those differently than the
  choice ones, and thus enables cleaner interception
  of them
- maybe_append_choices can then be simplified to
  just pass through the kwargs

# what

- hookup for template heuristics to have per template/op extra
  kwargs that are always the same, for all choices
- hookup for the caller of get_mm_configs to provide template/op
  kwargs to override some of the template/choice kwargs

this pr does not use the new machinery, and everything is empty
for now. subsequent prs start using it to simplify ops
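
One way to picture the layering described above is a plain dictionary merge. This is a sketch under assumptions: the helper name `build_choice_kwargs`, the example keys, and the relative precedence of per-choice versus per-template kwargs are all hypothetical. Only the caller overrides winning over the other two follows from the description.

```python
def build_choice_kwargs(choice_kwargs, template_extra_kwargs, caller_overrides):
    """Merge kwargs for one choice; later sources win on key conflicts."""
    merged = dict(choice_kwargs)          # varies per choice (e.g. block sizes)
    merged.update(template_extra_kwargs)  # fixed for a template/op pair
    merged.update(caller_overrides)       # supplied by the caller of get_mm_configs
    return merged


choices = [{"BLOCK_M": 64, "BLOCK_N": 64}, {"BLOCK_M": 128, "BLOCK_N": 64}]
template_extra = {"epilogue_fn": "identity"}
overrides = {"BLOCK_N": 32}

all_kwargs = [build_choice_kwargs(c, template_extra, overrides) for c in choices]
```

Because the per-template kwargs are the same for every choice, they only need to be computed once and can be tracked separately from the per-choice ones.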

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D80670916](https://our.internmc.facebook.com/intern/diff/D80670916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161123
Approved by: https://github.com/jansel
2025-09-03 01:03:57 +00:00
c5b8a10be5 Fix compiler errors in 3.14 stub definitions (#161792)
The functions here expect to return pointers, but currently aren't returning anything.  Make them return NULL.

The properties array wants an extra set of braces.  One pair for the array, another for the first item in the array.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161792
Approved by: https://github.com/Skylion007
2025-09-03 00:58:41 +00:00
a02ee4a816 [SymmMem] Use non-blocking version of getmem (#162006)
As titled: the `getmem` calls in the loop are now non-blocking, so that we max out the issuance rate.
Also added a single `nvshmem_quiet()` at the end to make sure all the getmem calls complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162006
Approved by: https://github.com/ngimel
2025-09-02 23:55:22 +00:00
81b7b16618 Reland "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142)" (#161949)
This PR relands #161142, which was reverted so that another PR could be reverted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161949
Approved by: https://github.com/jansel
2025-09-02 23:43:27 +00:00
4cdaf8265d Revert "Update Kineto submodule (#161572)"
This reverts commit d33840c542b387ab08ba49aa6c45aa9567fd9be7.

Reverted https://github.com/pytorch/pytorch/pull/161572 on behalf of https://github.com/seemethere due to This appears as though its causing downstream build failures in inductor workflows and for developers working locally. Going to revert out of an abundance of caution. ([comment](https://github.com/pytorch/pytorch/pull/161572#issuecomment-3247121981))
2025-09-02 23:28:19 +00:00
874069fbe4 Log Const Folded Node (#161827)
Summary: Log folded nodes for easier debugging.

Test Plan:
sandcastle.

Rollback Plan:

Reviewed By: henryoier

Differential Revision: D81352098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161827
Approved by: https://github.com/henryoier, https://github.com/yewentao256
2025-09-02 23:23:51 +00:00
ab643e4dbb [SymmMem] Increase minimum nthreads to cover sync needs in NVL72 (#161983)
`sync_remote_blocks` maps threads to peers. Previously the minimum nthreads was the warp size, which is too small to cover NVL72. Bumping it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161983
Approved by: https://github.com/ngimel
2025-09-02 23:18:08 +00:00
5a2da090ed [SymmMem] Make sure CUDA runtime is initialized before NVSHMEM init (#161232)
Previously, without calling `torch.empty` before NVSHMEM init, we saw the error below:
```
src/host/init/init.cu:nvshmemi_check_state_and_init:1117: nvshmem initialization failed, exiting
src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
```
Fixing it by calling a `cudaFree(nullptr)` to make sure CUDA runtime is initialized before NVSHMEM init.
Removing all `torch.empty(1)` calls from tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161232
Approved by: https://github.com/ngimel
ghstack dependencies: #161214
2025-09-02 22:53:28 +00:00
bd39e47fee [ONNX] Default to dynamo export (#159646)
Set dynamo=True and enable fallback.

1. Implemented the compatible behavior where a BytesIO object passed as `f` is accepted
2. Update tests to explicitly set dynamo=False

#151693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159646
Approved by: https://github.com/titaiwangms
2025-09-02 22:45:55 +00:00
e4bd0ff4f8 [aot precompile] Handle closure variables. (#161990)
We previously assumed aot precompile should only work on non-closures. This is hard to enforce in practice because we will see a lot of cases with decorators (e.g. Hugging Face models)
```
def check_inputs(fn):
    def _fn(*args, **kwargs):
        for arg in args:
            assert arg.shape[0] > 1

        return fn(*args, **kwargs)
    return _fn

@check_inputs
def foo(x, y):
    a = x + x
    b = y + y
    c = a + b
    return c
```
It doesn't make sense to not support these cases since they are straightforward to handle.

This PR adds the logic to handle closure and make sure they can be precompiled properly.
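
To see what "closure variables" means for a decorated function, here is a small pure-Python sketch. It uses lists instead of tensors so it runs without PyTorch, and the helper `closure_variables` is a hypothetical name. It does not show how the PR serializes closures, only which values a precompiler would have to account for.

```python
def check_inputs(fn):
    def _fn(*args, **kwargs):
        for arg in args:
            assert len(arg) > 1
        return fn(*args, **kwargs)
    return _fn


@check_inputs
def foo(x, y):
    return [a + b for a, b in zip(x, y)]


def closure_variables(func):
    """Map each free variable name of `func` to the value stored in its closure cell."""
    cells = func.__closure__ or ()
    return {
        name: cell.cell_contents
        for name, cell in zip(func.__code__.co_freevars, cells)
    }


# After decoration, `foo` is the inner `_fn`, which captures the original
# function under the free variable name `fn`.
captured = closure_variables(foo)
```

The wrapper's only free variable is `fn`, which is the undecorated function, so handling this case amounts to carrying that captured value along with the wrapper's code.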

Differential Revision: [D81509535](https://our.internmc.facebook.com/intern/diff/D81509535/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161990
Approved by: https://github.com/angelayi
2025-09-02 22:26:04 +00:00
15c77a8cfd Revert "Add inductor provenance mapping for cpp extern kernel (#161656)"
This reverts commit 5e5870e858f60ff4bf87d03f3592097e934a9580.

Reverted https://github.com/pytorch/pytorch/pull/161656 on behalf of https://github.com/jeffdaily due to causing failures on ROCm MI300, will add label to PR ([comment](https://github.com/pytorch/pytorch/pull/161656#issuecomment-3246965676))
2025-09-02 22:19:19 +00:00
791eff96c8 [MPS] Add igamma/igammac ops (#161927)
Fixes #161725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161927
Approved by: https://github.com/malfet
2025-09-02 20:52:02 +00:00
80dd397f19 Argsort doc stable kwargs (#161986)
Fixes #129311

Updated torch.argsort documentation to reflect that the 'stable' parameter is a keyword argument and not a normal parameter.

@albanD, @soulitzer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161986
Approved by: https://github.com/soulitzer
2025-09-02 20:42:53 +00:00
a75e8cd270 Add api info for torch._C._nn.pyi (#161958)
Fix part of #148404

The APIs involved are as follows:

- max_pool2d_with_indices
- max_pool3d_with_indices
- elu
- glu
- max_unpool2d
- max_unpool3d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161958
Approved by: https://github.com/ezyang
2025-09-02 20:39:20 +00:00
4e42aa8ffc Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit b7034e9c924412bfbe8ee25a22d7e95239b5ca65.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3246689684))
2025-09-02 20:28:42 +00:00
420c52ecf3 Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit 626cb7df8161dd4ecb4fe43b60f37ce9076f56b1.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3246677982))
2025-09-02 20:24:01 +00:00
82f63c8f6d Revert "[HOTFIX] Disable DISTRIBUTED_C10D_DIRECT_ACCESS for now (#161946)"
This reverts commit 5561e45758d59c94605873d5db48ed459c004c3b.

Reverted https://github.com/pytorch/pytorch/pull/161946 on behalf of https://github.com/jeanschmidt due to Need to be reverted so https://github.com/pytorch/pytorch/pull/159889 can be ([comment](https://github.com/pytorch/pytorch/pull/161946#issuecomment-3246663376))
2025-09-02 20:18:52 +00:00
b4ad38279b [AOTI] Add Windows-compatible implementation of the mmap-related funcs (#161805)
Add Windows-compatible implementation of the mmap-related functions.

This code was validated in a small development project: https://github.com/xuhancn/cross_os_mmap?tab=readme-ov-file#cross_os_mmap

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161805
Approved by: https://github.com/angelayi
2025-09-02 20:07:41 +00:00
ef8aabd424 [CD][CUDA13][ARM] aarch64 binary seems to be missing Triton dependency (#161833)
Requires: filelock, fsspec, jinja2, networkx, setuptools, sympy, typing-extensions

Seems to be missing Triton.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161833
Approved by: https://github.com/tinglvv, https://github.com/Skylion007, https://github.com/atalman
2025-09-02 19:31:14 +00:00
dcf385395d [MPS] Move sparsemps testing from test_mps to test_sparse (#161852)
Moves Sparse MPS testing from test_mps to test_sparse. Lots of skips now but I expect to remove them iteratively once ops are implemented

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161852
Approved by: https://github.com/malfet
2025-09-02 19:04:11 +00:00
600c25e9a1 [dynamo] Graph break on torch.cuda.synchronize (#161925)
Today, AOTDispatcher ignores cuda.synchronize. Even if we wrap it in
some HOP, we need it to be a barrier op to prevent any inductor
reordering. So we graph break on it.

Fixes https://github.com/pytorch/pytorch/issues/160751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161925
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/mlazos
2025-09-02 19:00:21 +00:00
f981a7fa52 [SymmMem] Add device guard before alloc (#161214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161214
Approved by: https://github.com/ngimel
2025-09-02 18:53:45 +00:00
b7e207ca9f Make error message descriptive (#150627) (#159423)
Summary:

Adding the number of local shards to error messages makes it easier to debug.

Test Plan: UT

Differential Revision: D72396478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159423
Approved by: https://github.com/Saiteja64
2025-09-02 17:54:39 +00:00
5e5870e858 Add inductor provenance mapping for cpp extern kernel (#161656)
Summary: Add inductor provenance mapping for cpp extern kernel

Test Plan:

```
buck run fbcode//caffe2/test/inductor:provenance_tracing --  -r test_cpu_extern_kernel
```

Differential Revision: D81161751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161656
Approved by: https://github.com/angelayi
2025-09-02 17:54:04 +00:00
a99d8d39bc Update torch-xpu-ops commit pin (#161919)
# Motivation
1. Fall back to CPU for some linalg functionality such as `linalg_eig`, `linalg_householder_product`, `linalg_solve_triangular`;
2. Fix codegen dependency bug.

# Additional Context
This PR aims to fix https://github.com/pytorch/pytorch/issues/161498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161919
Approved by: https://github.com/EikanWang
2025-09-02 17:09:07 +00:00
d6b74568e2 Revert "Add __init__.pyi to torch/linalg (#160750)"
This reverts commit 9a665ca3c472384e9d722bddba79e5a7680f1abd.

Reverted https://github.com/pytorch/pytorch/pull/160750 on behalf of https://github.com/jeanschmidt due to Seems that those errors are legitimate, and there is no test plan. I'll be proceeding with a revert ([comment](https://github.com/pytorch/pytorch/pull/160750#issuecomment-3246095383))
2025-09-02 16:53:55 +00:00
d33840c542 Update Kineto submodule (#161572)
Differential Revision: D81087601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161572
Approved by: https://github.com/cyyever, https://github.com/aaronenyeshi
2025-09-02 16:31:55 +00:00
f0c391102b [ONNX] Remove private members from torch.onnx (#161546)
Remove import of two functions

- _run_symbolic_function
- _run_symbolic_method

to the `torch.onnx` namespace.

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161546
Approved by: https://github.com/titaiwangms
ghstack dependencies: #161323, #161449
2025-09-02 16:31:23 +00:00
a8d6943d36 ROCm: Enable overload tests from test_matmul_cuda (#161540)
This patch enables hipblaslt backend tests for test_mm_bmm_dtype_overload and test_addmm_baddmm_dtype_overload.
Tests were disabled as part of #150812
Rocblas backend tests are not enabled yet, WIP.

Test commands:
PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_mm_bmm_dtype_overload' -v
PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_addmm_baddmm_dtype_overload' -v

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161540
Approved by: https://github.com/jeffdaily
2025-09-02 16:27:42 +00:00
d11720efdb [ONNX] Remove unused logic from internal verification module (#161449)
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161449
Approved by: https://github.com/xadupre, https://github.com/titaiwangms
ghstack dependencies: #161323
2025-09-02 16:22:49 +00:00
9a1c5c0a07 Detect torch function in lists as well (#160256)
We basically follow the same pattern we do for tensor arguments. The major downside is that we now have to traverse the entirety of the int list / etc., where previously we didn't have to. Benchmarks suggest a 2% regression for relevant things.
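
The extra traversal can be illustrated in pure Python. This is a conceptual sketch only: the real check lives in the C++ argument parser, and `has_torch_function`, `any_overloaded`, and `FakeSize` are hypothetical names chosen for the example.

```python
def has_torch_function(obj):
    """True if the object's type defines a `__torch_function__` override."""
    return hasattr(type(obj), "__torch_function__")


def any_overloaded(args):
    """Check top-level arguments and also the elements of list/tuple arguments."""
    for arg in args:
        if has_torch_function(arg):
            return True
        if isinstance(arg, (list, tuple)):
            # The extra cost: every element of the sequence has to be visited.
            if any(has_torch_function(item) for item in arg):
                return True
    return False


class FakeSize:
    """Hypothetical int-like wrapper that opts into the override protocol."""

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        return NotImplemented
```

A call whose list argument contains no overriding element pays for a full scan of that list before falling through to the normal path, which is where a small regression would come from.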

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160256
Approved by: https://github.com/albanD
2025-09-02 16:22:42 +00:00
524b78d4f6 [ONNX] Refactor torchscript based exporter (#161323)
Refactor torchscript based exporter logic to move them to a single (private) location for better code management. Original public module and method apis are preserved.

- Updated module paths in `torch/csrc/autograd/python_function.cpp` accordingly
- Removed `check_onnx_broadcast` from `torch/autograd/_functions/utils.py` because it is private and unused

@albanD / @soulitzer could you review changes in `torch/csrc/autograd/python_function.cpp` and
`torch/autograd/_functions/utils.py`? Thanks!

## BC Breaking
- **Deprecated members in `torch.onnx.verification` are removed**

Differential Revision: [D81236421](https://our.internmc.facebook.com/intern/diff/D81236421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161323
Approved by: https://github.com/titaiwangms, https://github.com/angelayi
2025-09-02 16:10:30 +00:00
793fc12aff [CD] Fix setup-xpu action issue (#161934)
Fix XPU CD test failure, refer to https://github.com/pytorch/pytorch/actions/runs/17370923627/job/49315624191
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161934
Approved by: https://github.com/atalman
2025-09-02 16:03:44 +00:00
204697f0e6 [CUDAGraph] add config to error on skipping cudagraph (#161862)
Many users want a config that forces all CUDA ops to be captured by cudagraph. When that is not possible, pt2 should error.

This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default: False). It also adds an environment variable, `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR`, to control it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161862
Approved by: https://github.com/ezyang
2025-09-02 15:28:22 +00:00
789d494212 Defer loading hipify until it is needed (#160824)
Saves a few milliseconds when running a test case:

Before:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow
frames [('total', 1), ('ok', 1)]
inline_call []
.
----------------------------------------------------------------------
Ran 1 test in 1.497s
```

After:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow
frames [('total', 1), ('ok', 1)]
inline_call []
.
----------------------------------------------------------------------
Ran 1 test in 0.909s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160824
Approved by: https://github.com/zou3519
2025-09-02 15:27:37 +00:00
bc4db2c27f CUDA 13 -- sm_120 -- Nvidia 5090 -- ptxas warning : Value of threads … (#161380)
bug fix:

I have opened an issue ( https://github.com/pytorch/pytorch/issues/161376 ) and I suggest this bug fix.

With this method, it compiles fine.

Fixes #161376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161380
Approved by: https://github.com/eqy, https://github.com/malfet

Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
2025-09-02 13:27:57 +00:00
e304ea4e69 Revert "[BE] Update xpu driver repo for CD used almalinux 8.10 (#157356)"
This reverts commit c78bbdf4102d2c13bf6aa1abe4352aa7bca401ca.

Reverted https://github.com/pytorch/pytorch/pull/157356 on behalf of https://github.com/chuanqi129 due to This PR has performance regression on some workloads ([comment](https://github.com/pytorch/pytorch/pull/157356#issuecomment-3245319046))
2025-09-02 13:20:38 +00:00
1f820de639 [ci] Increase shards for linux-jammy-py3.10-clang18-asan on pull.yml to 7 (#161968)
[ci] Increase shards for linux-jammy-py3.10-clang18-asan to 7
2025-09-02 14:08:47 +02:00
fca2601c9d Improve error message for unsupported padding config (#160866)
Fixes #160053

The previous error message, `Only 2D, 3D, 4D, 5D padding with non-constant  padding are supported for now`, was not clear.

now we have

```
python3
Python 3.13.5 | packaged by conda-forge | (main, Jun 16 2025, 08:27:50) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
... import torch.nn.functional as F
... a = torch.empty(2,2,2,2)
... F.pad(a, (1,1), mode="circular")
...
Traceback (most recent call last):
  File "<python-input-0>", line 4, in <module>
    F.pad(a, (1,1), mode="circular")
    ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rrathaur/Desktop/pytorch/torch/nn/functional.py", line 5294, in pad
    return torch._C._nn.pad(input, pad, mode, value)
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: Padding size 2 is not supported for 4D input tensor.
Supported combinations for non-constant padding:
  - 2D or 3D input: padding size = 2 (pads last dimension)
  - 3D or 4D input: padding size = 4 (pads last 2 dimensions)
  - 4D or 5D input: padding size = 6 (pads last 3 dimensions)
>>>
```
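
The supported combinations listed in the new message can be written down as a small lookup. This is a pure-Python sketch of the rule, not the actual check in PyTorch. `SUPPORTED` and `check_non_constant_padding` are hypothetical names, and the table is transcribed from the message above.

```python
# padding size -> input dimensionalities accepted for non-constant modes
SUPPORTED = {2: (2, 3), 4: (3, 4), 6: (4, 5)}


def check_non_constant_padding(input_ndim, pad):
    """Return how many trailing dims get padded, or raise for unsupported combos."""
    allowed = SUPPORTED.get(len(pad))
    if allowed is None or input_ndim not in allowed:
        raise NotImplementedError(
            f"Padding size {len(pad)} is not supported for {input_ndim}D input tensor."
        )
    # Each pair of pad values covers one trailing dimension.
    return len(pad) // 2
```

For the failing call in the transcript (a 4D input with a padding of size 2), this lookup rejects the combination, while a padding of size 4 or 6 on the same input would be accepted.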

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160866
Approved by: https://github.com/mikaylagawarecki
2025-09-02 07:15:59 +00:00
f8746b878d Add uuid to XPU device properties (#161392)
# Motivation
Fix https://github.com/intel/torch-xpu-ops/issues/1955
Refer to https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_device_info.md#device-uuid, `ext::intel::info::device::uuid` returns `std::array<unsigned char, 16>` as the UUID.
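
For readers unfamiliar with the raw form, a 16-byte array maps onto the usual textual UUID as sketched below. How the property is formatted on the PyTorch side is not stated here, so `format_device_uuid` is a hypothetical helper built only on the standard-library `uuid` module.

```python
import uuid


def format_device_uuid(raw):
    """Turn 16 raw byte values into the canonical 8-4-4-4-12 hex string."""
    data = bytes(raw)
    if len(data) != 16:
        raise ValueError(f"expected 16 bytes, got {len(data)}")
    return str(uuid.UUID(bytes=data))


# Byte values 0x00 through 0x0f, purely as sample input.
example = format_device_uuid(range(16))
```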
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161392
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-09-02 06:41:32 +00:00
8703debf66 [DTensor] select strategy with no redistribute when redistribute cost is 0 (#161882)
Before this PR, the `_select_strategy` always selects the first strategy with minimum redistribute cost. This causes unexpected behavior when
- multiple strategies have 0 redistribute costs
- the first one with 0 redistribute cost may perform local chunking

E.g. in memory efficient SDPA, the default orders of candidate strategies have a `Shard(2)` one before the `Replicate()` one. https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_matrix_ops.py#L500-L512
When the input is `Replicate()`, `_select_strategy` will pick the `Shard(2)` strategy and do local chunking first, before local computation. This is clearly unexpected to users.

In this PR, we improve `_select_strategy` so that when multiple strategies have 0 redistribute cost, we prioritize the one which keeps input unchanged.
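
The tie-breaking rule can be sketched in plain Python. This is a simplified model under assumptions: `Strategy`, its string-valued `input_placement`, and `select_strategy` are hypothetical stand-ins, not the DTensor data structures.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    input_placement: str       # placement the strategy expects for its input
    redistribute_cost: float


def select_strategy(strategies, current_placement):
    min_cost = min(s.redistribute_cost for s in strategies)
    candidates = [s for s in strategies if s.redistribute_cost == min_cost]
    if min_cost == 0:
        # Among free strategies, prefer one that leaves the input as it is.
        for s in candidates:
            if s.input_placement == current_placement:
                return s
    # Otherwise keep the old behaviour: first strategy with minimum cost.
    return candidates[0]


strategies = [
    Strategy("shard_seq", "Shard(2)", 0.0),
    Strategy("replicate", "Replicate()", 0.0),
]
chosen = select_strategy(strategies, "Replicate()")
```

With a replicated input, the replicated strategy is picked even though the sharded one comes first in the candidate list.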

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161882
Approved by: https://github.com/ezyang
2025-09-02 05:41:56 +00:00
1aeb421c34 Make pattern matcher resilient to ddes (#161843)
Motivated by the following discord support chat: https://discord.com/channels/1189498204333543425/1409578286186758195

```
import torch
@torch.compile(fullgraph=True, mode='reduce-overhead')
def get_mask(W: torch.Tensor, percentage_nonzeros: torch.Tensor):
    total_elements = W.numel()
    k = int(total_elements * percentage_nonzeros)
    top_k_indices = torch.topk(torch.abs(W).flatten(), k)[1]
    mask = torch.zeros(total_elements, dtype=torch.bool, device=W.device)
    mask.scatter_(0, top_k_indices, True)
    mask = mask.view(W.shape)
    return mask

x = torch.randn((128, 64), device='cuda')
p = torch.tensor(0.50, device='cuda')
get_mask(x, p)
```

Results in

```
InductorError: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(TruncToInt(zuf0), 1) (unhinted: Eq(TruncToInt(zuf0), 1)).  (Size-like symbols: none)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161843
Approved by: https://github.com/ezyang
2025-09-02 05:16:13 +00:00
5561e45758 [HOTFIX] Disable DISTRIBUTED_C10D_DIRECT_ACCESS for now (#161946)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161946
Approved by: https://github.com/msaroufim
2025-09-02 05:01:46 +00:00
8171d6052e Clear custom autograd Function ctx.to_save earlier (#161171)
Fixes https://github.com/pytorch/pytorch/issues/161186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161171
Approved by: https://github.com/albanD
2025-09-02 03:26:31 +00:00
d5e0f4202b Fixes broken memory_viz link in CUDA memory docs (#161426)
Fixes #161375

The "Using the visualizer" section in torch_cuda_memory.md had a link to https://pytorch.org/memory_viz written in inline Markdown link form. Strangely, the same syntax worked earlier on the page, as the issue reporter mentioned, but in this spot it was rendered as a broken link.

I wasn't able to pinpoint why the second occurrence was treated differently, but switching it to the Markdown autolink form fixes the problem consistently. I tested this by rebuilding the docs locally with make html and serving the HTML with a local http.server. With the autolink, the link resolves correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161426
Approved by: https://github.com/soulitzer
2025-09-02 02:06:54 +00:00
13d66e2a66 [BE][Easy] restore #157584 after #158288 (#158541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158541
Approved by: https://github.com/ezyang
2025-09-02 02:06:50 +00:00
bbedc71fd3 test: ensure editable cached wrapper is respected (#160943)
## Summary
- add a test verifying that editing the local cache wrapper is picked up after Dynamo reset

## Testing
- `lintrunner -a` *(fails: FLAKE8 failure, TEST_HAS_MAIN failure, CODESPELL failure, PYFMT failure)*
- `PYTHONPATH=. python test/inductor/test_codecache.py TestPyCodeCache.test_editable_cached_wrapper -v`

------
https://chatgpt.com/codex/tasks/task_e_68a3aa3fcc9883239b17d1f4250d1e89

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160943
Approved by: https://github.com/xmfan
2025-09-02 01:48:30 +00:00
e9481b6617 [dynamo] Prevent unnecessary recompile on disabled functions in the compiled frame (#161883)
Trying out a re-impl of https://github.com/pytorch/pytorch/pull/160934

The above PR led to OOM, most likely because the cache was holding on to a nested function (which would have been garbage collected if it were not held in the cache), and that function holds on to CUDA tensors in its closure.
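
The leak mechanism described above can be reproduced without CUDA. The sketch below only demonstrates the problem, not how this PR avoids it. `Payload` is a hypothetical stand-in for a large tensor, and the dictionary plays the role of the cache.

```python
import gc
import weakref


class Payload:
    """Stand-in for a large object (e.g. a CUDA tensor) captured by a closure."""


def make_nested():
    payload = Payload()

    def nested():
        return payload

    # Return a weak reference too, so liveness can be observed from outside.
    return nested, weakref.ref(payload)


strong_cache = {}
fn, payload_ref = make_nested()
strong_cache["key"] = fn   # the cache now owns a strong reference
del fn
gc.collect()
alive_with_strong_cache = payload_ref() is not None

strong_cache.clear()       # dropping the cache entry releases the closure
gc.collect()
alive_after_clear = payload_ref() is not None
```

As long as the cache entry exists, the nested function, its closure cell, and the captured object all stay alive. Once the entry is gone, the whole chain can be collected.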

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161883
Approved by: https://github.com/jansel
2025-09-02 01:13:48 +00:00
1c1b28d5b6 Fix slice scatter dtype consistency (#160851)
Fixes #147842
Fix torch.slice_scatter type inconsistency issue. I noticed previous PRs on this have stalled, so I'm opening this new PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160851
Approved by: https://github.com/soulitzer
2025-09-02 01:08:26 +00:00
2a5c0785e2 [AOTI] split too long string to smaller pieces when its length larger than 16000, fix msvc c2026. (#161850)
Split overly long strings into smaller pieces when their length exceeds 16000, to fix MSVC C2026.

reproducer:
```cmd
pytest test\inductor\test_aot_inductor.py -v -k test_runtime_checks_large_cpu
```

Error message:
<img width="1660" height="174" alt="image" src="https://github.com/user-attachments/assets/56fcd9be-24cb-484b-bfdc-f719ff2650b8" />

For MSVC c2026:
https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2026?view=msvc-170

Splitting an overly long string into smaller pieces fixes this issue.
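
A minimal sketch of the splitting idea, assuming the pieces are emitted as adjacent C++ string literals (which the compiler concatenates) and that the text contains no characters that need escaping. `split_c_string_literal` is a hypothetical helper, not the function used in the code generator.

```python
def split_c_string_literal(text, max_len=16000):
    """Return `text` as adjacent C string literals of at most `max_len` characters each."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # An empty input still needs one (empty) literal.
    pieces = [text[i:i + max_len] for i in range(0, len(text), max_len)] or [""]
    return " ".join('"' + piece + '"' for piece in pieces)


# Small limit so the effect is visible; real use would keep the default.
demo = split_c_string_literal("abcdefg", max_len=3)
```

A production version would also have to avoid cutting through an escape sequence such as `\n`, which this sketch does not attempt.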

Local validated:
<img width="1122" height="232" alt="image" src="https://github.com/user-attachments/assets/cac54cc9-be51-4a5d-b408-06755a4debd5" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161850
Approved by: https://github.com/jansel
2025-09-02 00:09:01 +00:00
626cb7df81 Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-01 23:00:21 +00:00
b7034e9c92 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-01 23:00:21 +00:00
13b65196db Revert "Defer loading hipify until it is needed (#160824)"
This reverts commit 403a3a393cda7e60f503f3b04b8805a845dcf45d.

Reverted https://github.com/pytorch/pytorch/pull/160824 on behalf of https://github.com/atalman due to Broke slow tests test_utils.py::TestHipifyTrie::test_special_char_export_trie_to_regex [GH job link](https://github.com/pytorch/pytorch/actions/runs/17387051351/job/49355619371) [HUD commit link](403a3a393c) ([comment](https://github.com/pytorch/pytorch/pull/160824#issuecomment-3243281628))
2025-09-01 21:34:13 +00:00
403a3a393c Defer loading hipify until it is needed (#160824)
Saves a few milliseconds when running a test case:

Before:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow
frames [('total', 1), ('ok', 1)]
inline_call []
.
----------------------------------------------------------------------
Ran 1 test in 1.497s
```

After:
```
$ PYTORCH_TEST_WITH_DYNAMO=1 python test/dynamo/cpython/3_13/test_float.py GeneralFloatCases.test_float_pow
frames [('total', 1), ('ok', 1)]
inline_call []
.
----------------------------------------------------------------------
Ran 1 test in 0.909s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160824
Approved by: https://github.com/zou3519
2025-09-01 20:57:41 +00:00
cbfb005f7c Fix type checking for persistent loads in the weights-only unpickler (#161661)
The error message here implies that we can only call `self.persistent_load(...)` for ints or tuples, but due to the second part of the type check being inverted, weights-only unpickler will throw an exception iff `pid` is an int.
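
The effect of an inverted second clause can be reproduced with two small predicates. These are reconstructions consistent with the behaviour described above, not the actual source lines of the unpickler. Both function names are hypothetical.

```python
def buggy_is_rejected(pid):
    # The second clause is negated twice, so it is true only when pid is an int.
    return type(pid) is not tuple and not type(pid) is not int


def fixed_is_rejected(pid):
    # Reject anything that is neither a tuple nor an int.
    return type(pid) is not tuple and type(pid) is not int
```

With the buggy predicate an int is the only value that gets rejected, whereas the fixed one accepts ints and tuples and rejects everything else.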
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161661
Approved by: https://github.com/Skylion007
2025-09-01 19:57:19 +00:00
d232a95d4a [BE] Consolidate inductor benchmark Docker images and rename jobs (#161536)
We have 4 different versions of inductor benchmark Docker images used in CI at the moment:

1. `pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks` is used by almost all inductor jobs including nightly benchmark
2. `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks` runs inductor unit tests with python 3.12
3. `pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks` runs inductor unit tests with python 3.13
4. `pytorch-linux-jammy-py3-gcc11-inductor-benchmarks` runs inductor unit tests on CPU

My proposal here is to clean up (2) and (3) and to keep (1) under the same setup from https://ghcr.io/pytorch/torchbench.  Simplicity is the key here as inductor workflows are getting more and more complex:
1. Unit tests for Python variant like 3.12 and 3.13 were useful when they were first added to CI.  They are much less useful now.  [Flambeau](https://hud.pytorch.org/flambeau/s/3876ec7b-43f0-42c6-bfbf-899035e5bb77) shows a 0.97 correlation between them.  And we are also moving to 3.14 nowadays.  I want to choose 3.12 for (1), but will do this separately.  This is also what TorchBench and vLLM are using on CI.
1. We are gradually cleaning up 3.9 on CI https://github.com/pytorch/pytorch/issues/161167

Another BE change here is to rename the jobs in various inductor workflows, because names like `linux-jammy-cuda12_8-py3_10-gcc9-inductor-build` are too long and confusing to look at; it is better to use human-friendly names like `inductor-build`.  Other information is already spelled out in the build environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161536
Approved by: https://github.com/zou3519
2025-09-01 19:07:08 +00:00
17fa8eec4a Revert "Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387)"
This reverts commit 4b4cdcfe3af10df624878985caac4e595fbab54c.

Reverted https://github.com/pytorch/pytorch/pull/159387 on behalf of https://github.com/atalman due to need to revert due to merge conflicts, please feel free to merge it back in once conflicts are resolved ([comment](https://github.com/pytorch/pytorch/pull/159387#issuecomment-3242945661))
2025-09-01 17:08:27 +00:00
54e275e0d8 Revert "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142)"
This reverts commit c83cbd2f2a2de2e3258f07de77d8740743df6d2d.

Reverted https://github.com/pytorch/pytorch/pull/161142 on behalf of https://github.com/jeanschmidt due to This PR needs to be reverted to be able to revert another PR, this is due to merge conflicts, I am sorry for this. Please feel free to rebase and merge at your earliest convenience ([comment](https://github.com/pytorch/pytorch/pull/161142#issuecomment-3242937640))
2025-09-01 17:03:50 +00:00
63a9c23fe9 Revert "[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)"
This reverts commit 190c391a28845a14df26abb228d26aa813efb20c.

Reverted https://github.com/pytorch/pytorch/pull/158352 on behalf of https://github.com/atalman due to Broke cuda 13.0 nightly builds https://github.com/pytorch/pytorch/actions/runs/17382188549/job/49341981474 ([comment](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3242871629))
2025-09-01 16:27:03 +00:00
fefee08164 [CD] Add CUDA 13.0 Windows build (#161663)
Test CUDA 13.0 windows build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161663
Approved by: https://github.com/malfet, https://github.com/atalman
2025-09-01 15:27:17 +00:00
21fae99c18 Revert "[cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)"
This reverts commit 55c289d5c104c4959cc125c0fb4fb50c9fc71102.

Reverted https://github.com/pytorch/pytorch/pull/161305 on behalf of https://github.com/atalman due to Broke test_matmul_cuda.py::TestFP8MatmulCUDA::test_float8_error_messages_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/17309011599/job/49140215634) [HUD commit link](1190b7f73e) ([comment](https://github.com/pytorch/pytorch/pull/161305#issuecomment-3242652672))
2025-09-01 14:56:47 +00:00
2ba65472dd [xla hash update] update the pinned xla hash (#161396)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161396
Approved by: https://github.com/pytorchbot
2025-09-01 11:43:03 +00:00
190c391a28 [CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)
## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG (we call it the **capturing graph**) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe:

This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.

In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.
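
The per-stream rule is a reachability check on the capturing graph. A minimal sketch, assuming the graph is given as a plain dict from node to its direct predecessors (the node names and the helpers `is_ordered_before` and `reusable_on_stream` are hypothetical, not allocator code). It also assumes that a marker which is itself a terminal counts as ordered, since new work attaches after the terminals:

```python
def is_ordered_before(preds, marker, terminal):
    """True if `marker` is `terminal` or one of its transitive predecessors."""
    if marker == terminal:
        return True
    stack = list(preds.get(terminal, ()))
    seen = set()
    while stack:
        node = stack.pop()
        if node == marker:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(preds.get(node, ()))
    return False


def reusable_on_stream(preds, free_markers, terminals):
    """Per-stream rule: every free marker precedes every terminal of stream S."""
    return all(
        is_ordered_before(preds, m, t) for m in free_markers for t in terminals
    )


# Two streams free the same block (markers m1, m2); stream 1 later joins both.
preds = {"m1": ["a1"], "m2": ["b1"], "join": ["m1", "m2"]}
before_join = reusable_on_stream(preds, {"m1", "m2"}, {"m1"})
after_join = reusable_on_stream(preds, {"m1", "m2"}, {"join"})
```

Before the join, `m2` is unordered with respect to stream 1's terminal, so the check fails; after the join both markers precede the terminal and the block can be handed out.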

## Implementation

* On `free(block)` during capture
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it was allocated on this stream and whether each of its free markers is a predecessor of the terminal.
    * If yes, hand the block to S for immediate reuse within the same capture.
    * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join; see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198).

In case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. We must therefore wait for the subsequent join before the block can be reused.

## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-01 09:25:01 +00:00
20bfb2539d Skip compilation when FX graph has no calls and returns empty (#160536)
Fixes #160437

Summary:
This PR avoids compiling empty FX graphs generated during graph breaks. If there are no calls in the graph, we can just return the empty list of instructions.

More precisely, in `compile_and_call_fx_graph`, if the FX graph contains no calls (`count_calls(self.graph) == 0`) and the return value list is empty, we now return an empty instruction list immediately.
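
A minimal sketch of the early return, using a hypothetical `Node` stand-in instead of real FX nodes. The `count_calls` here only mimics the idea of counting call-type ops, and `compile_and_call` is an illustrative name, not the Dynamo function:

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for an FX node; only the op kind matters here."""
    op: str


def count_calls(nodes):
    # Count nodes whose op is a call (call_function, call_method, call_module).
    return sum(1 for n in nodes if "call" in n.op)


def compile_and_call(nodes, return_values, compile_fn):
    # Nothing to run and nothing to return: skip compilation entirely.
    if count_calls(nodes) == 0 and len(return_values) == 0:
        return []
    return compile_fn(nodes)
```

A graph made only of `placeholder` and `output` nodes with no return values therefore never reaches `compile_fn`.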

Impact:
module: dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160536
Approved by: https://github.com/Lucaskabela
2025-09-01 08:32:22 +00:00
dd2519abe8 ci: Update sphinx, disable google search by default (#161793)
Includes fixes from https://github.com/pytorch/pytorch_sphinx_theme/pull/207

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161793
Approved by: https://github.com/malfet, https://github.com/albanD
2025-09-01 07:43:39 +00:00
2f6b4b1ad3 [4/N][SymmMem] Add get_remote_tensor + move up get_buffer and get_signal_pad (#161533)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

`get_remote_tensor`: returns a symmetric tensor given a peer rank.

The difference between `get_buffer` API and `get_remote_tensor` API:
- the former accepts an offset, whereas the latter doesn't
- the latter returns a symmetric tensor at `hdl.offset` on `peer`.

As a refactoring, this PR also moves the implementation of `get_buffer` and `get_signal_pad` to the `SymmetricMemory` level as their code is common to all backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161533
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471, #161532
2025-09-01 07:02:06 +00:00
6737e2c996 update supported OS for Intel client GPU (#161699)
update supported OS for Intel client GPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161699
Approved by: https://github.com/chuanqi129, https://github.com/malfet
2025-09-01 05:45:09 +00:00
67c31dcd36 [vllm hash update] update the pinned vllm hash (#161867)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161867
Approved by: https://github.com/pytorchbot
2025-09-01 04:37:13 +00:00
cb1e31362c Remove background thread UT on XPU to fix CI (#161844)
# Motivation
Because we revert `torch._C._set_allocator_settings` in https://github.com/pytorch/pytorch/pull/161626, this UT becomes invalid.
Fix https://github.com/pytorch/pytorch/issues/161697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161844
Approved by: https://github.com/gujinghui
2025-09-01 03:45:26 +00:00
9a665ca3c4 Add __init__.pyi to torch/linalg (#160750)
Fixes #149639

In an effort to improve the type checking coverage, added a stub file for the torch/linalg directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160750
Approved by: https://github.com/Skylion007
2025-08-31 22:39:05 +00:00
d9d6dde0f4 Leak Python filenames so that we can give good dispatcher errors. (#160418)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160418
Approved by: https://github.com/zou3519
2025-08-31 22:31:39 +00:00
68738beff7 PythonArgs::toBool: order cheap mutually exclusive checks first (#161455)
SymBools are not identical to `Py_True` or `Py_False`, so we can do those cheap checks first and at least get plain old bools to go fast.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161455
Approved by: https://github.com/Skylion007
ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317, #161328, #161329, #161432
2025-08-31 21:35:48 +00:00
25f4aaed9e [3/N][SymmMem] Expose offset field from handle (#161532)
As titled, so that kernels relying on direct pointers can use base address and `hdl.offset` to access remote memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161532
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471
2025-08-31 18:08:57 +00:00
61e18b5304 [2/N][SymmMem] Add MemPool allocator and tests (#161471)
(Porting most of #161008)

Hooking SymmetricMemory Allocator to MemPool so that user can create symmetric tensors with regular `torch.zeros`, `torch.arange` etc factories. Also so that our ops can have functional variants that create `out` tensors on symmetric memory.

To end users, this PR supports a python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
    tensor = torch.arange(numel, dtype=dtype, device=device)
```

Added tests for both use cases above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
2025-08-31 18:08:57 +00:00
e92cd94153 removed duplicate imports (#161685)
Fixes #161684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161685
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2025-08-31 16:21:49 +00:00
0d421ace32 fix spelling of word - when (#160185)
Found a typo while reading the codebase for another PR.

This fixes a typo in the word `when` in the following files:

```
native/cpu/PaddingKernel.cpp
native/cpu/batch_norm_kernel.cpp
```

@eqy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160185
Approved by: https://github.com/yewentao256, https://github.com/ezyang
2025-08-31 13:38:23 +00:00
91f0bcf43f [c10d][nvshmem] add nvshmem build rules and dependency for libtorch_cuda (#159562)
Summary:
Add guarded build option for nvshmem-related c10d code with `-c fbcode.caffe2_use_nvshmem`

The guarded clause includes nvshmem device + host code (static-linked) + these 2 files:
- `torch/csrc/distributed/c10d/symm_mem/NVSHMEMSymmetricMemory.cu`
- `torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159562
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-08-31 12:56:51 +00:00
75bc23cfc3 [CPU][Inductor] Improve performance of A16W8 GEMM template (#161148)
**Summary**
This PR improves the performance of A16W8 GEMM template by
- Removing the config with block_n=48 & block_m=16 as it is not very efficient.
- Using AMX microkernel when M >= 5 so that we use AMX instead of AVX512 for M=5~31.
- Converting int8 values to bf16 with intrinsics instead of `at::vec::convert` as the latter does not have optimized implementation for this case.

We saw up to >10% performance gain in various cases of running Llama-3.1-8b-instruct.

**Test plan**
Already covered by UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161148
Approved by: https://github.com/CaoE, https://github.com/jansel
2025-08-31 09:56:29 +00:00
377033757a Use vectorized stores for all dtypes in cat (#161649)
resurrecting #151818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161649
Approved by: https://github.com/Skylion007
2025-08-31 05:42:41 +00:00
f612045ce1 [vllm hash update] update the pinned vllm hash (#161835)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161835
Approved by: https://github.com/pytorchbot
2025-08-31 04:24:04 +00:00
ad7b748686 [AOTI] fix ut, add extension file type for Windows. (#161851)
fix ut, add extension file type for Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161851
Approved by: https://github.com/ezyang
2025-08-31 01:13:29 +00:00
f3697b033e [MPS] add bunch of unary funcs for sparse tensors (#161846)
adds bunch of unary functions for sparse tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161846
Approved by: https://github.com/malfet
2025-08-30 21:13:05 +00:00
2d31c3d99d Pass shared_ptr by value (#161834)
The way AsyncAllreduceCUDADeviceWork is currently implemented,
using it will force a copy of `shared_ptr<gloo::Context>`
because `std::move` does nothing for a const ref.

This PR changes the param type to shared_ptr<> instead of the
const ref. This allows more efficient parameter passing.

Here's an example that demonstrates the issue:

```cpp
#include <memory>
#include <iostream>

struct Foo {};

void useFoo_ref(const std::shared_ptr<Foo>& f) {
    std::shared_ptr<Foo> internal = std::move(f);
    std::cout << "use_count: " << internal.use_count() << '\n';
}

void useFoo_val(std::shared_ptr<Foo> f) {
    std::shared_ptr<Foo> internal = std::move(f);
    std::cout << "use_count: " << internal.use_count() << '\n';
}

int main() {
    std::shared_ptr<Foo> f1 = std::make_shared<Foo>();
    useFoo_ref(std::move(f1)); // prints "use_count: 2"

    std::shared_ptr<Foo> f2 = std::make_shared<Foo>();
    useFoo_val(std::move(f2)); // prints "use_count: 1"
}
```

This also aligns well with [C++ Core Guidelines][1] for handling
smart pointers.

[1]: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines?utm_source=chatgpt.com#Rr-summary-smartptrs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161834
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501
2025-08-30 18:00:37 +00:00
fb2d5ea697 Revert "[2/N][SymmMem] Add MemPool allocator and tests (#161471)"
This reverts commit b291dc9684d00396239a0c7786b7aac71bf69c05.

Reverted https://github.com/pytorch/pytorch/pull/161471 on behalf of https://github.com/atalman due to Multiple internal failures on PR #https://github.com/pytorch/pytorch/pull/161471 will need to land it via co-dev ([comment](https://github.com/pytorch/pytorch/pull/161471#issuecomment-3239283585))
2025-08-30 14:00:29 +00:00
2e1345a0f8 Revert "[3/N][SymmMem] Expose offset field from handle (#161532)"
This reverts commit ff9533970ad76ed1905b90df6515aca50354c193.

Reverted https://github.com/pytorch/pytorch/pull/161532 on behalf of https://github.com/atalman due to Multiple internal failures on PR #https://github.com/pytorch/pytorch/pull/161471 will need to land it via co-dev ([comment](https://github.com/pytorch/pytorch/pull/161532#issuecomment-3239282308))
2025-08-30 13:57:50 +00:00
684ae48c16 Revert "[4/N][SymmMem] Add get_remote_tensor + move up get_buffer and get_signal_pad (#161533)"
This reverts commit 95516ad7e6d92ed131fb6057b29ec52e73190e3c.

Reverted https://github.com/pytorch/pytorch/pull/161533 on behalf of https://github.com/atalman due to Multiple internal failures on PR #[161471](https://github.com/pytorch/pytorch/pull/161471) will need to land it via co-dev ([comment](https://github.com/pytorch/pytorch/pull/161533#issuecomment-3239278635))
2025-08-30 13:51:22 +00:00
b93f87d67b [OpenReg] Integrate Event&Stream from OpenReg Backend into PyTorch (#160100)
We integrated the openreg backend’s `Stream` and `Event` into PyTorch, all of which are similar
to other accelerators like `CUDA`, `XPUs`, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160100
Approved by: https://github.com/albanD
ghstack dependencies: #161603, #160099, #161773
2025-08-30 13:21:28 +00:00
6284881b2a [OpenReg] Add tests of device and memory for OpenReg (#161773)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161773
Approved by: https://github.com/albanD
ghstack dependencies: #161603, #160099
2025-08-30 13:21:28 +00:00
aae9cbb6c0 [OpenReg] Add Event&Stream Support for OpenReg Backend (#160099)
Referring to the signatures and functions of `Stream` and `Event` in CUDA, we use CPU multithreading
and condition variables to implement equivalent capabilities as the underlying foundation of torch_openreg.

**Changes:**

- Add stream capabilities for OpenReg
- Add event capabilities for OpenReg
- Add kernel launch entrypoint for OpenReg
- Add testcases about stream and event for OpenReg
- Add example for OpenReg
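
The idea of emulating streams and events with CPU threads and condition variables can be sketched roughly as follows. This is an illustration in Python with hypothetical `Stream` and `Event` classes, not the OpenReg implementation; it assumes one worker thread per stream and events that are recorded only once:

```python
import queue
import threading


class Event:
    """Completion flag guarded by a condition variable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._done = False

    def _signal(self):
        with self._cond:
            self._done = True
            self._cond.notify_all()

    def query(self):
        with self._cond:
            return self._done

    def synchronize(self):
        with self._cond:
            self._cond.wait_for(lambda: self._done)


class Stream:
    """In-order work queue drained by a single daemon worker thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            self._tasks.get()()

    def launch(self, fn, *args):
        self._tasks.put(lambda: fn(*args))

    def record(self, event):
        # The event fires once all previously queued work has run.
        self._tasks.put(event._signal)

    def wait_event(self, event):
        # Later work on this stream blocks until the event has fired.
        self._tasks.put(event.synchronize)

    def synchronize(self):
        done = Event()
        self.record(done)
        done.synchronize()


out = []
s1, s2 = Stream(), Stream()
ev = Event()
s1.launch(out.append, 1)
s1.record(ev)
s2.wait_event(ev)
s2.launch(out.append, 2)
s2.synchronize()
```

Because stream 2 waits on the event recorded after stream 1's work, the two appends are ordered even though they run on different threads.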
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160099
Approved by: https://github.com/albanD
ghstack dependencies: #161603
2025-08-30 13:21:21 +00:00
dad2e50ac5 [OpenReg] Rename cpu_fallback_blacklist to cpu_fallback_blocklist (#161603)
As the title stated.

Related Infos: https://github.com/pytorch/pytorch/pull/158644#discussion_r2301460839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161603
Approved by: https://github.com/albanD
2025-08-30 13:21:15 +00:00
37da7b777b Fix _scaled_grouped_mm not reported as unsupported on SM100. (#161780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161780
Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel, https://github.com/Skylion007, https://github.com/eqy
2025-08-30 12:33:51 +00:00
c83cbd2f2a [Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142)
Fixes #161384, Fixes #161162, Fixes #160946, Fixes #160947, Fixes #160948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161142
Approved by: https://github.com/jansel
2025-08-30 11:09:07 +00:00
b994f6e3b3 [inductor] check block options after broadcasting and singleton dims have been removed (#161602)
This allows more cases to use tensor descriptors. For example, previously the following block params would not match
because the innermost dimension does not have stride 1:
```python
block_params=BlockParameters(shape=[64, 4, 1, 1], block_shape=[((XBLOCK + 3)//4), Min(4, XBLOCK), 1, 1], strides=[0, 1, 0, 0], offsets=[(xoffset//4), ModularIndexing(xoffset, 1, 4), 0, 0])
```
After broadcasting dimensions and singleton dimensions are removed:
```python
block_params=BlockParameters(shape=[4], block_shape=[Min(4, XBLOCK)], strides=[1], offsets=[ModularIndexing(xoffset, 1, 4)])
```
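
The effect of the reordering can be sketched on plain lists. The helper `drop_trivial_dims` below is hypothetical and only models the shape/stride part of the block parameters, assuming a broadcast dimension is one with stride 0 and a singleton dimension is one with size 1:

```python
def drop_trivial_dims(shape, strides):
    """Remove broadcast (stride 0) and singleton (size 1) dimensions."""
    kept = [(s, st) for s, st in zip(shape, strides) if s != 1 and st != 0]
    return [s for s, _ in kept], [st for _, st in kept]


def innermost_is_contiguous(strides):
    # Simplified stand-in for the "innermost stride must be 1" requirement.
    return bool(strides) and strides[-1] == 1


shape, strides = [64, 4, 1, 1], [0, 1, 0, 0]
new_shape, new_strides = drop_trivial_dims(shape, strides)
```

On the values from the example, the original strides fail the innermost-stride check, while the reduced `[4]` / `[1]` pair passes it.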

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161602
Approved by: https://github.com/jansel
2025-08-30 08:10:51 +00:00
f44ad54bc6 Update torch-xpu-ops commit pin (#161152)
Update the torch-xpu-ops commit to [8b58040ee32689487f660462f655085f31506dab](8b58040ee3), includes:

- Add vectorization path on maxpool forward channel last
- Add FlightRecorder support for ProcessGroupXCCL
- Fix random build failure on codegen
- Suppress dllexport warning on Windows
- Make torch-xpu-ops build depend on ATen XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161152
Approved by: https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-08-30 07:19:24 +00:00
4d3ab2669b Stop trying to intern arguments in PyObject_FastGetAttrString (#161432)
If we want them interned, we should intern at callsites. (The numpy reference has bit rotted; see b222eb66c7 (diff-6bdb6105198083838f51c57b55b3a49472ed23043bb40018f1ea41138e687163))

Profiling a simple torchdispatch benchmark with perf before/after seems to show that time spent copying std::strings and interning Python strings is gone, though there is some noise and the improvement is very small.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161432
Approved by: https://github.com/ezyang
ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317, #161328, #161329
2025-08-30 06:55:43 +00:00
0ee8a4e281 Fix accidental copy in pushPyOutToStack (#161329)
`auto` forces a copy. Confirmed this did something noticeable with perf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161329
Approved by: https://github.com/zpcore, https://github.com/fduwjj, https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317, #161328
2025-08-30 06:55:43 +00:00
eb9526ae35 Avoid double hash lookup in torch._library.simple_registry (#161328)
Not a huge cost, but free win is free.
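
The general pattern, sketched with a hypothetical `Registry` and `Entry` (not the actual `simple_registry` code): a membership test followed by indexing hashes the key twice on a hit, while a single `dict.get` hashes it once.

```python
class Entry:
    def __init__(self, qualname):
        self.qualname = qualname


class Registry:
    def __init__(self):
        self._data = {}

    def find_two_lookups(self, qualname):
        if qualname not in self._data:       # lookup 1
            self._data[qualname] = Entry(qualname)
        return self._data[qualname]          # lookup 2

    def find_one_lookup(self, qualname):
        entry = self._data.get(qualname)     # single lookup on a hit
        if entry is None:
            entry = self._data[qualname] = Entry(qualname)
        return entry
```

Both methods return the same cached entry; only the number of hash lookups on the common hit path differs.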
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161328
Approved by: https://github.com/Skylion007
ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317
2025-08-30 06:55:43 +00:00
302d860157 Improve assert perf in _python_dispatch._correct_storage_aliasing (#161317)
This assertion was expensive because of is_traceable_wrapper_subclass. Finding a cheap check to run first that's likely to let us skip the rest seems to improve things significantly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161317
Approved by: https://github.com/ezyang, https://github.com/XilunWu, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304, #161308, #161315
2025-08-30 06:55:42 +00:00
0c459f2921 Fix pybind enum efficiency issue in return_and_correct_aliasing (#161315)
Scanning a list of pybind enums with `in` is slow. See NOTE in code for full explanation.

This is a significant optimization; will be updating the torchdispatch/return_and_correct_aliasing portion of this stack with benchmark and results soonish.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161315
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304, #161308
2025-08-30 06:55:42 +00:00
b96bcb9fdb Optimize _python_dispatch.return_and_correct_aliasing.get_write_alias (#161308)
- Empty containers are Falsey
- Hoist cheap checks first
- Microbenchmarked single-element set access method

Benchmark code:
```
import timeit

to_test = [
    ('list(x)', 'x = set([3])'),
    ('x[0]', 'x = [3]'),
    ('list(x)[0]', 'x = set([3])'),
    ('next(iter(x))', 'x = set([3])'),
]

for (stmt, setup) in to_test:
    res = timeit.timeit(stmt=stmt, setup=setup)
    print(f"Time for `{stmt}`: {res}")
```

Result with Python 3.13 on Mac (with excess digits manually trimmed; directionally matches result on Linux)
```
Time for `list(x)`: 0.03418
Time for `x[0]`: 0.00852
Time for `list(x)[0]`: 0.03561
Time for `next(iter(x))`: 0.02278
```

FWIW, I was surprised by this result, so I guess I'm glad I wrote the benchmark!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161308
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304
2025-08-30 06:55:42 +00:00
2089ed3d5e Use is, not ==, to check exact type matches in _python_dispatch (#161304)
`is` checks object identity and is more efficient. Google seems to confirm it is the correct way to do an exact type check.
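
A small illustration of the difference, using a hypothetical metaclass to show that `==` on types dispatches to `__eq__` and can be overridden, whereas `is` cannot:

```python
def is_exact_int(x):
    # Exact type match: subclasses such as bool do not qualify.
    return type(x) is int


class AlwaysEqualMeta(type):
    """Hypothetical metaclass whose classes compare equal to anything."""

    def __eq__(cls, other):
        return True

    __hash__ = type.__hash__


class Weird(metaclass=AlwaysEqualMeta):
    pass


eq_result = type(Weird()) == int   # goes through AlwaysEqualMeta.__eq__
is_result = type(Weird()) is int   # pure identity comparison
```

Here `eq_result` is `True` and `is_result` is `False`, and `is_exact_int(True)` is `False` even though `isinstance(True, int)` holds.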
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161304
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292
2025-08-30 06:55:42 +00:00
1a64bf2636 Stop accessing func._schema in _python_dispatch.correct_storage_aliasing (#161292)
`func._schema` is a pybind object; accessing its arguments/returns is expensive and we have no reason to do it anyway. Even though #161301 makes accessing the arguments/returns less expensive, this still seems to improve performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161292
Approved by: https://github.com/wconstab, https://github.com/malfet, https://github.com/bdhirsh
ghstack dependencies: #161301
2025-08-30 06:55:42 +00:00
5d35b49ba7 Fix forced copying def_property_readonly for FunctionSchema & friends (#161301)
This took me a bit to figure out and I'm pretty sure I've looked at
this code before. Pybind uses
`return_value_policy::reference_internal` for `def_property`, which
[causes the owning object to be kept alive for the lifespan of the
return
value](https://pybind11.readthedocs.io/en/stable/advanced/functions.html),
allowing the getter to safely avoid copying the property
value. However, lambdas act like they return `auto`, not
`decltype(auto)`, so our lambdas themselves were forcing copies!

Testing: observed std::vector<Argument> copying disappear in Linux
perf profile of someOpInfo._schema.arguments/returns (in
_python_dispatch.correct_storage_aliasing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161301
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/wconstab
2025-08-30 06:55:42 +00:00
db622842bc [Inductor][CPP] Optimize config selecting for micro gemm when number of mxn blocks can not occupy all the threads (#161144)
If the number of mxn blocks cannot occupy all the threads, using a smaller register block size gives better performance, since the computing size per thread is smaller.
It may get ~20% performance improvement for the real case `m1_n512_k4096`.
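
The selection idea can be sketched as counting output tiles per candidate block size. The candidate sizes, the thread count, and the helper names below are made up for illustration and are not the Inductor micro-gemm configs:

```python
import math


def num_blocks(m, n, block_m, block_n):
    """Number of (block_m x block_n) output tiles covering an m x n result."""
    return math.ceil(m / block_m) * math.ceil(n / block_n)


def pick_register_block(m, n, num_threads, candidates):
    """Return the first candidate (largest first) with a tile for every thread.

    Falls back to the candidate producing the most tiles if none is enough.
    """
    for block_m, block_n in candidates:
        if num_blocks(m, n, block_m, block_n) >= num_threads:
            return block_m, block_n
    return max(candidates, key=lambda c: num_blocks(m, n, *c))


# Hypothetical candidates, largest first, for an m=1, n=512 problem.
CANDIDATES = [(4, 64), (4, 32), (4, 16)]
choice = pick_register_block(1, 512, 32, CANDIDATES)
```

With 32 threads, the larger blocks yield only 8 or 16 tiles, so the smallest candidate is chosen and every thread gets work.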

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161144
Approved by: https://github.com/leslie-fang-intel
2025-08-30 05:53:49 +00:00
77d8e98e1b [Inductor] update exp codegen for better precision (#161829)
Prior to this PR, we have:
```
[Default Behavior] uses `tl.math.exp({x})`:
eager diff: tensor(2.6935e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(9.2757e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013996509159580942, compile_latency:0.0013981951951980592

TORCHINDUCTOR_USE_FAST_MATH=1 uses `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)`:
eager diff: tensor(2.2315e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(3.5329e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013982331859319662, compile_latency:0.0013824134564199367

Update inductor to use `tl.extra.libdevice.exp(tmp0)`:
eager diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0014109122834153282, compile_latency:0.0014062877025520593
```

Since `tl.extra.libdevice.exp` leads to both better precision and on-par latency, we use it by default now.

Note that `tl.extra.libdevice.exp` had a perf issue in [January 2025](https://github.com/triton-lang/triton/issues/5735) because it used `ex2.approx.f32` instead of `ex2.approx.ftz.f32`, so `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)` was used as a workaround. I double-checked that the issue is resolved and that `tl.extra.libdevice.exp` also uses [ex2.approx.ftz.f32](https://github.com/triton-lang/triton/issues/5735#issuecomment-3238421293) today.
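
The workaround relies on the identity exp(x) = 2^(x * log2(e)), where 1.4426950408889634 is log2(e). A minimal plain-Python check of that identity follows. It is illustrative only: it does not exercise Triton or libdevice, and `exp_via_exp2` is a hypothetical helper name.

```python
import math

LOG2_E = 1.4426950408889634  # log2(e), the constant used in the exp2 workaround


def exp_via_exp2(x: float) -> float:
    """Compute e**x as 2**(x * log2(e))."""
    return 2.0 ** (x * LOG2_E)


for x in (-3.0, 0.0, 0.5, 10.0):
    # Both forms agree up to floating-point rounding in float64.
    assert math.isclose(exp_via_exp2(x), math.exp(x), rel_tol=1e-12)
```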

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161829
Approved by: https://github.com/jansel
2025-08-30 04:56:51 +00:00
2fed4fb464 [FlexAttn] Fix Paged Attention Accuracy via Upper Mask Mod and Prevent Invalid Memory Access (#160861)
Fixes #159247
Issue 1: Accuracy Problem with Non-Divisible KV Sequences
---------------------------------------------------------

### Background

Paged attention in flex decoding produced inaccurate results when KV sequence length is not divisible by block size. For example, when `KV_S = 64` and `block_size = 128`, the output didn't match standard attention accuracy.

### Root Cause
The current paged attention does not apply an upper mask mod when converting from the logical to the physical mask mod. Instead, it uses a noop_mask by default, which leaves all values unmasked and leads to an accuracy mismatch. Adding an upper mask mod based on the original actual kv_len (64 in this test case) resolves the issue.

### Solution

*   **Applied proper upper bound masking**: Updated all calls to `convert_logical_block_mask` to pass `kv_len` as a tensor with the proper shape `[B, KV_S]`, which provides the actual KV sequence length for each batch. The function now correctly applies upper bound checks using these per-batch KV sequence lengths.
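
A minimal sketch of the upper-bound idea in plain Python, assuming a mask mod is a callable `(b, h, q_idx, kv_idx) -> bool`. The wrapper `with_upper_bound` and the per-batch `kv_len` list are illustrative simplifications, not the actual `_paged_attention.py` code:

```python
from typing import Callable, Sequence

MaskMod = Callable[[int, int, int, int], bool]


def noop_mask(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
    """Leaves every position unmasked."""
    return True


def with_upper_bound(mask_mod: MaskMod, kv_len: Sequence[int]) -> MaskMod:
    """Wrap mask_mod so positions at or beyond kv_len[b] are masked out."""

    def bounded(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
        return kv_idx < kv_len[b] and mask_mod(b, h, q_idx, kv_idx)

    return bounded


# One batch with 64 valid KV positions inside a 128-wide block.
mask = with_upper_bound(noop_mask, kv_len=[64])
visible = [kv for kv in range(128) if mask(0, 0, 0, kv)]
assert visible == list(range(64))
```

Without the wrapper, `noop_mask` would report all 128 positions as visible, which is the mismatch described under Root Cause.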

### Files Modified
*    `torch/nn/attention/experimental/_paged_attention.py`: Added `kv_len` parameter as a tensor to `get_mask_mod` and applied upper mask to the new mask mod.
*   `test/inductor/test_flex_attention.py`: Updated all related `kv_len` arguments in the tests.
*   `test/inductor/test_flex_decoding.py`: Updated all related `kv_len` arguments in the tests.

Issue 2: Invalid Memory Access (IMA) in Triton Kernels
------------------------------------------------------

### Background

The Triton kernel for flex attention was experiencing invalid memory access errors when running with compute sanitizers, particularly with short KV sequences and small batch sizes.

### Root Cause

*   Kernel launches CTAs (Cooperative Thread Arrays) proportional to GPU's multi-processor count (108 via `SPLIT_KV`)
*   With small workloads, many CTAs remain idle but still attempt to access `kv_indices` with invalid `indices_idx` values
*   This caused out-of-bounds memory access violations

### Solution

Implemented boundary checks with early exit:

1.  **Added `MAX_VALID_KV_IDX` parameter** in `torch/_inductor/kernel/flex/flex_decoding.py`

    *   Calculate maximum valid KV index based on actual `kv_indices` tensor size and pass it to Triton template
2.  **Added early exit logic** in `torch/_inductor/kernel/flex/templates/flex_decode.py.jinja`

    *   Boundary checks before accessing `kv_indices` in both normal and full blocks
    *   Idle CTAs with invalid `indices_idx` skip computation entirely

This prevents invalid memory access while reducing wasted computation on idle thread blocks.
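
The early-exit pattern can be sketched in plain Python by simulating each CTA as a loop iteration. This is not the Triton template: `run_ctas`, `blocks_per_cta`, and the data layout are hypothetical, and only the idea of a bound derived from the `kv_indices` size comes from the description above.

```python
def run_ctas(kv_indices: list[int], num_ctas: int, blocks_per_cta: int) -> list[int]:
    """Simulate CTAs that each read a slice of kv_indices, with a bounds check."""
    max_valid_kv_idx = len(kv_indices)  # derived from the actual kv_indices size
    touched = []
    for cta in range(num_ctas):
        start = cta * blocks_per_cta
        if start >= max_valid_kv_idx:
            # Idle CTA: nothing valid to read, so skip all work.
            continue
        for indices_idx in range(start, start + blocks_per_cta):
            if indices_idx >= max_valid_kv_idx:
                break  # stop before reading past the end
            touched.append(kv_indices[indices_idx])
    return touched


# 108 CTAs launched, but only 3 KV blocks exist: most CTAs exit early.
assert run_ctas([7, 2, 5], num_ctas=108, blocks_per_cta=2) == [7, 2, 5]
```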

Testing & Validation
--------------------

### Accuracy Tests

*   Added comprehensive test cases covering KV sequences not divisible by block sizes
*   Verified output matches standard attention for various sequence length combinations

### Sanitizer Results

`========= COMPUTE-SANITIZER Starting standalone test_max_autotune... Running test_max_autotune on device: cuda max_autotune config: True test_max_autotune completed successfully! Test passed! ========= ERROR SUMMARY: 0 errors`

**Before**: More than 13720 invalid memory access errors with sanitizers
**After**: Clean execution with 0 errors

Both fixes work together to ensure paged attention produces accurate results while running safely without memory access violations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160861
Approved by: https://github.com/BoyuanFeng
2025-08-30 04:50:23 +00:00
76f81b56d3 [audio hash update] update the pinned audio hash (#161836)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161836
Approved by: https://github.com/pytorchbot
2025-08-30 04:23:04 +00:00
82d2d23e85 Add batch option for send/recv_object_list (#160342)
`send_object_list` and `recv_object_list` use regular `send`/`recv` P2P ops which means that they will create 2-rank NCCL communicators between ranks if the communicators have not been initialized.

This adds a `use_batch` option that issues the send/recv through `batch_isend_irecv`, which reuses the communicators already initialized for collectives in the group.

---

Batch P2P ops create (or reuse an existing) communicator keyed by device index.
Regular P2P ops create (or reuse existing) dedicated 2-rank communicators keyed by “rank1:rank2”.

See:

c8205cb354/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L3980-L4008)
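
A toy model of the two keying schemes in plain Python, to show why the batched path avoids creating extra communicators. The cache is a plain dict, the helper names are hypothetical, and the low/high ordering inside `p2p_key` is an assumption; none of this is the `ProcessGroupNCCL` implementation.

```python
def p2p_key(rank_a: int, rank_b: int) -> str:
    """Key for a dedicated 2-rank communicator (ordering is an assumption)."""
    low, high = sorted((rank_a, rank_b))
    return f"{low}:{high}"


def batch_key(device_index: int) -> str:
    """Key for the communicator shared with collectives on this device."""
    return str(device_index)


# The group communicator already exists because collectives ran earlier.
batched: dict[str, str] = {batch_key(0): "group communicator"}
regular: dict[str, str] = dict(batched)

for peer in (1, 2, 3):
    # Regular P2P: each new peer pair adds a dedicated communicator.
    regular.setdefault(p2p_key(0, peer), "2-rank communicator")
    # Batched P2P: every peer maps to the already-initialized entry.
    batched.setdefault(batch_key(0), "group communicator")

assert len(regular) == 4
assert len(batched) == 1
```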

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160342
Approved by: https://github.com/wconstab
2025-08-30 03:29:09 +00:00
e015de1969 Revert "Use vectorized stores for all dtypes (#161649)"
This reverts commit f0a517e333d6204f560d8061a4f70523060c93bf.

Reverted https://github.com/pytorch/pytorch/pull/161649 on behalf of https://github.com/ngimel due to buggy ([comment](https://github.com/pytorch/pytorch/pull/161649#issuecomment-3238895967))
2025-08-30 03:13:40 +00:00
0af56fc33e Cleanup stale submodule directories after checkout (#161748)
Fixes https://github.com/pytorch/pytorch/issues/161510

Test plan:
```
% cd third_party/kineto
% git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive
M	libkineto/third_party/dynolog
M	libkineto/third_party/fmt
M	libkineto/third_party/googletest
Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104)
HEAD is now at fe80f93 Fix MSVC Error (#1134)
Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21'
Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
% git checkout 5e75018; git submodule update --init --recursive
M	libkineto/third_party/dynolog
M	libkineto/third_party/fmt
M	libkineto/third_party/googletest
Previous HEAD position was fe80f93 Fix MSVC Error (#1134)
HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104)
warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty
Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850'
Submodule path 'libkineto/third_party/fmt': checked out '0041a40c1350ba702d475b9c4ad62da77caea164'
Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347'
% cd ../..
% git status
HEAD detached from 649e397c6de
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
	modified:   third_party/kineto (untracked content)

% time git submodule foreach --recursive git clean -ffdx
...
git submodule foreach --recursive git clean -ffdx  0.47s user 0.96s system 88% cpu 1.625 total
% git status
HEAD detached from 649e397c6de
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748
Approved by: https://github.com/atalman
2025-08-30 01:30:44 +00:00
8627a19adf [MPS] sparse add unary funcs + add for sparse tensors (#160839)
Adds several unary functions and `add`. Enables the unary function tests in test_sparse; other tests are not enabled yet because more ops are needed before we can fully migrate to testing SparseMPS with `test_sparse.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160839
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-30 01:09:00 +00:00
ebfee60101 [WIP] more aggressive persistent reduction (#161055)
Gives 18% speedup on rms norm (2048, 32768). And we have seen other instances where inductor is not aggressive enough about codegening persistent reductions - e.g. 39% on [this kernel from torch ao](https://github.com/pytorch/pytorch/issues/159769#issuecomment-3188568335).

Codegen-ing persistent reductions can be risky if you run out of registers. Here, I'm effectively making persistent reductions an option of looped reductions by setting RBLOCK == rnumel, so that we can still fall back to looped reductions as needed.
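
A toy plain-Python model of that relationship (not Inductor codegen; `looped_sum` is a hypothetical helper): a looped reduction walks the reduction dimension in chunks of `RBLOCK`, and setting `RBLOCK == rnumel` collapses the loop to a single iteration, which corresponds to the persistent case.

```python
def looped_sum(values: list[float], rblock: int) -> tuple[float, int]:
    """Reduce values in chunks of rblock; return (sum, number of loop iterations)."""
    total, iterations = 0.0, 0
    for start in range(0, len(values), rblock):
        total += sum(values[start:start + rblock])
        iterations += 1
    return total, iterations


data = [float(i) for i in range(1024)]
rnumel = len(data)

looped = looped_sum(data, rblock=128)         # 8 iterations over chunks
persistent = looped_sum(data, rblock=rnumel)  # RBLOCK == rnumel: one iteration
assert looped == (523776.0, 8)
assert persistent == (523776.0, 1)
```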

As criteria:

- there needs to be significant memory savings from doing a persistent reduction (by keeping data in registers and avoiding another iteration over the input)
- we should not be coalescing on x dimension, otherwise large rblock will inhibit coalescing
- we should not be especially register or arithmetic intensive (this last part uses mem_ops_per_thread, but could be improved).

Still need to do dashboard run, although I'm not sure we get a lot of large rblock in our benchmarks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161055
Approved by: https://github.com/jansel
2025-08-30 01:08:45 +00:00
6db872fa2c Revert "Cleanup stale submodule directories after checkout (#161748)"
This reverts commit 0e45023cf9cbe1cf18279c1b0d391ea9464e7731.

Reverted https://github.com/pytorch/pytorch/pull/161748 on behalf of https://github.com/malfet due to I still see the same failures, and could not understand from the log whether those checks are running or not ([comment](https://github.com/pytorch/pytorch/pull/161748#issuecomment-3238791895))
2025-08-30 01:04:11 +00:00
7c30a9d7fc [MPS] Add slow version of kthvalue (#161817)
This heavily borrows implementation logic from `topk`.
As this method is non-deterministic, modified the logic for cpu-ops indices comparison with just an equality statement, as by default random numbers picked for input tensor allow for quite a lot of overlaps
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161817
Approved by: https://github.com/dcci
2025-08-30 00:44:29 +00:00
c1e504ec2f [SymmMEM] Move AsyncTP tests to a separate test class (#161820)
We move AsyncTP tests to a separate test suite because 1) Async TP ops are not the core symmetric memory APIs, they are more like applications, and 2) MultiProcContinuousTest will skip all the following tests if a test fails (we should fix this too). We still want to get the test signals for the core symmetric memory APIs when Async TP ops fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161820
Approved by: https://github.com/kwen2501
2025-08-30 00:40:40 +00:00
4ad9fbc83a Unify TypeAlias definitions in optimizer.py (#161493)
Fixes #160834

This PR unifies TypeAlias definitions in [optimizer.py](https://github.com/pytorch/pytorch/blob/main/torch/optim/optimizer.py)

This ensures the following:

- Consistency and Standardization
- Enhanced IDE support
- Prevents runtime confusion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161493
Approved by: https://github.com/Skylion007
2025-08-30 00:35:02 +00:00
0f81e7f640 [CI] Fix XPU ci test permission issue (#161389)
This is needed because of the new test runners; see https://github.com/pytorch/pytorch/actions/runs/17161094208/job/48694776064#step:2:124
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161389
Approved by: https://github.com/atalman
2025-08-30 00:03:59 +00:00
3daf20f8e1 [MPS] fix empty input in posneg functions (#161824)
Fix the posneg functions for empty input on MPS. For example:
```python
import torch

input_tensor = torch.empty(0, device="mps")
out_pos = torch.isposinf(input_tensor)
```

Gives:
```
RuntimeError: [srcBuf length] > 0 INTERNAL ASSERT FAILED at "/Users/Irakli_Salia/Desktop/pytorch/aten/src/ATen/native/mps/OperationUtils.mm":551, please report a bug to PyTorch. Placeholder tensor is empty!
```

on main branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161824
Approved by: https://github.com/malfet
2025-08-29 23:12:04 +00:00
3e459491b5 Enable XPU path for FlexAttention (#143553)
[#RFC153024](https://github.com/pytorch/pytorch/issues/153024)

**Motivation**

1. Attention is the critical performance bottleneck in current LLMs, and FlexAttention is a good choice for covering the broad range of attention variants in transformer models. With FlexAttention, it is easy for us to enable paged attention and fused SDPA in the transformers repo on XPU devices. It also provides a candidate for processing attention in LLM ecosystem libraries, e.g., vLLM and SGLang, on XPU devices.
2. FlexAttention is a good starting point for maturing the Intel Triton-based GEMM kernels. FlexAttention provides both a flexattention kernel and a flexdecoding kernel, covering compute-bound and memory-bound GEMM computation, and different shapes should also be supported to serve LLM inference, e.g. head_dim=64, 96, 128, 256.

**What does this PR do?**

 1. Enable the device type for the FlexAttention kernel and UTs so that all important UTs pass on XPU devices.
 2. For E2E model inference, ensure that LLM inference with FlexAttention is functional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143553
Approved by: https://github.com/EikanWang, https://github.com/drisspg

Co-authored-by: Mao Yunfei <yunfei.mao@intel.com>
Co-authored-by: Xingyuan Li <xingyuan.li@intel.com>
Co-authored-by: majing <jing1.ma@intel.com>
Co-authored-by: Xiao, Wang <wang.xiao@intel.com>
2025-08-29 23:10:58 +00:00
0e2c8af5a6 [CI/CD] Windows set git config --global core.ignorecase false (#161813)
Make sure git on Windows has `core.ignorecase` set to false

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161813
Approved by: https://github.com/malfet
2025-08-29 23:04:43 +00:00
ea27464a79 [inductor][decompose k] disable on everything other than cuda (#161795)
# why

- untested so far

# what

- add an empty config heuristic for all devices for decompose k
- the cuda heuristic, because it is more specific, will still be picked
  up
- add notes explaining how to enable on other devices

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "decompose_k"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161795
Approved by: https://github.com/PaulZhang12
ghstack dependencies: #161767
2025-08-29 22:41:27 +00:00
45eccf414f [inductor][heuristics registry] missing heuristic is not an error anymore, cross device heuristics (#161767)
# why

- not having a heuristic is an error but should not crash, just provide 0 configs
- some heuristics are cross device type
- cleaner to be explicit about being cross device type than having to
  enumerate every possible device type

# what

- on registration, supply device_type=None (explicitly) to say this
  heuristic is cross device
- test to guard the heuristics hierarchies

# testing

```
python3 -bb -m pytest test/inductor/test_template_heuristics_registry.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161767
Approved by: https://github.com/PaulZhang12
2025-08-29 22:41:27 +00:00
037f3bd475 [CI] Migrate XPU build and test to python 3.10 (#161708)
Follow-up to #161167
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161708
Approved by: https://github.com/malfet
2025-08-29 22:31:39 +00:00
6e548c1a87 Revert "[CI] Migrate XPU build and test to python 3.10 (#161708)"
This reverts commit 2a70d98abf8256d3d768eff028fca20198579824.

Reverted https://github.com/pytorch/pytorch/pull/161708 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing rocm jobs to fail. See: test/inductor/test_max_autotune.py::TestMaxAutotuneSubproc::test_max_autotune_addmm_search_space_EXHAUSTIVE_dynamic_True [GH job link](https://github.com/pytorch/pytorch/actions/runs/17303310877/job/49125664617) [HUD commit link](2a70d98abf) ([comment](https://github.com/pytorch/pytorch/pull/161708#issuecomment-3238359944))
2025-08-29 21:49:15 +00:00
eb78757708 [inductor] Lift fw_compiler and bw_compiler as toplevel functions. (#161762)
This is a no-op refactor to compiler_fx which lifts the logic of fw_compiler and bw_compiler to toplevel, so that they can be reused in a different stack (e.g. precompile).

Differential Revision: [D81292968](https://our.internmc.facebook.com/intern/diff/D81292968/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161762
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-08-29 21:46:55 +00:00
05eeb29976 [inductor][triton] support JITCallable._hash_lock (#161768)
Fixes #161618

Triton # 7974 introduces a threading.RLock() in JITCallable, which is not pickle-able. This PR adds this field to the list of un-pickleable fields that need to be handled specially.
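
The underlying limitation is easy to reproduce with the standard library: a `threading.RLock` cannot be pickled, so any state containing one has to drop the lock before pickling and recreate it afterwards. The sketch below uses a plain dict and a hypothetical `UNPICKLEABLE` set; it is not the Inductor code path.

```python
import pickle
import threading

state = {"name": "kernel", "_hash_lock": threading.RLock()}
UNPICKLEABLE = {"_hash_lock"}  # hypothetical list of fields to handle specially

try:
    pickle.dumps(state)
    raised = False
except TypeError:
    raised = True  # RLock objects cannot be pickled

# Drop the lock before pickling, then recreate a fresh one after loading.
safe_state = {k: v for k, v in state.items() if k not in UNPICKLEABLE}
restored = pickle.loads(pickle.dumps(safe_state))
restored["_hash_lock"] = threading.RLock()
```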

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161768
Approved by: https://github.com/xuzhao9
2025-08-29 21:20:02 +00:00
18b4fdde8f Add MTIA to floor_divide op (#161575)
Summary: Missed file in op registration resulting in fallback during test

Reviewed By: andyanwang, srsuryadev

Differential Revision: D81085615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161575
Approved by: https://github.com/albanD, https://github.com/malfet
2025-08-29 20:39:29 +00:00
f6368e934e Revert "[MPS] sparse add unary funcs + add for sparse tensors (#160839)"
This reverts commit 93c5112f46a978a029644ae599979416ead5c917.

Reverted https://github.com/pytorch/pytorch/pull/160839 on behalf of https://github.com/atalman due to test_sparse_csr.py::TestSparseCompressedCPU::test_consistency_SparseCSR_asinh_cpu_complex64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/17329155095/job/49201551217) [HUD commit link](93c5112f46) ([comment](https://github.com/pytorch/pytorch/pull/160839#issuecomment-3238093296))
2025-08-29 19:55:39 +00:00
bf6aaba0f7 [while_loop] avoid aliasing when body_fn never executes (#160670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160670
Approved by: https://github.com/zou3519
ghstack dependencies: #160548, #160669
2025-08-29 19:36:37 +00:00
456493f7ed [while_loop][inductor] remove offset check for while_loop (#160669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160669
Approved by: https://github.com/zou3519
ghstack dependencies: #160548
2025-08-29 19:36:37 +00:00
c74e301455 Bump TorchBench version (#161461)
To include the latest fixes from TorchBench. I'll set up a nightly commit hash update for this next.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161461
Approved by: https://github.com/malfet
2025-08-29 19:21:07 +00:00
67457dbb9d Fix non-const reference arguments in torch/csrc/jit/python/init.cpp (#161300)
Shouldn't be any generated code impact, just fixing bad practice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161300
Approved by: https://github.com/wconstab, https://github.com/malfet
ghstack dependencies: #161286
2025-08-29 19:01:32 +00:00
e9bbd28f22 make einsum produce contiguous inputs in more cases (#161755)
Fixes #161729
Written by codex
This won't produce contiguous inputs for all einsum applications: we flatten all right-only and left-only dimensions, so if right and left operand dimensions are interleaved in the output, we cannot (with the current algorithm) produce contiguous output. However, for common cases like the one in the linked issue, it works. Let's see what CI says.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161755
Approved by: https://github.com/malfet, https://github.com/albanD
2025-08-29 18:50:46 +00:00
348d781055 [Inductor] Update Outer Reduction Heuristic (#159093)
Update outer reduction heuristics for significant speedups.

HuggingFace:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 44 51 AM" src="https://github.com/user-attachments/assets/4872a23b-d136-423a-b2e6-187895bccba1" />

Average ~20% speedup on a kernel by kernel basis

TorchBench:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 45 10 AM" src="https://github.com/user-attachments/assets/b8357b6d-6107-4104-b906-292a17d14d48" />

Average ~40% speedup on a kernel by kernel basis

<img width="1705" height="729" alt="Screenshot 2025-08-21 at 5 50 32 PM" src="https://github.com/user-attachments/assets/a9715a2b-9e6c-4b33-ba9f-7870dc561e31" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159093
Approved by: https://github.com/jansel
2025-08-29 18:31:22 +00:00
303f514d5b [CI] Add basic CUDA 13.0 periodic test (#161013)
https://github.com/pytorch/pytorch/issues/159779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161013
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
2025-08-29 17:56:33 +00:00
f532f99822 [AOTI] normalize_path_separator zip file path (#161781)
normalize_path_separator zip file path

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161781
Approved by: https://github.com/angelayi
2025-08-29 17:53:41 +00:00
93c5112f46 [MPS] sparse add unary funcs + add for sparse tensors (#160839)
Adds several unary functions and `add`. Enables the unary function tests in test_sparse; other tests are not enabled yet because more ops are needed before we can fully migrate to testing SparseMPS with `test_sparse.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160839
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-29 16:28:58 +00:00
0f6a08a029 [inductor] Fix SubgraphInfo round trip (#161779)
Currently `numels` is not specific to a created subgraph since it is not retrieved by `dataclasses.fields(SubgraphInfo)` due to it not being type annotated, see [ref](https://docs.python.org/3/library/dataclasses.html#module-dataclasses:~:text=The%20%40dataclass%20decorator%20examines%20the%20class%20to%20find%20fields.%20A%20field%20is%20defined%20as%20a%20class%20variable%20that%20has%20a%20type%20annotation.%20With%20two%20exceptions%20described%20below%2C%20nothing%20in%20%40dataclass%20examines%20the%20type%20specified%20in%20the%20variable%20annotation.).

So for example the following would happen:

```
self.numels = {"x": sympy.Integer(5)}
subgraph_name = "<x>"
with self.create_subgraph_body(subgraph_name):
    self.numels = {"x": sympy.Integer(7)}
# this would print that x has size 7, not the original value of 5
print(self.numels)
# numels would be None because dataclasses.fields(SubgraphInfo) does not include numels
# since it is not type annotated
print(self.subgraph_bodies[subgraph_name])
```
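
The `dataclasses.fields` behavior behind this bug can be reproduced in isolation. In this sketch, `Info` is a hypothetical stand-in for `SubgraphInfo`:

```python
import dataclasses


@dataclasses.dataclass
class Info:
    body: str = ""  # annotated: a dataclass field
    numels = None   # no annotation: a plain class attribute, not a field


field_names = [f.name for f in dataclasses.fields(Info)]
assert field_names == ["body"]
```

Any save/restore logic that iterates over `dataclasses.fields(...)` therefore skips the unannotated attribute.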

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161779
Approved by: https://github.com/eellison
2025-08-29 16:27:29 +00:00
c8fa907e74 Check commit order (#161560)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161560
Approved by: https://github.com/malfet
ghstack dependencies: #161558, #161637
2025-08-29 16:22:58 +00:00
b99a112688 Update optional tag for interpolation in torch.quantile() (#161706)
Fixes #146156

Re-fixes the issue, including the additional fix that was needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161706
Approved by: https://github.com/soulitzer
2025-08-29 16:21:14 +00:00
cd6d63f453 [SymmMEM] Fix test_empty_strided_p2p_persistent (#161677)
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

https://github.com/pytorch/pytorch/pull/161668 should also fix the issue but we can land this PR for a safer test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161677
Approved by: https://github.com/kwen2501
ghstack dependencies: #161676
2025-08-29 16:11:58 +00:00
0e45023cf9 Cleanup stale submodule directories after checkout (#161748)
Fixes https://github.com/pytorch/pytorch/issues/161510

Test plan:
```
% cd third_party/kineto
% git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive
M	libkineto/third_party/dynolog
M	libkineto/third_party/fmt
M	libkineto/third_party/googletest
Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104)
HEAD is now at fe80f93 Fix MSVC Error (#1134)
Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21'
Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
% git checkout 5e75018; git submodule update --init --recursive
M	libkineto/third_party/dynolog
M	libkineto/third_party/fmt
M	libkineto/third_party/googletest
Previous HEAD position was fe80f93 Fix MSVC Error (#1134)
HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104)
warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty
Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850'
Submodule path 'libkineto/third_party/fmt': checked out '0041a40c1350ba702d475b9c4ad62da77caea164'
Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347'
% cd ../..
% git status
HEAD detached from 649e397c6de
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
	modified:   third_party/kineto (untracked content)

% time git submodule foreach --recursive git clean -ffdx
...
git submodule foreach --recursive git clean -ffdx  0.47s user 0.96s system 88% cpu 1.625 total
% git status
HEAD detached from 649e397c6de
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748
Approved by: https://github.com/atalman
2025-08-29 14:07:06 +00:00
823a329984 Revert "Cleanup stale submodule directories in checkout action (#161748)"
This reverts commit f3c5a82139539c63e6f08966e268c4160e138320.

Reverted https://github.com/pytorch/pytorch/pull/161748 on behalf of https://github.com/malfet due to I put the check in the wrong place ([comment](https://github.com/pytorch/pytorch/pull/161748#issuecomment-3237080419))
2025-08-29 13:40:21 +00:00
f0a65cd6d6 Add pg argument to consolidate_safetensors_files_on_every_rank (#161421)
Summary: Based on feedback on https://github.com/pytorch/torchtitan/pull/1625, adding a pg argument to consolidate_safetensors_files_on_every_rank so that we don't infer the pg and users can supply one if needed.

Test Plan:
ensure existing tests pass

Rollback Plan:

Differential Revision: D80954339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161421
Approved by: https://github.com/fegin
2025-08-29 13:31:11 +00:00
627decb0ed [DTensor] fix DTensorTestCase.destroy_pg() when device_type is "cpu" but CUDA device is available (#161015)
**Summary**
When `device_id` is not None, barrier() will choose the accelerator with the highest
priority, which means that if the test specifies CPU for testing while CUDA is
available on the host, barrier() will use CUDA. To avoid this and better respect
`self.device_type`, we add this branch to force barrier() to use CPU when
`self.device_type` is CPU and another accelerator is also available.

**Test**
`pytest test/distributed/tensor/test_dtensor_testbase.py`

**Debugging Output**
```
# from init_process_group()
init pg: backend=gloo, device_id = None
default_pg has backend: gloo, device_types: [device(type='cuda'), device(type='cpu')]

# from barrier()
barrier: device_ids = [10], devices = [], device = None, PG=[device(type='cuda'), device(type='cpu')]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161015
Approved by: https://github.com/tianyu-l
2025-08-29 12:47:11 +00:00
448a7e7e31 Fix SequentialLR deprecate warning about invoke step(epoch) (#149392)
Fixes #116776 #76113 #113222 #67958
## Changes

- Refactor `LRScheduler.step` method, leave `epoch` check logic in public method `step`
- Move update `lr` logic to `_update_lr` method
- Make `SequentialLR` use `_update_lr` to avoid unnecessary warning message
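
A miniature of the refactor, assuming a much simplified scheduler. The classes, the decay formula, and the warning text are hypothetical; only the split between a public `step` that checks `epoch` and an internal `_update_lr` that does not comes from the change list above.

```python
import warnings


class Scheduler:
    """Hypothetical miniature of the step/_update_lr split."""

    def __init__(self, lr: float) -> None:
        self.base_lr = lr
        self.lr = lr
        self.last_epoch = 0

    def step(self, epoch=None) -> None:
        # Public entry point: the deprecation check lives only here.
        if epoch is not None:
            warnings.warn("passing epoch to step() is deprecated", UserWarning)
        self._update_lr(epoch)

    def _update_lr(self, epoch=None) -> None:
        # Internal update: no warning, safe for wrappers to call with an epoch.
        self.last_epoch = self.last_epoch + 1 if epoch is None else epoch
        self.lr = self.base_lr * (0.5 ** self.last_epoch)


class Sequential:
    """Wrapper that needs to reset the inner scheduler to a given epoch."""

    def __init__(self, inner: Scheduler) -> None:
        self.inner = inner

    def step(self) -> None:
        self.inner._update_lr(0)  # bypasses the public warning path


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Sequential(Scheduler(0.1)).step()
assert caught == []
```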

## Test Result

```bash
pytest test/optim/test_lrscheduler.py -vv
```

![image](https://github.com/user-attachments/assets/e1c5527e-193e-4328-bf95-023139ea0416)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149392
Approved by: https://github.com/janeyx99
2025-08-29 11:45:11 +00:00
ed370ae4b0 [unflatten] Fix test by supporting both MappingKey and GetAttrKey (#161599)
Summary: As title

Test Plan:
Run internal tests

Rollback Plan:

Differential Revision: D81115712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161599
Approved by: https://github.com/tugsbayasgalan
2025-08-29 10:08:38 +00:00
5859edf113 [BE][inductor] replace "and" -> "logical_and" in bucketize_binary_search (#160941)
Get rid of these warnings:
```
/home/dberard/local/pytorch-env7/pytorch/torch/_inductor/runtime/triton_helpers.py:317: UserWarning: Logical operators 'and' and 'or' are deprecated for non-scalar tensors;
 please use '&' or '|' instead
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160941
Approved by: https://github.com/malfet, https://github.com/jingsh
2025-08-29 09:27:13 +00:00
5b701a6bb2 [AOTI][Intel GPU] Add XPU quantization ops to AOT Inductor. (#156572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156572
Approved by: https://github.com/EikanWang, https://github.com/angelayi
ghstack dependencies: #157430
2025-08-29 09:19:44 +00:00
48679ef966 [Refactor][XPU] Refactor XPU quantization op and add header files. (#157430)
This PR refactors the XPU quantization ops to align their code structure with the CPU implementation for consistency. It also adds necessary header files to enable future integration with AOTI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157430
Approved by: https://github.com/angelayi
2025-08-29 09:19:44 +00:00
0ca3a6085d use host+device_id to make sure devices are unique in rendezvous request (#161756)
Per title. This is for NVL72 systems, where devices with the same indices on multiple hosts are within the same NVLink domain.
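
To illustrate why the device index alone is not a unique key across hosts, here is a small sketch. The host names and the `device_key` helper are hypothetical and do not reflect the actual rendezvous request format.

```python
def device_key(host, device_id):
    """Build a key that stays unique across hosts (hypothetical format)."""
    return f"{host}:{device_id}"


# Two hosts, each exposing devices 0 and 1.
devices = [("host-a", 0), ("host-a", 1), ("host-b", 0), ("host-b", 1)]

# Keying by index alone collapses devices from different hosts together.
index_only = {device_id for _, device_id in devices}

# Keying by host plus index keeps every device distinct.
host_and_index = {device_key(host, device_id) for host, device_id in devices}
```

With four physical devices, `index_only` holds two entries while `host_and_index` holds four.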

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161756
Approved by: https://github.com/kwen2501
2025-08-29 09:09:45 +00:00
a55d2beb50 [export] Support complex constant in serde (#161517)
Summary:

Fixes #160749

For a model like
```
class M(torch.nn.Module):
    def forward(self, x):
        s = torch.sin(x)
        z = 1j * s
        return z
```
Its graph will be
```
graph():
    %x : [num_users=1] = placeholder[target=x]
    %sin : [num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%x,), kwargs = {})
    %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%sin, 1j), kwargs = {})
    return (mul,)
```

`1j` will appear as a constant complex argument to `aten.mul`.
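
As a rough illustration of what serializing such a constant involves, the sketch below round-trips a complex value through JSON by splitting it into real and imaginary parts. The `encode_constant` and `decode_constant` helpers and the payload layout are assumptions for this example; they are not the schema used by export serde.

```python
import json


def encode_constant(value):
    """Encode a Python constant into a JSON-friendly dict (hypothetical layout)."""
    if isinstance(value, complex):
        # json.dumps cannot handle complex directly, so store both parts.
        return {"type": "complex", "real": value.real, "imag": value.imag}
    return {"type": type(value).__name__, "value": value}


def decode_constant(payload):
    """Invert encode_constant."""
    if payload["type"] == "complex":
        return complex(payload["real"], payload["imag"])
    return payload["value"]


text = json.dumps(encode_constant(1j))
restored = decode_constant(json.loads(text))
```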

Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_complex_constant

Rollback Plan:

Differential Revision: D80672323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161517
Approved by: https://github.com/angelayi
2025-08-29 08:13:21 +00:00
d8a0bdb0d3 [BE][SymmMEM] Change Optional to the shorthand expression for symmetric memory modules (#161676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161676
Approved by: https://github.com/Skylion007
2025-08-29 07:31:16 +00:00
a7c949089a [vllm hash update] update the pinned vllm hash (#161752)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161752
Approved by: https://github.com/pytorchbot
2025-08-29 04:54:31 +00:00
a6456bfa85 [audio hash update] update the pinned audio hash (#161753)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161753
Approved by: https://github.com/pytorchbot
2025-08-29 04:52:58 +00:00
f3c5a82139 Cleanup stale submodule directories in checkout action (#161748)
Fixes https://github.com/pytorch/pytorch/issues/161510

Test plan:
```
% cd third_party/kineto
% git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive
M	libkineto/third_party/dynolog
M	libkineto/third_party/fmt
M	libkineto/third_party/googletest
Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104)
HEAD is now at fe80f93 Fix MSVC Error (#1134)
Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21'
Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
% git checkout 5e75018; git submodule update --init --recursive
M	libkineto/third_party/dynolog
M	libkineto/third_party/fmt
M	libkineto/third_party/googletest
Previous HEAD position was fe80f93 Fix MSVC Error (#1134)
HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104)
warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty
Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850'
Submodule path 'libkineto/third_party/fmt': checked out '0041a40c1350ba702d475b9c4ad62da77caea164'
Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347'
% cd ../..
% git status
HEAD detached from 649e397c6de
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
	modified:   third_party/kineto (untracked content)

% time git submodule foreach --recursive git clean -ffdx
...
git submodule foreach --recursive git clean -ffdx  0.47s user 0.96s system 88% cpu 1.625 total
% git status
HEAD detached from 649e397c6de
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748
Approved by: https://github.com/atalman
2025-08-29 03:21:31 +00:00
5c306c3ccb [fx] Add lru_cache to warning (#161721)
Summary: Added lru_cache to the warning message to avoid flooding logs
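
A minimal sketch of the general technique, assuming a helper named `warn_once` (hypothetical, not the function changed in this PR): wrapping the warning call in `functools.lru_cache` means the body runs once per distinct message, and repeated calls with the same message return the cached result without warning again.

```python
import functools
import warnings


@functools.lru_cache(maxsize=None)
def warn_once(message):
    """Emit a warning the first time each distinct message is seen."""
    warnings.warn(message, UserWarning, stacklevel=2)
```

One trade-off of this approach is that the cache keeps every distinct message alive for the life of the process, so it suits a small, fixed set of messages.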

Test Plan:
CI

Rollback Plan:

Differential Revision: D81245618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161721
Approved by: https://github.com/pianpwk
2025-08-29 02:25:45 +00:00
c1cb1cb26e fix tests caused by has_triton (#161737)
Summary: With this change, the function is only called when we are serializing a Triton HOP. A few tests do unusual mocking that this function doesn't handle well, so this prevents it from being called there.

Test Plan:
att

Rollback Plan:

Differential Revision: D81261486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161737
Approved by: https://github.com/angelayi
2025-08-29 02:25:35 +00:00
5cb1d71e59 [Flex] Fix float16 default config 128 headdim (#161647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161647
Approved by: https://github.com/v0i0
2025-08-29 01:48:06 +00:00
d153af713e [ez] Improve formatting in error messages for dynamic shapes (#161573)
Show the repr of `dim` to make the message clearer. Example: before, `but got batch instead`; after, `but got "batch" instead`.
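
The effect of formatting with the repr can be sketched as follows. The function names and message text are hypothetical, and Python's default `repr` of a string uses single quotes, so the exact quote characters in the real error message may differ from this sketch.

```python
def format_plain(dim):
    # Without repr, a string value blends into the surrounding sentence.
    return f"but got {dim} instead"


def format_with_repr(dim):
    # The !r conversion applies repr(), which quotes strings.
    return f"but got {dim!r} instead"
```

The difference matters most for empty or whitespace-only names, which are invisible in the plain form but show up as `''` or `' '` with the repr.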

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161573
Approved by: https://github.com/angelayi
2025-08-28 23:52:58 +00:00
9b67d8e344 Revert "[RELAND] Close some sources of fake tensor leakage (#161589)"
This reverts commit 5790b009751e6ebba35d3e6d05e7c1b135553eee.

Reverted https://github.com/pytorch/pytorch/pull/161589 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/17305150611/job/49128381649) [HUD commit link](5790b00975) ([comment](https://github.com/pytorch/pytorch/pull/161589#issuecomment-3235224249))
2025-08-28 23:19:36 +00:00
47742081c9 Revert "kill allow_complex_guards_as_runtime_asserts (#160198)"
This reverts commit 69d91b94ba5366f4444d8cb8fd3dab4de4f04d3d.

Reverted https://github.com/pytorch/pytorch/pull/160198 on behalf of https://github.com/jeffdaily due to let's revert again instead of waiting for forward fix, see earlier comments ([comment](https://github.com/pytorch/pytorch/pull/160198#issuecomment-3235165462))
2025-08-28 22:50:37 +00:00
fffa62fa12 Ensure large tensor int32 -> int64 indexing is enabled (#157767)
Fixes: https://github.com/pytorch/pytorch/issues/157446

I think that this delta is worth the switch from block-ptrs, especially since they are deprecated.
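
For context on when int64 indexing becomes necessary, here is a small sketch that checks whether the largest flat element offset of a tensor fits in a signed 32-bit integer. It assumes a contiguous layout, and `needs_int64_indexing` is a hypothetical helper, not part of the kernel code touched by this PR.

```python
import math

INT32_MAX = 2**31 - 1


def needs_int64_indexing(shape):
    """Return True if the largest flat offset overflows a signed int32.

    Assumes a contiguous tensor, so the largest offset is numel - 1.
    """
    return math.prod(shape) - 1 > INT32_MAX
```

For example, a `(4, 16, 32768, 128)` tensor has 2**28 elements and fits in int32 offsets, while `(32, 32, 32768, 128)` has 2**32 elements and does not.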

## Perf Summary

A is nightly and B is this diff, so `negative` means this diff improves perf.
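
As a reading aid, the sketch below computes the difference for two rows taken from the full perf table, assuming the difference is defined as A minus B in TFlops (the exact formula is not stated here, so this is an assumption).

```python
def tflops_delta(version_a, version_b):
    """Difference A - B; negative means version B (this diff) is faster."""
    return version_a - version_b


# noop, bfloat16, shape (2, 16, 1024, 16, 1024, 64)
noop_delta = tflops_delta(258.38834144791923, 258.6353685004612)

# sliding_window, bfloat16, shape (2, 16, 1024, 16, 1024, 64)
sliding_delta = tflops_delta(142.48556906165314, 137.24259849208627)
```

Under that assumption, the `noop` row is a small improvement (negative delta) and the `sliding_window` row is a regression (positive delta).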

TOP 5 differences
<img width="805" height="754" alt="Screenshot 2025-08-24 at 5 49 49 PM" src="https://github.com/user-attachments/assets/aa359cdf-ee9a-427d-be72-1b9aef6f3115" />

<details>
  <summary><strong>Full perf table (click to expand)</strong></summary>

| attn_type | dtype | shape(B,Hq,M,Hkv,N,D) | TFlops Version A | TFlops Version B |
| --- | --- | --- | --- | --- |
| noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 258.38834144791923 | 258.6353685004612 |
| causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.2192450677751 | 140.12393320464972 |
| alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 122.32683823617003 | 118.51603755647925 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.48556906165314 | 137.24259849208627 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 86.59814488695922 | 84.59431398586257 |
| noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 288.52679758135764 | 292.9174195871856 |
| causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 172.25541683643277 | 172.94326459828508 |
| alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 164.40864610599826 | 165.035129576335 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 176.54876886433945 | 175.08057670028145 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 125.22491679812626 | 121.06201152859151 |
| noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 339.11952481874283 | 339.0132835601695 |
| causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 227.58583240284406 | 228.21824999409597 |
| alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 185.98569659868966 | 182.32850843255093 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 188.9495725191772 | 180.31385312481657 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 106.25789530994302 | 106.55084959448476 |
| noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 357.6430536888533 | 363.30843452247274 |
| causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 262.3241154406613 | 265.73250045488 |
| alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 249.30498953911416 | 249.35928192833785 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 224.74126243851808 | 223.71776504077988 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 168.26977014013707 | 165.47991483333809 |
| noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 382.8178701785897 | 384.34752965862685 |
| causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 308.1449710013853 | 311.0653716044644 |
| alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 251.96365252505072 | 243.92283557225903 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 226.69316232745368 | 215.22769268913356 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 153.34142545296405 | 151.9312673939401 |
| noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 396.0998000753126 | 398.35036286102473 |
| causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 333.5198415274966 | 344.6354466169716 |
| alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 310.5955933379696 | 305.66347819546 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 260.4012412689896 | 259.758666997307 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 234.13034252182635 | 227.61676497283614 |
| noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 396.17615538477196 | 401.1419104525502 |
| causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 359.98648311998414 | 360.8285563463094 |
| alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 291.97720707257736 | 281.41694809965253 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 250.1703628419691 | 238.556760291579 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 199.50782826294306 | 191.52327358439223 |
| noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 411.0632004785396 | 413.6362648405517 |
| causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 382.9404387613185 | 397.74886235657607 |
| alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 357.0998545146633 | 350.5115200772392 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 281.8033924428203 | 281.98601309215843 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 282.56595134222135 | 277.4565795466672 |
| noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 408.89838018149516 | 405.14531386840076 |
| causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 396.07662058160264 | 393.4598228299578 |
| alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 317.8822887267849 | 304.754931401036 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 265.8801304948243 | 254.22961974295112 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 227.87390579965614 | 222.19481980110393 |
| noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 427.36821778477025 | 431.3766620314935 |
| causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 410.67994346825 | 423.4666944003808 |
| alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 381.1968748374038 | 381.77668006420424 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 292.5540046358546 | 296.5439130720502 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 321.04573768858114 | 310.7423616656888 |
| noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 427.46148866769903 | 426.162091037068 |
| causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 419.75580537687347 | 421.88640120274334 |
| alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 337.3208051798903 | 327.4912454675092 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 276.5638854539581 | 262.988360558083 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 250.82791326036886 | 245.07367032501736 |
| noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 435.8055824506086 | 441.8803729460534 |
| causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 432.02638235921006 | 450.33161016596273 |
| alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 402.25525939224883 | 393.8564689669916 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 297.5337286675904 | 297.0131881135074 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 343.8697037899545 | 329.8194073407783 |
| noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 267.58912366821056 | 256.91606054118375 |
| causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 150.81723692609629 | 146.32172267858743 |
| alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 129.51029293209245 | 122.72144394093334 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 147.627656359087 | 141.68956350566188 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 87.55100546003591 | 84.91293287692788 |
| noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 299.5931492743986 | 305.884253766691 |
| causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 179.39026367843837 | 181.64741311605096 |
| alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 173.93547669282367 | 173.23972950980564 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 185.90234171599252 | 182.80844545446686 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 128.08176696266082 | 123.27722685662111 |
| noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 340.50674552770664 | 338.9071088484576 |
| causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 225.4438318650432 | 230.22899884832975 |
| alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 194.15123248528312 | 185.02793973094865 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 200.74289714108176 | 191.76606719670647 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 107.03564946728423 | 106.82432377861258 |
| noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 371.31799283918406 | 379.7555394732925 |
| causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 275.97762744310455 | 276.71106853992995 |
| alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 261.6648679783462 | 259.4127232060398 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 237.03108223577615 | 233.92710216149527 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 172.13926800371152 | 168.74390922407585 |
| noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 381.50199487767276 | 383.9043681999597 |
| causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 307.9748883093411 | 312.2403515462001 |
| alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 251.11319684705438 | 243.17870127827277 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 236.3253127246763 | 223.81250201769552 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 154.55693991756874 | 153.11360584987685 |
| noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 407.11400078586615 | 413.53709886086557 |
| causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 348.1705797722622 | 360.09771155957367 |
| alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 321.8593280850388 | 318.2882327401255 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 270.089032013835 | 268.767323026064 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 238.07324557907788 | 228.09842078362692 |
| noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 399.8172853171901 | 401.0954526332136 |
| causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 363.4387330438581 | 364.13111024232677 |
| alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 294.1752429133857 | 283.7235663368415 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 256.8389394007649 | 246.91771015606483 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 199.3378564292656 | 192.40439590901758 |
| noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 425.5150965556111 | 430.8190098707553 |
| causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 396.00437184073013 | 411.3873625655787 |
| alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 369.92803661607815 | 361.43244467343663 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 293.4277354412933 | 295.2529537595746 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 288.0208673072841 | 281.51896404878863 |
| noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 408.3005367220567 | 408.96116482298913 |
| causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 396.90095962766304 | 396.87385456176486 |
| alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 319.0534576137999 | 302.50950358107764 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 270.3334977708081 | 258.8506349486557 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 227.46824134365394 | 222.23759438128766 |
| noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 438.24247309479694 | 437.7975163205371 |
| causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 428.34012029699227 | 433.3215899950434 |
| alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 386.52672049728875 | 388.26216893354984 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 302.71976814728083 | 302.3574867306459 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 327.39760662780986 | 308.6348428844912 |
| noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 423.31308678262695 | 426.6306972137279 |
| causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 412.6983690923106 | 419.4961977664297 |
| alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 337.41003544742273 | 324.2155049126126 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 278.7755890910794 | 265.9194286636502 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 251.55678254755364 | 244.8843180141462 |
| noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 452.5930781172308 | 457.7117122300742 |
| causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 445.05676260348116 | 463.9304535499636 |
| alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 415.78302138389415 | 406.29229555271456 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 308.0311067300895 | 304.91354721414314 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 351.43943626809335 | 329.4476923070317 |
| noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 295.1801525813241 | 291.36521287398904 |
| causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 183.23250549178067 | 182.35421238887605 |
| alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 151.56832453117747 | 151.3422139154794 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 171.02111935180432 | 160.72516856727913 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 74.05765122783826 | 74.5885345035243 |
| noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 314.3587394591763 | 319.2938677773619 |
| causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 224.57002084153177 | 225.48868542008177 |
| alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.00964804143052 | 215.39576159953486 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.1174237618258 | 214.28437413525663 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 121.08920423648368 | 119.55813661872644 |
| noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 362.2193857281911 | 360.05005804275936 |
| causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 279.8840217430121 | 279.5437918286659 |
| alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 227.76617121021982 | 222.8655938229316 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 215.43141176970562 | 207.71852284994702 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 121.35588364218539 | 121.20636565046884 |
| noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 365.1545280898012 | 373.37585444987326 |
| causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 304.360119952975 | 309.1247297936263 |
| alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 287.2603904544586 | 289.25547903162595 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 257.9852675272418 | 257.59069234098115 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 188.35158496670232 | 184.24683960154857 |
| noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 389.9744911369211 | 388.43466897254166 |
| causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 345.9228295166513 | 342.63034895210126 |
| alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 279.56334658247437 | 271.2724375402088 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 245.66477202810066 | 233.49688207371258 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 170.3270720653187 | 166.23863845657382 |
| noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 400.0041140827554 | 402.11182445396497 |
| causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 363.64641830327434 | 375.9288663364792 |
| alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 341.5776139573363 | 335.1160003213424 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 281.1811770268521 | 280.21438270014005 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 247.78716118997716 | 245.3269825179633 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 403.794126680488 | 405.2353919019577 |
| causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 387.079178426863 | 385.1461762057035 |
| alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 309.7847188173431 | 298.0443968374749 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 262.4721750159666 | 250.81679725428586 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 205.70866004479979 | 202.9620839129557 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 413.380982988662 | 418.40270594263103 |
| causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 398.450064800682 | 409.6794973994029 |
| alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 372.26297458194466 | 364.44415106552196 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 293.0818569905912 | 292.85172400643984 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 296.46717085592087 | 285.76362010612763 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 419.3186786037592 | 426.08801580934437 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 408.1648467766632 | 409.4122254207817 |
| alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 329.24396020457345 | 313.5200995121138 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 274.61257504571876 | 255.7801815432177 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 232.63806001220684 | 230.03020843492314 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 435.0785891054788 | 440.39101804225345 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 424.86925312752817 | 435.18898057396825 |
| alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 393.000417896268 | 395.11543361225256 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 297.7755459218185 | 300.7208114715287 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 331.71570861760534 | 318.07127352552885 |
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 424.58602747137405 | 425.84897078470715 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 422.66607285025725 | 423.5524945535485 |
| alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 344.8625760048626 | 331.6793888458635 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 282.0787281511649 | 263.7895634445868 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 252.7301927385177 | 245.41844170037427 |
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 437.0658069164588 | 442.9101960063628 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 433.13788271434646 | 452.3873572709863 |
| alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 404.0959191546953 | 396.7077863894884 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 300.45502211883206 | 301.3439134717943 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 344.11003202413934 | 330.8897663350314 |
| noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 298.4364205341705 | 291.6793556507056 |
| causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 187.6382133139633 | 191.05409897308772 |
| alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 156.55822078636112 | 154.178925976516 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 173.47765221825162 | 169.30862508068464 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 74.5885345035243 | 74.52689061607104 |
| noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 323.12233826013045 | 328.53889207933514 |
| causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 236.75872140126316 | 235.8378325547398 |
| alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 227.17836523816675 | 226.75357076139966 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 224.07209453308036 | 224.07209453308036 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 122.85572156047981 | 121.11642183704716 |
| noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 361.3123326658092 | 360.71014086458337 |
| causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 281.5287983927017 | 281.94301754758345 |
| alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 232.7456696285686 | 226.50976826432776 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 221.5612361744038 | 214.96188822837055 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 121.38311528944315 | 120.85441868178513 |
| noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 380.2579019244734 | 389.2520157863988 |
| causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 316.95230660496924 | 317.87597790618906 |
| alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 301.07968126657323 | 298.02424098422983 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 267.2240756921594 | 267.16353549228154 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 189.82761622494257 | 186.736450261963 |
| noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 389.88665375406805 | 387.9125133037077 |
| causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 348.70619958684887 | 346.6750499749774 |
| alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 280.5472989906087 | 271.22300822012187 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 250.02397620165968 | 241.22532776331445 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 171.67817496107645 | 166.95679280483972 |
| noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 412.626880230807 | 417.60238657950777 |
| causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 374.8829313933945 | 389.4448546468815 |
| alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 353.20410434172436 | 345.7072490717473 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 292.51045924209586 | 291.66621022138287 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 251.6264062063495 | 248.45110052911542 |
| noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 404.0155784550126 | 401.90546837237514 |
| causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 384.4389015599863 | 386.9684324594344 |
| alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 313.3731284132225 | 298.17074251037894 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 264.19199737284265 | 252.8982463999916 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 207.03696315185684 | 202.86697323136772 |
| noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 428.2436763312506 | 433.45005568619536 |
| causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 411.8516531869893 | 428.2753623461049 |
| alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 384.9095037182509 | 372.90888743000744 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 303.2438915629836 | 302.05095952914337 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 301.8689122735564 | 285.0363190513223 |
| noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 423.13592231504805 | 420.3991500185611 |
| causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 407.44527331585493 | 408.5064370765247 |
| alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 330.50050996167414 | 316.8763979925965 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 274.6833786307413 | 259.86098862141324 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 232.24019584158367 | 226.52040268160232 |
| noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 444.4596314237808 | 455.99558915752266 |
| causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 437.4245561244369 | 455.98275147271966 |
| alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 397.3350686877605 | 397.88875599028063 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 308.53809114394545 | 307.1359822042007 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 331.32379843423774 | 316.85293191675646 |
| noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 422.4622274366379 | 425.0407156418684 |
| causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 420.9547052783101 | 430.33779243510276 |
| alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 345.50265346504085 | 332.094855328957 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 280.81715528243365 | 264.6543640282054 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 252.25635200421783 | 245.46235499490305 |
| noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 452.5524207341139 | 461.7512032176736 |
| causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 445.2316469907137 | 464.4523799578466 |
| alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 416.87264016717023 | 409.17124592157046 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 309.42579489389846 | 307.9734464665731 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 350.50782004300623 | 330.98959545427294 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157767
Approved by: https://github.com/Skylion007
2025-08-28 22:43:59 +00:00
c0ed87c82d [Dynamo] Fix weakref.proxy error when torch.compile (#161508)
Fixes #159258

The error occurs when we attempt to create a weak reference from a weak reference proxy.
e9d42b3880/torch/_dynamo/guards.py (L2910-L2915)

In fact, we shouldn't create a weak reference from another reference or proxy; CPython itself checks for this case.
f60f8225ed/Objects/weakrefobject.c (L410-L418)

However, `__weakrefoffset__` is not equal to **0** when the `guarded_object` is in `weakref.ProxyTypes`, and it will wrongly create a weak reference for the `weakref.ProxyTypes`. I think this could be a bug from CPython, but we can prevent it by adding more weakref type checks (`weakref.ProxyTypes` contains `weakref.ProxyType` and `weakref.CallableProxyType`) here.
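
The extra type check described above can be sketched in plain Python. This is a minimal illustration, not the actual code in `torch/_dynamo/guards.py`; the helper name `try_make_weakref` and the class `Guarded` are hypothetical.

```python
import weakref


class Guarded:
    """Plain class whose instances support weak references."""


def try_make_weakref(obj):
    """Return a weakref.ref to obj, or None when one should not be created.

    Hypothetical helper for illustration only.
    """
    # References and proxies are already weak; do not wrap them again.
    # weakref.ProxyTypes is (weakref.ProxyType, weakref.CallableProxyType).
    if isinstance(obj, weakref.ProxyTypes + (weakref.ReferenceType,)):
        return None
    # Types with a zero __weakrefoffset__ (e.g. int) do not support weakrefs.
    if type(obj).__weakrefoffset__ == 0:
        return None
    return weakref.ref(obj)


target = Guarded()
proxy = weakref.proxy(target)

ref = try_make_weakref(target)       # a real weak reference
skipped = try_make_weakref(proxy)    # None: proxies are filtered out
```

Checking the proxy types before looking at `__weakrefoffset__` means the result no longer depends on what offset the proxy type happens to report.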

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161508
Approved by: https://github.com/Lucaskabela, https://github.com/anijain2305, https://github.com/malfet
2025-08-28 22:34:18 +00:00
1069a08dac Enable more nightly tests on s390x (#160893)
Enable more nightly tests on s390x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160893
Approved by: https://github.com/malfet
2025-08-28 22:20:55 +00:00
1190b7f73e Support Triton kernels in SAC region (#161541)
SAC interaction with triton kernel:
- In eager, triton ops are not dispatchable, so they are always ignored by SAC, i.e., always recomputed.
- In compile, although we wrap triton kernels into HOPs, allowing us to intercept them, we still recompute by default rather than save by default, so that compile maintains the invariant of using less memory than eager.
- If you want to do something else (e.g. save the output of your triton kernel) you should wrap it in a custom op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161541
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/xmfan
2025-08-28 21:15:25 +00:00
f46e4bcf43 Revert "Add ciflow/vllm to vLLM commit hash update PR(s) (#161678)"
This reverts commit 0e358050304c6a350dae2bce497bd1867ecc3c9f.

Reverted https://github.com/pytorch/pytorch/pull/161678 on behalf of https://github.com/yangw-dev due to we want to keep the vllm pin updated now, right now we have some failure ([comment](https://github.com/pytorch/pytorch/pull/161678#issuecomment-3234876332))
2025-08-28 20:42:19 +00:00
496052faf6 [inductor][decompose-k] make part of template heuristics (#161098)
# why

- enable it to go through the common template heuristics point
- make it easier to use in common extension points, e.g. the lookup table

# what

- break template heuristic into base + triton
- move k_split generation logic into a templateheuristic for decompose k
- register through normal mechanism

- to make testing work, add a context manager to temporarily set
  template heuristics for a template/op to empty (effectively skipping
  it). This is used for decompose k test to disable triton choices

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D80670918](https://our.internmc.facebook.com/intern/diff/D80670918)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161098
Approved by: https://github.com/jansel
ghstack dependencies: #161026, #161097
2025-08-28 20:14:48 +00:00
f641effe19 [inductor][ez] move template heuristics into dir (#161097)
# why

- simplify the expansion of heuristics beyond just triton (e.g.
  decomposeK)

# what

- move template heuristics and registry into its own folder
- adjust imports accordingly

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D80670917](https://our.internmc.facebook.com/intern/diff/D80670917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161097
Approved by: https://github.com/PaulZhang12, https://github.com/jansel
ghstack dependencies: #161026
2025-08-28 20:14:48 +00:00
688acf0b83 [inductor][mm] restructure decompose k (#161026)
# why

- make it easier to integrate into lookup table later

# what

- current version generates templates on the fly and uses them
  to generate a single choice
- lookup table and performance model work best when there is a
  stable set of templates (with predictable names) and those
  are then parametrized
- this change makes it so that there is a single DecomposeK template
  with a stable name, and the k split is the only parametrization we do
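
For readers unfamiliar with the idea, a decompose-K matmul parametrized only by the k split can be sketched as follows. This is a pure-Python illustration on nested lists, assuming the usual formulation (split the K dimension into `k_split` equal chunks, multiply each chunk, sum the partial products); the function names are hypothetical and this is not the Inductor template itself.

```python
def matmul(a, b):
    """Naive (M, K) x (K, N) matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def decompose_k_matmul(a, b, k_split):
    """Compute a @ b by splitting the K dimension into k_split equal chunks."""
    k = len(b)
    if k % k_split != 0:
        raise ValueError("k_split must evenly divide K")
    chunk = k // k_split
    m, n = len(a), len(b[0])
    out = [[0] * n for _ in range(m)]
    for s in range(k_split):
        lo, hi = s * chunk, (s + 1) * chunk
        # Partial product over the s-th slice of the K dimension.
        partial = matmul([row[lo:hi] for row in a], b[lo:hi])
        # Reduce the partial products by summation.
        for i in range(m):
            for j in range(n):
                out[i][j] += partial[i][j]
    return out


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 3]]
results = {k_split: decompose_k_matmul(a, b, k_split) for k_split in (1, 2, 4)}
```

Every value of `k_split` yields the same product, which is why a single template with the split as its only parameter is enough.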

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_decompose_k_dynamic_False_bfloat16_sizes1 -v
```

Differential Revision: [D80670913](https://our.internmc.facebook.com/intern/diff/D80670913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161026
Approved by: https://github.com/PaulZhang12, https://github.com/jansel
2025-08-28 20:14:41 +00:00
f0a517e333 Use vectorized stores for all dtypes (#161649)
resurrecting #151818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161649
Approved by: https://github.com/Skylion007
2025-08-28 20:06:29 +00:00
bacdd985a9 [PT2] Add fastResizeToZero to all static dispatch kernels (#161679)
Summary:
Add fastResizeToZero whenever we are reusing output tensors. Otherwise it keeps throwing warning
```
Warning: An output with one or more elements was resized since it had shape [10], which does not match the required output shape [181]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function _resize_output_check)
```

Test Plan:
Run local replayer.

```
MODEL_TYPE=ads_mtml_offsite_cvr_oba_optout_dedicated_model
MODEL_ENTITY_ID=786096203
SNAPSHOT_ID=11

HARDWARE_TYPE=1 ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 3443 2>&1 | tee ~/logs/${MODEL_TYPE}/predictor_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}

sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 1000 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_ads_mtml_offsite_cvr_oba_optout_dedicated_model_100 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} false 3443 false 2>&1 | tee ~/logs/${MODEL_TYPE}/replayer_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}
```

Before: P1921177565

After: P1921178087

Rollback Plan:

Differential Revision: D81177596

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161679
Approved by: https://github.com/henryoier
2025-08-28 19:58:40 +00:00
1621b5494c Removed redundant dtype conversion in scaled_dot_product_attention docstring example (#161613)
Suggested changes done for Fixes #161611.

Removed the line attn_bias.to(query.dtype) entirely

Fixes #161611
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161613
Approved by: https://github.com/mikaylagawarecki
2025-08-28 19:58:07 +00:00
69d91b94ba kill allow_complex_guards_as_runtime_asserts (#160198)
Summary: Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept).

Test Plan:
updated tests

Rollback Plan:

Differential Revision: D79903317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160198
Approved by: https://github.com/ezyang
2025-08-28 19:36:19 +00:00
b76f6d117a [ROCm] fix numpy version detection and adjust fudge_factors for MI355 (#161429)
This PR fixes:

- Numpy >= 2.1 version detection (instead of python 3.13 version detection) to skip some tests (numpy 2.1 can be installed for older python versions)
```
test_quantization.py::TestDynamicQuantizedOps::test_qlinear
test_quantization.py::TestDynamicQuantizedOps::test_qlinear_legacy
test_quantization.py::TestQuantizedLinear::test_qlinear
test_quantization.py::TestQuantizedLinear::test_qlinear_leaky_relu
test_quantization.py::TestQuantizedLinear::test_qlinear_relu
test_quantization.py::TestQuantizedLinear::test_qlinear_tanh
test_quantization.py::TestQuantizedLinear::test_qlinear_with_input_q_dq_qweight_dq_output_fp32
```
- A couple of SDPA tests on MI355 by adjusting fudge_factors:

```
test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32
test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32
```
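
The first fix keys the skip on the NumPy version rather than the Python version. A minimal sketch of such a check, assuming a hypothetical helper `version_at_least` that takes a version string (in a real test this would be `numpy.__version__`):

```python
import re


def version_at_least(version, minimum):
    """Return True if the major.minor prefix of `version` is >= `minimum`.

    Hypothetical helper for illustration; `minimum` is a (major, minor) tuple.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


# Versions are passed as strings so the sketch does not need NumPy installed.
skip_on_new_numpy = version_at_least("2.1.0", (2, 1))
skip_on_old_numpy = version_at_least("1.26.4", (2, 1))
```

Comparing integer tuples avoids the pitfall of string comparison, where `"2.10"` would sort before `"2.9"`.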

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161429
Approved by: https://github.com/jeffdaily
2025-08-28 19:32:09 +00:00
130e50afff [Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).

Changes Included

- Implemented the device_assert op and overrode has_side_effect to return True so that dead code elimination does not remove it.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.

Fixes #147282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos, https://github.com/atalman
2025-08-28 18:57:34 +00:00
30ab87c884 [inductor] don't append None to choices (#161672)
Summary: don't append None as a choice to choices in autotune

Test Plan: See internal Diff

Differential Revision: D81188644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161672
Approved by: https://github.com/angelayi
2025-08-28 18:48:50 +00:00
049c08eda8 Revert "[dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. (#160934)"
This reverts commit 8f31aa97a3e1e17bed29b6cedf9884f0c6b145e9.

Reverted https://github.com/pytorch/pytorch/pull/160934 on behalf of https://github.com/anijain2305 due to causes memory leak leading to OOMs ([comment](https://github.com/pytorch/pytorch/pull/160934#issuecomment-3234426359))
2025-08-28 17:56:36 +00:00
affd071858 [export] serialization support for triton_kernel_wrapper_functional (#161314)
Summary: att

Test Plan:
buck2 test mode/opt //caffe2/test:test_export -- test_triton_hop

Rollback Plan:

Differential Revision: D80827767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161314
Approved by: https://github.com/angelayi
2025-08-28 17:42:47 +00:00
dac062f23b Add aoti to mps benchmarks (#160741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160741
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-08-28 17:32:29 +00:00
2a70d98abf [CI] Migrate XPU build and test to python 3.10 (#161708)
Follow #161167
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161708
Approved by: https://github.com/malfet
2025-08-28 17:27:11 +00:00
eqy
55c289d5c1 [cuBLASLt][FP8] cuBLASLt appears to support float8 rowwise-scaling on H100 (#161305)
Following #157905 I think the macro around
```
  TORCH_INTERNAL_ASSERT(use_rowwise == false, "rowwise scaled_gemm not supported with blaslt");
```
was never updated and this would cause `float8` tests to fail. Also it appears the `Lt` accepts two inputs with `e4m3` and `e5m2` dtypes simultaneously, so removing that check here as well...

CC @lw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161305
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-28 17:04:25 +00:00
2042d2174a [MPS] Migrate round unary op to Metal (#161712)
And actually use the right function, as [`torch.round`](https://docs.pytorch.org/docs/stable/generated/torch.round.html) doesn't use `std::round`, but rather `std::rint`, which can be easily seen by running something like
```python
import torch
print(torch.arange(-3., 3., step=.5, device='mps').round())
print(torch.arange(-3., 3., step=.5, device='mps').cpu().round())
```

Before this change it printed
```
tensor([-3., -3., -2., -2., -1., -1.,  0.,  1.,  1.,  2.,  2.,  3.], device='mps:0')
tensor([-3., -2., -2., -2., -1., -0.,  0.,  0.,  1.,  2.,  2.,  2.])
```
But after this change results match
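
The two rounding modes can be reproduced without a GPU. The sketch below uses plain Python rather than torch: the built-in `round` rounds halves to even (like `std::rint` in the default rounding mode), while the hypothetical helper `round_half_away` mimics `std::round`.

```python
import math


def round_half_away(x):
    """Round halves away from zero, as std::round does."""
    return math.copysign(math.floor(abs(x) + 0.5), x)


def round_half_even(x):
    """Round halves to the nearest even integer, as std::rint does by default."""
    return float(round(x))


# Same inputs as torch.arange(-3., 3., step=.5)
values = [-3.0 + 0.5 * i for i in range(12)]
half_away = [round_half_away(v) for v in values]
half_even = [round_half_even(v) for v in values]
```

Here `half_away` reproduces the first (MPS, before the change) tensor printed above, and `half_even` reproduces the CPU result, up to the sign of zero.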

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161712
Approved by: https://github.com/dcci
2025-08-28 16:45:07 +00:00
4fd761fecc [DTensor] Wrap sharding prop error with contextual exception (#161574)
Mainly, this helps tell the user more info about the operator that
failed to run if it fails during sharding propagation.

Previously, only this exception would be raised:
```
RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.')
```

Now you get both the above exception as well as

```
The above exception was the direct cause of the following exception:
RuntimeError: Sharding propagation failed for Op(op=aten.view.default, args_schema=Spec((Replicate(), Shard(dim=0), Shard(dim=1), Shard(dim=2)) on (8, 8, 4)), [64, 4] @ mesh: (1, 2, 2, 2))
```

<stacktrace omitted>
<details><summary>detailed error</summary>

```
======================================================================
ERROR: test_linear (__main__.TestDTensor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 668, in wrapper
    self._join_processes(fn)
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 932, in _join_processes
    self._check_return_codes(fn, elapsed_time)
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 972, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 4 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/data/users/whc/pytorch/torch/distributed/tensor/_dispatch.py", line 150, in dispatch
    self.sharding_propagator.propagate(op_info)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 309, in propagate
    OutputSharding, self.propagate_op_sharding(op_info.schema)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 45, in __call__
    return self.cache(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 329, in propagate_op_sharding_non_cached
    op_strategy = self.op_strategy_funcs[op_schema.op](strategy_schema)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 673, in reshape_strategy
    input_tgt_placements, output_placements = propagate_shape_and_sharding(
  File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 601, in propagate_shape_and_sharding
    in_dim = get_in_dim_to_shard(cmd)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 537, in get_in_dim_to_shard
    raise RuntimeError(
RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 816, in run_test
    getattr(self, test_name)()
  File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 670, in wrapper
    fn()
  File "/data/users/whc/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper
    method(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 490, in wrapper
    raise e
  File "/data/users/whc/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 487, in wrapper
    func(self, *args, **kwargs)  # type: ignore[misc]
  File "/data/users/whc/pytorch/test.py", line 60, in test_linear
    print("results: ", distributed_linear(distributed_input))
  File "/data/users/whc/pytorch/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/nn/modules/linear.py", line 134, in forward
    return F.linear(input, self.weight, self.bias)
  File "/data/users/whc/pytorch/torch/_compile.py", line 53, in inner
    return disable_fn(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/_dynamo/eval_frame.py", line 1005, in _fn
    return fn(*args, **kwargs)
  File "/data/users/whc/pytorch/torch/distributed/tensor/_api.py", line 358, in __torch_dispatch__
    return DTensor._op_dispatcher.dispatch(
  File "/data/users/whc/pytorch/torch/distributed/tensor/_dispatch.py", line 163, in dispatch
    raise RuntimeError(
RuntimeError: Sharding propagation failed for Op(op=aten.view.default, args_schema=Spec((Replicate(), Shard(dim=0), Shard(dim=1), Shard(dim=2)) on (8, 8, 4)), [64, 4] @ mesh: (1, 2, 2, 2))
```
</details>
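
The "direct cause" wording in the output comes from Python's exception chaining (`raise ... from e`). A minimal sketch of the pattern, with hypothetical stand-ins `propagate` and `dispatch` rather than the real DTensor functions:

```python
def propagate(op_name):
    """Stand-in for sharding propagation that fails with a low-level error."""
    raise RuntimeError("Attempted to flatten sharded dimension 1")


def dispatch(op_name):
    """Wrap any propagation failure with context about the operator."""
    try:
        propagate(op_name)
    except Exception as e:
        # `from e` records the original error as __cause__, so the traceback
        # shows both exceptions joined by "direct cause of the following".
        raise RuntimeError(f"Sharding propagation failed for {op_name}") from e


try:
    dispatch("aten.view.default")
except RuntimeError as err:
    outer, inner = err, err.__cause__
```

The original exception stays reachable through `__cause__`, so no information is lost by wrapping.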

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161574
Approved by: https://github.com/zpcore, https://github.com/XilunWu
2025-08-28 15:56:15 +00:00
a8270dd124 Revert "kill allow_complex_guards_as_runtime_asserts (#160198)"
This reverts commit 196232bb935cb346f143d5c39e9a73c44121a033.

Reverted https://github.com/pytorch/pytorch/pull/160198 on behalf of https://github.com/atalman due to dynamo/test_activation_checkpointing.py::ActivationCheckpointingViaTagsTestsCUDA::test_compile_selective_checkpoint_triton_kernel_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/17289619543/job/49074475338) [HUD commit link](196232bb93) ([comment](https://github.com/pytorch/pytorch/pull/160198#issuecomment-3234013520))
2025-08-28 15:40:37 +00:00
63632fc7ee Add new_zeros dtype variant to the shim and as a stable op (#161597)
In case we want this before 2.9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161597
Approved by: https://github.com/mikaylagawarecki
2025-08-28 13:57:24 +00:00
05d0f11dbd Revert "Add test coverage to tf32 in max autotune mm configs (#161545)"
This reverts commit e9d34b2438d65d6d16109e2416f3698de20f85c2.

Reverted https://github.com/pytorch/pytorch/pull/161545 on behalf of https://github.com/atalman due to inductor/test_max_autotune.py::TestMaxAutotuneRemoteCache::test_get_mm_configs_float32_precision_ieee [GH job link](https://github.com/pytorch/pytorch/actions/runs/17283985553/job/49058214260) [HUD commit link](e9d34b2438) ([comment](https://github.com/pytorch/pytorch/pull/161545#issuecomment-3233569771))
2025-08-28 13:46:47 +00:00
ef0483d74c Revert "Ensure large tensor int32 -> int64 indexing is enabled (#157767)"
This reverts commit b36a20d368733740a8507b3109d193c88930323a.

Reverted https://github.com/pytorch/pytorch/pull/157767 on behalf of https://github.com/atalman due to need to revert https://github.com/pytorch/pytorch/pull/157767 internal tests ([comment](https://github.com/pytorch/pytorch/pull/157767#issuecomment-3233558168))
2025-08-28 13:44:41 +00:00
5432966253 Revert "Remove test since it ooms on CI (#161644)"
This reverts commit 443452ca2f5beef58019f4e7e7e31c0526aee0fc.

Reverted https://github.com/pytorch/pytorch/pull/161644 on behalf of https://github.com/atalman due to need to revert https://github.com/pytorch/pytorch/pull/157767 internal tests ([comment](https://github.com/pytorch/pytorch/pull/161644#issuecomment-3233550883))
2025-08-28 13:41:58 +00:00
e9975f501c Revert "Support Triton kernels in SAC region (#161541)"
This reverts commit 149c68071ca033d5e3427e63e05d9969bd4961e4.

Reverted https://github.com/pytorch/pytorch/pull/161541 on behalf of https://github.com/malfet due to Broke some tests in trunk workflow, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=trunk%20%2F%20linux-jammy-cuda12.8 ([comment](https://github.com/pytorch/pytorch/pull/161541#issuecomment-3233457206))
2025-08-28 13:14:53 +00:00
07f76517e7 [Inductor][Windows] Fix Windows test case failure. (#161497)
Fixes Windows test case failures:
- TritonCodeGenTests.test_inductor_sequence_nr
- TritonCodeGenTests.test_indirect_device_assert
- CompiledOptimizerTests.test_static_address_finalizer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161497
Approved by: https://github.com/jansel
2025-08-28 12:40:42 +00:00
3519969e4f [Intel GPU] Enable tensor memory descriptor in triton template for XPU. (#161600)
As Intel Triton now supports tensor descriptor, this PR updates the pinned Intel Triton version and introduces support for Triton MM template with tensor descriptor on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161600
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-08-28 12:39:58 +00:00
5790b00975 [RELAND] Close some sources of fake tensor leakage (#161589)
Reland of https://github.com/pytorch/pytorch/pull/159923

Couple of fixes:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and warn using the FQN of the lifted constant. We warn because some internal users complained it was regressing their exportability.

2. Previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes a silent incorrectness issue when exporting esm and qwen in non-strict mode.

3. We modify yolov3 to fix the previous silent incorrect behaviour
4. We use strict export for levit_128 because it errors in non-strict due to more strict side effect checking

When upgrading the torchbench pin, opacus_cifar10 seems to not run on eager anymore. I verified this by pushing a temporary PR on master with the new pin, so I added it to the expect_fail list.

Differential Revision: [D81133908](https://our.internmc.facebook.com/intern/diff/D81133908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161589
Approved by: https://github.com/avikchaudhuri
2025-08-28 09:46:42 +00:00
2e77a08b95 [cuDNN][TF32] Account for TF32 in test_super_resolution_cuda (#161662)
cuDNN seems to be dispatching to TF32 kernels on B200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161662
Approved by: https://github.com/Skylion007
2025-08-28 08:42:34 +00:00
196232bb93 kill allow_complex_guards_as_runtime_asserts (#160198)
Summary: Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept).

Test Plan:
updated tests

Rollback Plan:

Differential Revision: D79903317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160198
Approved by: https://github.com/ezyang
2025-08-28 07:59:29 +00:00
fa76256603 Revert "[dynamic shapes] use prims_common contiguity in create_example_tensors (#160933)"
This reverts commit 33c3794533844236a6e30ba377e0a6802b279fc8.

Reverted https://github.com/pytorch/pytorch/pull/160933 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160933#issuecomment-3232305708))
2025-08-28 07:39:26 +00:00
d2d4a3c539 Select Algorithm clear feedback savers (#161654)
Add `clear_feedback_savers` and tests for the feedback functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161654
Approved by: https://github.com/masnesral
2025-08-28 06:56:03 +00:00
95516ad7e6 [4/N][SymmMem] Add get_remote_tensor + move up get_buffer and get_signal_pad (#161533)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

`get_remote_tensor`: returns a symmetric tensor given a peer rank.

The difference between `get_buffer` API and `get_remote_tensor` API:
- the former accepts an offset, whereas the latter doesn't
- the latter returns a symmetric tensor at `hdl.offset` on `peer`.

As a refactoring, this PR also moves the implementation of `get_buffer` and `get_signal_pad` to the `SymmetricMemory` level as their code is common to all backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161533
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471, #161532
2025-08-28 06:47:35 +00:00
ff9533970a [3/N][SymmMem] Expose offset field from handle (#161532)
As titled, so that kernels relying on direct pointers can use base address and `hdl.offset` to access remote memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161532
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471
2025-08-28 06:39:12 +00:00
b291dc9684 [2/N][SymmMem] Add MemPool allocator and tests (#161471)
(Porting most of #161008)

Hooking SymmetricMemory Allocator to MemPool so that users can create symmetric tensors with regular factories such as `torch.zeros` and `torch.arange`. This also lets our ops have functional variants that create `out` tensors on symmetric memory.

To end users, this PR supports a python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
    tensor = torch.arange(numel, dtype=dtype, device=device)
```

Added tests for both use cases above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
2025-08-28 06:31:29 +00:00
0fd63fd88b Guard config copy for pickle errors (#161659)
Differential Revision: [D81168335](https://our.internmc.facebook.com/intern/diff/D81168335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161659
Approved by: https://github.com/zou3519
2025-08-28 06:27:48 +00:00
eec876deb6 [SymmMem] Isolate set_device tests to avoid hang (#161668)
`test_symmetric_memory.py` hangs like this:
```
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_False PASSED [5.6364s]
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_True ...
```

This set of tests parameterizes whether the user sets the device before calling `symm_mem.empty`.
However, such parametrization does not work well with `MultiProcContinuousTest` because the set device will "contaminate" the next test function.

The solution is to move the "set device" tests to a separate test suite using the traditional `MultiProcessTestCase`, which respawns processes every time.

Hang is gone now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161668
Approved by: https://github.com/fegin
2025-08-28 05:43:49 +00:00
c83b43d7a8 [1/2]Add summary report for vllm build (#161565)
Demo Run
https://github.com/pytorch/pytorch/actions/runs/17259533323?pr=161565

<img width="1538" height="720" alt="image" src="https://github.com/user-attachments/assets/64f6d7b4-cac6-4c12-863c-b15514bb8810" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161565
Approved by: https://github.com/huydhn
2025-08-28 05:25:55 +00:00
d3d9eb4777 Error when TORCH_STABLE_ONLY is defined in TensorBase.h (#161658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161658
Approved by: https://github.com/albanD
2025-08-28 04:36:31 +00:00
a65db6dc4c [vllm hash update] update the pinned vllm hash (#161363)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161363
Approved by: https://github.com/pytorchbot
2025-08-28 04:14:19 +00:00
149c68071c Support Triton kernels in SAC region (#161541)
SAC interaction with triton kernel:
- In eager, triton ops are not dispatchable, so they are always ignored by SAC, i.e., always recomputed.
- In compile, although we wrap triton kernels into HOPs, allowing us to intercept them, we still recompute by default rather than save by default, so that compile maintains the invariant of using less memory than eager.
- If you want to do something else (e.g. save the output of your triton kernel) you should wrap it in a custom op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161541
Approved by: https://github.com/drisspg, https://github.com/zou3519
ghstack dependencies: #160781
2025-08-28 03:54:46 +00:00
bae01479c3 [Inductor UT] Re-enable test_torchinductor_opinfo.py on XPU. (#161477)
The PR #160222 replaced @skipCUDAIf with @requires_cuda_and_triton in test_torchinductor_opinfo.py, which caused the CI jobs for other devices to skip this large test suite. We attempted to revert #160222 but ran into conflicts. I then opened #160936 to revert the changes from #160222, but that resulted in CPU CI job timeouts. I also filed issue #161132 for assistance, but haven’t received a response yet.

To minimize the impact, this PR re-enables the test suite on XPU first. I will continue to seek help on re-enabling it for CPU afterwards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161477
Approved by: https://github.com/jansel
2025-08-28 03:29:21 +00:00
cyy
8939d151d0 Use std::apply for CPU code (#152526)
The supported compilers are recent enough to enable std::apply in C++17.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152526
Approved by: https://github.com/ezyang
2025-08-28 02:47:54 +00:00
5edc3d814f Add option for TorchDispatchMode to ignore torch.compile internals (#161648)
If TorchDispatchMode.ignore_compile_internals() is True, then we turn
off the TorchDispatchMode during the compilation process, instead
turning it back on during runtime of the compiled artifact.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161648
Approved by: https://github.com/bdhirsh
2025-08-28 02:41:33 +00:00
199c3633bf Fix Inductor Periodic (#161617)
Models are now passing accuracy. # of graph breaks is larger because
these were not actually tested in CI (if the model fails accuracy we
do not assert on # of graph breaks).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161617
Approved by: https://github.com/anijain2305
2025-08-28 02:36:08 +00:00
e9d34b2438 Add test coverage to tf32 in max autotune mm configs (#161545)
Add a test to make sure that the configs are using the correct setting of tf32 to prevent regression.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161545
Approved by: https://github.com/coconutruben
2025-08-28 02:27:58 +00:00
be1612201d [export] Support AC HOP in pre-dispatch (#161479)
Adds the pre-dispatch handling for the AC HOP. This lets the HOP pre-dispatch export without actually pre-dispatch tracing into it. However, this is not sufficient to support AC in export:
- the HOP body will still be in torch IR, so it will fail export verifiers
- the exported module also can't be run in eager because the AC HOP relies on the partitioner to embed RNG state saving/restoring

So it must be lowered by AOT Autograd into post-dispatch before being executed. It suffices for my purposes though.

If users had checkpoint API use in their exported model, the behavior changes from silently incorrect to a validation error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161479
Approved by: https://github.com/ydwu4
ghstack dependencies: #161353
2025-08-28 01:46:25 +00:00
15670f9075 [dtensor] support local_map as a decorator (#161353)
And extract it out as a convenience function for dynamo to wrap

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161353
Approved by: https://github.com/zpcore
2025-08-28 01:46:25 +00:00
0e35805030 Add ciflow/vllm to vLLM commit hash update PR(s) (#161678)
As it should be; otherwise, PRs like https://github.com/pytorch/pytorch/pull/161121 get merged without the signals they need.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161678
Approved by: https://github.com/atalman
2025-08-28 01:35:04 +00:00
92c2daebb6 Add inductor provenance tracking artifacts to cache (#161440)
Summary:

- Add inductor provenance tracking artifacts to cache
- Update the tlparse version pin to `0.4.0`. The old tlparse version errors out on the new tlparse output. The lowest tlparse version that works is `0.3.42`.

tlparse error:
```
thread 'main' panicked at src/parsers.rs:671:71:
called `Result::unwrap()` on an `Err` value: Error("EOF while parsing a value", line: 1, column: 0)
stack backtrace:
   0:     0x55e4ff1c7f00 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h6d42cc84fc840290
   1:     0x55e4ff1ee503 - core::fmt::write::h5af61a909e3ec64d
   2:     0x55e4ff1c4c33 - std::io::Write::write_fmt::h5a7b54aa6e4a315d
   3:     0x55e4ff1c7d52 - std::sys::backtrace::BacktraceLock::print::h555579e7396c26ac
   4:     0x55e4ff1c8caf - std::panicking::default_hook::{{closure}}::h9128866118196224
   5:     0x55e4ff1c8b1a - std::panicking::default_hook::h52e9e7314e0255f6
   6:     0x55e4ff1c9652 - std::panicking::rust_panic_with_hook::h541791bcc774ef34
   7:     0x55e4ff1c93fa - std::panicking::begin_panic_handler::{{closure}}::h6479a2f0137c7d19
   8:     0x55e4ff1c8419 - std::sys::backtrace::__rust_end_short_backtrace::ha04e7c0fc61ded91
   9:     0x55e4ff1c908d - rust_begin_unwind
  10:     0x55e4fef7a030 - core::panicking::panic_fmt::h5764ee7030b7a73d
  11:     0x55e4fef7a406 - core::result::unwrap_failed::h3ff7104a9ace307a
  12:     0x55e4fefb3c56 - <tlparse::parsers::ArtifactParser as tlparse::parsers::StructuredLogParser>::parse::h20bc51a17ffc494a
  13:     0x55e4fef9669a - tlparse::run_parser::h20c7729f151eec62
  14:     0x55e4fef99a1b - tlparse::parse_path::he4892147f47fbade
  15:     0x55e4fef7c760 - tlparse::main::hdc05613b32f4f53b
  16:     0x55e4fef89263 - std::sys::backtrace::__rust_begin_short_backtrace::h15f188f3edf42596
  17:     0x55e4fef8827d - std::rt::lang_start::{{closure}}::he2c21e32a442538e
  18:     0x55e4ff1be0f0 - std::rt::lang_start_internal::h15895544e2012228
  19:     0x55e4fef83975 - main
  20:     0x7f0b3662a610 - __libc_start_call_main
  21:     0x7f0b3662a6c0 - __libc_start_main_alias_2
  22:     0x55e4fef7a610 - <unknown>
  23:                0x0 - <unknown>
```

Test Plan:
```
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r  test_kernel_information_generation
python test/dynamo/test_structured_trace.py -k test_chromium_event
```

Differential Revision: D80976585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161440
Approved by: https://github.com/oulgen
2025-08-28 01:16:02 +00:00
768a1017c5 Allow parallel start NUMA binding (#161576)
# Context
In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`.

However, we raised an exception if the subprocesses were to be spawned in parallel via `ThreadPoolExecutor`, which is an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff).

The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start:

> Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted.

But on further reading, the Linux docs say [`sched_setaffinity` is per *thread*.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python doc is misleading.

I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07)

The upshot is that we actually *can* safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread.
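
The per-thread behaviour described above can be checked with a small standard-library sketch. This is not the linked gist; the helper name `affinity_is_per_thread` is hypothetical, and the check only runs on platforms that expose `os.sched_setaffinity` with at least two CPUs available.

```python
import os
import threading


def affinity_is_per_thread():
    """Return True if pinning a worker thread leaves the main thread's mask intact.

    Returns None when the check cannot run (no sched_setaffinity on this
    platform, fewer than two CPUs available, or the call is not permitted).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    before = os.sched_getaffinity(0)
    if len(before) < 2:
        return None
    target = {min(before)}  # pin the worker to a single allowed CPU
    seen = {}

    def worker():
        try:
            os.sched_setaffinity(0, target)  # pid 0 means "the caller"
            seen["worker"] = os.sched_getaffinity(0)
        except OSError:
            seen["worker"] = None

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if seen.get("worker") is None:
        return None
    after = os.sched_getaffinity(0)  # main thread's mask after the worker ran
    return seen["worker"] == target and after == before
```

If the setting were per process, the main thread's mask would shrink to the worker's single CPU and the function would return `False`.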

# This PR
Remove restrictions against parallel start for NUMA binding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161576
Approved by: https://github.com/d4l3k
2025-08-28 01:15:58 +00:00
0c4a79b7e0 Replace some calls to new with make_{unique,shared} (#160581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160581
Approved by: https://github.com/malfet
2025-08-28 00:30:45 +00:00
9b02435e9f Improve Scheduler init duration (#161491)
Early exit merge_loops() if config.loop_ordering_after_fusion is false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161491
Approved by: https://github.com/jansel
2025-08-28 00:27:51 +00:00
fd60117051 [C10D] add _summarize_ranks util (#160284)
Prints ranges of ranks succinctly.

e.g.

For a strided list of ranks, summarizes down to start:stop:step
```
0:4096:512
```

Omits step if it's 1
```
0:8
```

Note: endpoints are exclusive. This may not be intuitive to everyone,
but in the first example above the last rank is 3584, and in the second it is
7.

Currently, does not support combinations of striding _and_ range (e.g.
can not generate a representation like "0:2, 4:6, ..., 12:14"). Is this
needed / useful? If so it could be added.
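
A minimal sketch of the summarization described above, assuming sorted integer ranks. The function name `summarize_ranks_sketch` is hypothetical and this is not the actual `_summarize_ranks` code; the handling of a single rank and the comma-separated fallback for irregular lists are assumptions made for illustration.

```python
def summarize_ranks_sketch(ranks):
    """Summarize evenly strided ranks as start:stop[:step] with an exclusive stop."""
    ranks = sorted(ranks)
    if not ranks:
        return ""
    if len(ranks) == 1:
        return str(ranks[0])  # assumption: a lone rank is printed as-is
    step = ranks[1] - ranks[0]
    evenly_strided = step > 0 and all(
        b - a == step for a, b in zip(ranks, ranks[1:])
    )
    if evenly_strided:
        stop = ranks[-1] + step  # exclusive endpoint, one stride past the last rank
        if step == 1:
            return f"{ranks[0]}:{stop}"  # omit the step when it is 1
        return f"{ranks[0]}:{stop}:{step}"
    # Assumption: irregular lists fall back to a plain comma-separated form.
    return ",".join(str(r) for r in ranks)
```

With these assumptions, `range(0, 4096, 512)` summarizes to `0:4096:512` (last rank 3584) and `range(8)` to `0:8` (last rank 7).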

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160284
Approved by: https://github.com/XilunWu
2025-08-28 00:17:53 +00:00
97a548b640 [PGO] skip allowlist logging for empty graphs (#161530)
Summary: reduces spurious logging

Test Plan:
test_pgo

Rollback Plan:

Differential Revision: D81060182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161530
Approved by: https://github.com/bobrenjc93, https://github.com/mlazos
2025-08-28 00:12:13 +00:00
c55bdb26e1 Revert "[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)"
This reverts commit 378edb047f83dfb84c2d9c032bddebc5e0147b8f.

Reverted https://github.com/pytorch/pytorch/pull/160677 on behalf of https://github.com/atalman due to new test is failing internally ([comment](https://github.com/pytorch/pytorch/pull/160677#issuecomment-3230152168))
2025-08-27 23:45:12 +00:00
903181bb6f Revert "[2/N][SymmMem] Add MemPool allocator and tests (#161471)"
This reverts commit 4ed71d5412d58746d23f16689cab61da0e8149ef.

Reverted https://github.com/pytorch/pytorch/pull/161471 on behalf of https://github.com/atalman due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/161471#issuecomment-3230069186))
2025-08-27 23:18:36 +00:00
ba201082b6 [TorchScript] ProfilingExecutor - RemoveProfileNodesAndSpecializeTypes None handling (#161538)
ProfilingGraphExecutor works like this:
1. do some unrelated JIT optimizations
2. Add profiling nodes to collect JIT information like tensor dtypes and shapes
3. Do some more unrelated JIT optimizations
4. Remove the profiling nodes and extract the tensor info, and then use the JIT tensor info to do optimizations.

This PR is intended to fix a bug in Step 4, where the profiling nodes were removed. It was previously assumed that all the things that were profiled were either Tensors or Optional[Tensor]s - otherwise, step 2 would not have introduced a profiling node.

However, we saw a case where step 3 would replace Optional[Tensor] inputs with `None` inputs (e.g. if a conditional that returned a Tensor or a None could be statically known to only follow the `None` branch).

To fix this, we essentially just modify the RemoveProfileNodesAndSpecializeTypes assert so that it accepts Tensors, Optional[Tensor]s, or None (the new part).

Note that this issue is probably somewhat uncommon (maybe why we didn't see it for the first 4 years that this code existed). I expect that, typically, any time that step 3 would convert `Optional[Tensor] -> None`, step 1 would have already done that. So it's difficult to reproduce in an end-to-end TorchScript workload.

Differential Revision: [D81068172](https://our.internmc.facebook.com/intern/diff/D81068172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161538
Approved by: https://github.com/nmacchioni
2025-08-27 23:12:15 +00:00
8fc2467fe5 Revert "[3/N][SymmMem] Expose offset field from handle (#161532)"
This reverts commit 68d395d61e9d4601ab1e2bca56eb28253572c662.

Reverted https://github.com/pytorch/pytorch/pull/161532 on behalf of https://github.com/atalman due to need to revert https://github.com/pytorch/pytorch/pull/161471 internal failure ([comment](https://github.com/pytorch/pytorch/pull/161532#issuecomment-3230016806))
2025-08-27 23:06:55 +00:00
30edac5da6 Updates to CuTe DSL template renderer (#161117)
# Summary
This adds a few more render functions available to template writers, specifically get_output and modification. The reasons why are more clear in the next PR in this stack.

<img width="1645" height="364" alt="Screenshot 2025-08-21 at 1 48 50 PM" src="https://github.com/user-attachments/assets/2d508fda-4273-43ef-9edf-086e592e9249" />

The majority of the new code is around the OpOverrides for CuTe DSL. It is a lot to test, and most of the actual testing I have been doing is via score_mods to the flash_attention at the next layer of this stack.

A bunch of score mods that Claude and I came up with, which exercise the actual ops:
``` Py

def causal_mask(score, b, h, q_idx, kv_idx):
    """Causal attention mask."""
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

def relative_bias(score, b, h, token_q, token_kv):
    """Relative position bias."""
    return score + torch.abs(token_q - token_kv)

def relative_bias_v2(score, b, h, token_q, token_kv):
    """Relative position bias with factor of 2."""
    return score + 2 * torch.abs(token_q - token_kv)

def times_two(score, b, h, q_idx, kv_idx):
    """Simple score modification that doubles the score."""
    return score * 2

def alibi_bias(score, b, h, q_idx, kv_idx):
    """ALiBi (Attention with Linear Biases) - used in some modern models."""
    # Different slopes for different heads
    slope = 2 ** (-8 * (h + 1) / 8)  # Simplified version
    return score - slope * torch.abs(q_idx - kv_idx)

def sliding_window(score, b, h, q_idx, kv_idx, window_size=256):
    """Sliding window attention - only attend to nearby tokens."""
    return torch.where(
        torch.abs(q_idx - kv_idx) <= window_size,
        score,
        float("-inf")
    )

def block_diagonal(score, b, h, q_idx, kv_idx, block_size=64):
    """Block diagonal attention pattern."""
    q_block = q_idx // block_size
    kv_block = kv_idx // block_size
    return torch.where(q_block == kv_block, score, float("-inf"))

def additive_bias(score, b, h, q_idx, kv_idx):
    """Test simple addition with position-based bias."""
    return score + (q_idx + kv_idx) * 0.01

def multiplicative_decay(score, b, h, q_idx, kv_idx):
    """Test multiplication with distance-based decay."""
    distance = torch.abs(q_idx - kv_idx)
    return score * torch.exp(-0.1 * distance)

def sine_wave_bias(score, b, h, q_idx, kv_idx):
    """Test trigonometric functions."""
    return score + 0.1 * torch.sin(2 * math.pi * (q_idx - kv_idx) / 64)

def log_distance_penalty(score, b, h, q_idx, kv_idx):
    """Test logarithmic operations."""
    distance = torch.abs(q_idx - kv_idx).float()
    return score - torch.log(1 + distance)

def alternating_mask(score, b, h, q_idx, kv_idx):
    """Test with alternating pattern - good for branch prediction."""
    return torch.where((q_idx + kv_idx) % 2 == 0, score, float("-inf"))

def head_specific_pattern(score, b, h, q_idx, kv_idx):
    """Different behavior per attention head."""
    even_head = h % 2 == 0
    causal = q_idx >= kv_idx
    return torch.where(even_head & causal, score, float("-inf"))

def sparse_strided(score, b, h, q_idx, kv_idx, stride=4):
    """Sparse attention with strided pattern."""
    return torch.where(
        (kv_idx % stride == 0) | (q_idx == kv_idx),
        score,
        float("-inf")
    )

def causal_with_global(score, b, h, q_idx, kv_idx):
    """Causal mask but first few tokens are globally attended."""
    is_causal = q_idx >= kv_idx
    is_global = kv_idx < 4
    return torch.where(is_causal | is_global, score, float("-inf"))

def dilated_attention(score, b, h, q_idx, kv_idx, dilation_rate=2):
    """Dilated attention pattern - exponentially increasing gaps."""
    distance = torch.abs(q_idx - kv_idx)
    is_attended = (distance == 0) | ((distance > 0) & ((distance & (distance - 1)) == 0))
    return torch.where(is_attended, score, float("-inf"))

```
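
To make the `(score, b, h, q_idx, kv_idx)` signature concrete, here is a pure-Python sketch of how a score mod is applied to each attention score before the softmax. It is a scalar stand-in for illustration only: the real score mods above operate on tensors via `torch.where`, and the helper names `causal_mask_scalar`, `apply_score_mod`, and `softmax_row` are hypothetical.

```python
import math


def causal_mask_scalar(score, b, h, q_idx, kv_idx):
    """Scalar analogue of causal_mask: keep the score only where q_idx >= kv_idx."""
    return score if q_idx >= kv_idx else -math.inf


def apply_score_mod(scores, score_mod, b=0, h=0):
    """Apply score_mod to every entry of a [seq_q][seq_kv] score matrix."""
    return [
        [score_mod(s, b, h, q_idx, kv_idx) for kv_idx, s in enumerate(row)]
        for q_idx, row in enumerate(scores)
    ]


def softmax_row(row):
    """Numerically stable softmax over one row; -inf entries get weight 0."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


scores = [[0.0] * 3 for _ in range(3)]
masked = apply_score_mod(scores, causal_mask_scalar)
weights = [softmax_row(row) for row in masked]
```

Here the first query position attends only to itself, while the last one spreads its weight evenly over all three key positions.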

Example outputs:
```
[Test Suite]
Config: batch=4, heads=32, seq_q=8192, seq_kv=8192, dim=128

[Test 1: none]
[No score_mod, flash='enabled'] Found flash_attncute: True
[No score_mod, flash='disabled'] Found flash_attncute: False
✓ Outputs match between flash enabled/disabled
✓ Output matches eager SDPA (rtol=0.001, atol=0.001)

[Test 2: causal]
[With score_mod, flash='enabled'] Found flash_attncute: True
[With score_mod, flash='disabled'] Found flash_attncute: False
✗ Outputs differ between flash modes: Tensor-likes are not close!

Mismatched elements: 17879 / 134217728 (0.0%)
Greatest absolute difference: 0.0078125 at index (0, 15, 15, 60) (up to 0.001 allowed)
Greatest relative difference: 2.5 at index (3, 22, 153, 126) (up to 0.001 allowed)

[Test 3: rel_bias]
[With score_mod, flash='enabled'] Found flash_attncute: True
[With score_mod, flash='disabled'] Found flash_attncute: False
✗ Outputs differ between flash modes: Tensor-likes are not close!

Mismatched elements: 12836 / 134217728 (0.0%)
Greatest absolute difference: 0.015625 at index (0, 3, 2775, 84) (up to 0.001 allowed)
Greatest relative difference: 11.8125 at index (3, 28, 4095, 76) (up to 0.001 allowed)

[Test 4: rel_bias_v2]
```

This is bfloat16 and there are no major differences. The list of pointwise ops here isn't exhaustive, but it covers a fair range.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161117
Approved by: https://github.com/mlazos
2025-08-27 23:01:31 +00:00
12c0cf3fab switch prefer_deferred_runtime_asserts_over_guards in export (#160111)
Summary:
In preparation for checking shape guards in export, this PR effectively switches `prefer_deferred_runtime_asserts_over_guards` to `False`, matching Dynamo.

Actually that's a lie: we switch it to `allow_complex_guards_as_runtime_asserts`, which is `False` by default but can be controlled via an internal API to be `True`. This makes the two flags synchronized, so we should be able to kill `allow_complex_guards_as_runtime_asserts` at this point.

Test Plan:
updated tests

Rollback Plan:

Differential Revision: D79734206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160111
Approved by: https://github.com/tugsbayasgalan
2025-08-27 22:51:10 +00:00
6b051d7de3 [BE] Refactor trymerge for readability (#161637)
Two changes:
- Extract getting the last_commit's sha into its own function
- Rename merge_changes to merge_changes_locally to better explain its functionality
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161637
Approved by: https://github.com/seemethere, https://github.com/malfet
ghstack dependencies: #161558
2025-08-27 22:44:00 +00:00
ee0ec21191 Ensure that tensors are contiguous before using no-graph MPS impl (#161641)
Fixes #161640

Check if tensors are contiguous before using the no-graph implementation. Using the script in the issue above with this change I get expected results.

```
MPS contiguous result sample: tensor([ 1.3600, -2.9516,  1.3207, -3.5132,  1.7061], device='mps:0')
MPS non-contig result sample: tensor([ 1.3600, -2.9516,  1.3207, -3.5132,  1.7061], device='mps:0')
CPU non-contig result sample: tensor([ 1.3600, -2.9516,  1.3207, -3.5132,  1.7061])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161641
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-27 22:31:57 +00:00
7da02bf8af Skip const folding with symbolic expression (#161437)
Summary: When performing constant folding, we must skip over operators that have symbolic `fill_value`.

Test Plan:
CI

Rollback Plan:

Reviewed By: kalpit-meta-1

Differential Revision: D80965936

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161437
Approved by: https://github.com/StellarrZ
2025-08-27 22:09:58 +00:00
1041805c1e [dynamo, nested graph breaks] prevent excessive recompilations (#159786)
Nested continuation function code objects are now unique w.r.t. stack trace below (and including) the current code object.

Without this change, e.g. in the added test, `f3` would be recompiled on the second graph break.

Followup: we can skip guards on continuation functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159786
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329, #159678, #159817, #160138
2025-08-27 21:53:37 +00:00
6562646dab [dynamo, nested graph breaks] clean up comments and codegen (#160138)
Fix comments to reflect that we no longer codegen cells to be sent to resume function as inputs - they are instead codegen'd after the unsupported instruction in order to build resume functions that are closures.

Also simplify some codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160138
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329, #159678, #159817
2025-08-27 21:53:37 +00:00
d0a242e547 [dynamo, nested graph breaks] support nested closures (#159817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159817
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329, #159678
2025-08-27 21:53:37 +00:00
3f8090809f [dynamo, nested graph breaks] support nested graph breaks x context managers (#159678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159678
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329
2025-08-27 21:53:37 +00:00
10d93325b1 [dynamo, nested graph breaks] support very simple nested graph breaks (#159329)
e.g. this graph breaks once now:
```python
import torch

torch._dynamo.config.nested_graph_breaks = True

def inner(x):
    x = x + 1
    torch._dynamo.graph_break()
    return x + 2

@torch.compile(backend="eager")
def outer(x):
    return inner(x)

print(outer(torch.ones(3)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159329
Approved by: https://github.com/anijain2305
2025-08-27 21:53:37 +00:00
68fa882dad [dynamo] Correctly track mutation class source for MutableMappingVariable (#161568)
Fixes https://github.com/pytorch/pytorch/issues/161505

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161568
Approved by: https://github.com/Lucaskabela, https://github.com/malfet
2025-08-27 21:47:17 +00:00
b9c6aa1e17 Revert "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)" (#161628)
This reverts commit ae1a706444d6c0a6019ffc936c8b36574335a5d5.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161628
Approved by: https://github.com/atalman
ghstack dependencies: #161625, #161626, #161627
2025-08-27 21:37:14 +00:00
b7b9fb9962 Revert "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)" (#161627)
This reverts commit c1145852a5eac96f5551b5d1805109ce4dc5e1fa.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161627
Approved by: https://github.com/atalman
ghstack dependencies: #161625, #161626
2025-08-27 21:37:14 +00:00
c03d8d4082 Revert "Generalize torch._C._set_allocator_settings to be generic (#156175)" (#161626)
This reverts commit 908c5cc4c0f22d141776bde47c296b5186691855.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161626
Approved by: https://github.com/atalman
ghstack dependencies: #161625
2025-08-27 21:37:14 +00:00
clr
40f46b09c7 async_compile: Fix the wait method to actually wait (#161561)
This method never triggered. It's used in 2 tests and they pass, so no serious
concern.

Note that I did introduce and fix a latent bug, which is if we called
shutdown_compile_workers, jobs would crash with this change due to ready_future
being finished if we called wait.

However we only call wait in tests so that bug is fine.

The other behaviour change is that, if you called shutdown, I believe we could
previously block on your first triton compile after that until the pool was
ready. This should now correctly switch to direct mode until the pool is ready on
later warmups.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161561
Approved by: https://github.com/masnesral
ghstack dependencies: #161452
2025-08-27 21:35:31 +00:00
clr
0d6597138c inductor: Log the specific triton kernel that fails (#161452)
Added an optional name argument to SubprocPool.submit.

We record this in a dictionary, and when raising exceptions, add the name.
We manage the lifecycle the same as the pending futures.
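
The bookkeeping described above can be sketched with the standard library. This is an illustration under assumptions, not the real `SubprocPool`: the class `NamedPool` is hypothetical, it uses threads rather than subprocesses, and wrapping the failure in a `RuntimeError` is one possible way to attach the name.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class NamedPool:
    """Hypothetical pool that remembers an optional name for each pending job."""

    def __init__(self, max_workers=2):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._names = {}  # pending inner future -> optional job name

    def submit(self, fn, *args, name=None):
        inner = self._executor.submit(fn, *args)
        outer = Future()
        # Record the name before registering the callback that consumes it.
        self._names[inner] = name
        inner.add_done_callback(lambda done: self._finish(done, outer))
        return outer

    def _finish(self, inner, outer):
        # The name lives exactly as long as the pending future does.
        name = self._names.pop(inner, None)
        exc = inner.exception()
        if exc is None:
            outer.set_result(inner.result())
        elif name is not None:
            wrapped = RuntimeError(f"job {name!r} failed: {exc}")
            wrapped.__cause__ = exc
            outer.set_exception(wrapped)
        else:
            outer.set_exception(exc)

    def shutdown(self):
        self._executor.shutdown(wait=True)


def failing_job():
    raise ValueError("bad kernel")


pool = NamedPool()
failed = pool.submit(failing_job, name="triton_kernel_0")
error = failed.exception(timeout=5)
answer = pool.submit(lambda: 41 + 1).result(timeout=5)
pool.shutdown()
```

Keeping the names in a dictionary keyed by the pending future means an entry is dropped at the same moment the future completes, so the two structures cannot drift apart.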

Added a specific testcase to make sure this logs correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161452
Approved by: https://github.com/masnesral
2025-08-27 21:35:31 +00:00
06ddaf1e0a Revert "Back out "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)" (#160999)" (#161625)
This reverts commit a818fa77e3a72271f144514ef349c5a666313205.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161625
Approved by: https://github.com/atalman
2025-08-27 21:34:12 +00:00
26d0ff1cba [AOTI-FX] Enhance launch grid FloorDiv replacement using sympy.together. (#161582)
# Feature
2d launch grids with dynamic shapes can contain sympy expressions like `floor(x / 128 + y / 128)`. This breaks the dynamic shapes tracer which only supports `FloorDiv`, and not `floor`.  To handle this case, call `sympy.together` prior to pattern matching to convert this to `floor((x + y) / 128)`. Then, we can recognize the pattern and map it to `FloorDiv(x + y, 128)`.
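
The rewrite rests on the identity `floor(x / 128 + y / 128) == (x + y) // 128` for integers. The sketch below checks it in exact rational arithmetic using only the standard library; it is a numeric sanity check, not the Inductor pattern matcher, and the function names are hypothetical.

```python
import math
from fractions import Fraction


def floor_of_sum(x, y, d=128):
    """floor(x/d + y/d), evaluated exactly with rationals."""
    return math.floor(Fraction(x, d) + Fraction(y, d))


def floor_div_of_sum(x, y, d=128):
    """FloorDiv(x + y, d), the form the tracer supports."""
    return (x + y) // d


def identity_holds(limit=300, d=128):
    """Check the identity for all 0 <= x, y < limit."""
    return all(
        floor_of_sum(x, y, d) == floor_div_of_sum(x, y, d)
        for x in range(limit)
        for y in range(limit)
    )
```

Combining the fractions first matters: flooring each term separately gives a different answer, for example `100 // 128 + 100 // 128` is 0 while `(100 + 100) // 128` is 1.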

# Test plan
Added a custom Triton test exposing this. The test calls a 2d autotuned kernel with dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161582
Approved by: https://github.com/nandesuka
2025-08-27 21:31:28 +00:00
c36d18d7e8 [rfc] aot precompile with custom backend api (#161383)
Adds a new feature to torch.compile(fullgraph=True) that "aot_compile"s a function with given example inputs.

On the user side it should look like:

```
def foo(x, y):
    return x + y

compiled_fn = torch.compile(fullgraph=True).aot_compile(((torch.randn(3, 4), torch.randn(3, 4)), {}))
```

This is different from the traditional `torch.compile` workflow, where the compiled object is a drop-in replacement for the original eager model:
```
tensor input -> torch.compile() -> tensor output (and populates the cache entry)
```
`aot_compile` will instead return a compiled function as result, and it's purely functional and doesn't populate the compile cache entry in dynamo:
```
tensor input -> aot_compile() -> compiled function
```
The aot compiled function will be savable and loadable on disk as well:
```
torch.compile(fullgraph=True).aot_compile(...).save_compiled_function('my/path')
compiled_fn = torch.compiler.load_compiled_function("my/path")
```

Right now we treat the compiler backend as a black box, and it needs to implement the following interface to make compile artifacts serializable:
```
class SerializableCallable:
    def save_compile_artifacts(): ....
    def load_compile_artifacts(): ....
```
We haven't implemented this for inductor yet, but this shouldn't be an issue since we gate this feature through `torch._dynamo.config.aot_compile` (which defaults to False); the inductor implementation is left to a follow-up PR.

Differential Revision: [D80914270](https://our.internmc.facebook.com/intern/diff/D80914270/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161383
Approved by: https://github.com/tugsbayasgalan
2025-08-27 21:26:25 +00:00
014b98dd09 Revert "Add inductor backend to device interface; make minifier_tests more device agnostic (#151314)"
This reverts commit 77bc959fe122bfd131e339ca36cab445a1860806.

Reverted https://github.com/pytorch/pytorch/pull/151314 on behalf of https://github.com/atalman due to sorry change is faling internally ([comment](https://github.com/pytorch/pytorch/pull/151314#issuecomment-3229774015))
2025-08-27 21:21:19 +00:00
38ed57d446 Revert "Updates to CuTe DSL template renderer (#161117)"
This reverts commit 1750cc80374a9dd22fc26701c0602ae11a62baf0.

Reverted https://github.com/pytorch/pytorch/pull/161117 on behalf of https://github.com/atalman due to will need to revert to unblock revert of https://github.com/pytorch/pytorch/pull/151314 ([comment](https://github.com/pytorch/pytorch/pull/161117#issuecomment-3229754295))
2025-08-27 21:17:25 +00:00
007935a802 [cpp_wrapper] Swap to new PyBind11 simple GIL header (#161063)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161063
Approved by: https://github.com/Skylion007
ghstack dependencies: #160754
2025-08-27 21:15:01 +00:00
cbc53b7696 Update pybind11 submodule to 3.0.1 (#160754)
Upgrade to PyBind11 v3. This allows us to strip out our own (possibly broken?) handling of the C++ ABI when building extensions, in favor of the more-complete PyBind11 internal handling.

Fixes a few test failures due to https://github.com/pybind/pybind11/issues/5774, which effectively makes the `__qualname__` attribute of functions platform-dependent.

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160754
Approved by: https://github.com/Skylion007
2025-08-27 21:15:01 +00:00
624bc36163 Ensure the comment id is always passed in to trymerge (#161558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161558
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-08-27 19:53:28 +00:00
06c7516994 [BE] Upgrade XPU support package to 2025.2 (#158733)
Includes the following changes:

- Add XPU support package 2025.2 build and test in CI for both Linux and Windows
- Keep XPU support package 2025.1 build in CI to ensure nothing breaks until PyTorch 2.9 release
- Upgrade XPU support package from 2025.1 to 2025.2 in CD for both Linux and Windows
- Rename Linux CI job name & image name to n & n-1
- Update XPU runtime pypi packages dependencies of CD wheels
- Remove deprecated support package version docker image build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158733
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-08-27 19:33:38 +00:00
2efcf9d081 [dynamo] Fix graph break registry loading in fbcode (#161550)
Summary: Add `torch/_dynamo/graph_break_registry.json` as an internal dependency. Minor related fixes.

Test Plan:
Test on OSS.

Rollback Plan:

Differential Revision: D81078973

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161550
Approved by: https://github.com/Lucaskabela, https://github.com/anijain2305
2025-08-27 19:25:15 +00:00
443452ca2f Remove test since it ooms on CI (#161644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161644
Approved by: https://github.com/BoyuanFeng
2025-08-27 19:11:29 +00:00
47ecd2042f [ONNX] Fix index_put_ usage (#161263)
Summary:
It's hard to understand how it's working in most of our models, but in general it looks like `aten::copy_` is replaced incorrectly.
There are two schemas for `aten::copy_`:
1. `aten::copy_.Tensor(Tensor(a!) self, Tensor other) -> Tensor(a!)`
2. `aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)`

According to the logic in the comments we don't need one of the parameters for `aten::index_put_`.

It seems the logic was inferred from ordinary `aten::copy`, where there can be a third parameter, the `non_blocking` flag.

Depending on the execution environment, the sliced copy can be replaced either by the first schema or by the second schema with the default parameter explicitly set to `False`.

If the first schema is selected, it leads to a crash (which is easy to catch in our prod env). If the second schema is selected, there is no crash, but the third parameter is treated as the `accumulate` parameter of the `index_put_` function, which doesn't make sense.

So, in any case, usage of the third parameter must be removed from the `aten::copy_` replacement.

For more details, check this post:
https://fb.workplace.com/groups/1405155842844877/permalink/25337687649165028/

Test Plan:

The test fails in the production environment only.
In the test env the `non_blocking` flag is mapped as `False` to the `accumulate` flag, which doesn't cause the test to fail, but makes no sense in terms of flag mapping.

The export works without errors; before the fix it was failing with an out-of-bounds vector index access, like this:
```
   1095     _C._jit_onnx_log("Torch IR graph at exception: ", graph)
File ~/.bento/kernels/bento_kernel_gaia_ml/1578/bento_kernel_gaia_ml_binary-inplace#link-tree/torch/onnx/utils.py:636, in _optimize_graph(graph, operator_export_type, _disable_torch_constant_prop, fixed_batch_size, params_dict, dynamic_axes, input_names, module)
    629 _C._jit_pass_lower_all_tuples(graph)
    630 # in _jit_pass_onnx, symbolic functions are called for each node for conversion.
    631 # However, there are nodes that cannot be converted without additional context.
    632 # For example, the number of outputs from split (and whether it is static or dynamic) is unknown
    633 # until the point where it is unpacked by listUnpack node.
    634 # This pass does a preprocess, and prepares the nodes such that enough context can be received
    635 # by the symbolic function.
--> 636 _C._jit_pass_onnx_remove_inplace_ops_for_onnx(graph, module)
    637 _C._jit_pass_onnx_preprocess(graph)
    639 # onnx does not support tuples, so try to remove them
RuntimeError: vector::_M_range_check: __n (which is 2) >= this->size() (which is 2)
```

The test script:
```
import torch as th
import tempfile

class CopyTest(th.nn.Module):
    def forward(
        self,
        input_th: th.Tensor
    ):
        to_fill = th.ones((3, 3))
        to_fill[:, 0] = input_th[:, 0]
        return to_fill

m = CopyTest()

test_tensor = th.zeros((3, 3))

with tempfile.NamedTemporaryFile() as f:
    th.onnx.export(
            m,
            (test_tensor,),
            f,
            export_params=True,
            opset_version=17,
            do_constant_folding=True,
            input_names=["input"],
            output_names=["features"],
            dynamo=False,
        )
```

The exported model test:
```
import torch
import onnx
import onnxruntime

model_name = '/home/ironsided/test_model.onnx'
onnx_model = onnx.load(model_name)
onnx.checker.check_model(onnx_model)

example_inputs = (torch.zeros(3, 3),)

onnx_inputs = [tensor.numpy(force=True) for tensor in example_inputs]
print(f"Input length: {len(onnx_inputs)}")
print(f"Sample input: {onnx_inputs}")

ort_session = onnxruntime.InferenceSession(
    model_name, providers=["CPUExecutionProvider"]
)

onnxruntime_input = {input_arg.name: input_value for input_arg, input_value in zip(ort_session.get_inputs(), onnx_inputs)}

# ONNX Runtime returns a list of outputs
onnxruntime_outputs = ort_session.run(None, onnxruntime_input)[0]

print(onnxruntime_outputs)
```

The produced result is correct:
```
Input length: 1
Sample input: [array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]], dtype=float32)]
[[0. 1. 1.]
 [0. 1. 1.]
 [0. 1. 1.]]
```

Rollback Plan:

Differential Revision: D80797028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161263
Approved by: https://github.com/justinchuby, https://github.com/jermenkoo
2025-08-27 18:53:13 +00:00
1750cc8037 Updates to CuTe DSL template renderer (#161117)
# Summary
This adds a few more render functions for template writers, specifically get_output and modification. The reasons become clearer in the next PR in this stack.

<img width="1645" height="364" alt="Screenshot 2025-08-21 at 1 48 50 PM" src="https://github.com/user-attachments/assets/2d508fda-4273-43ef-9edf-086e592e9249" />

The majority of the new code is around the OpOverrides for CuTe DSL. It is a lot to test, and most of the actual testing I have been doing is via score_mods passed to flash_attention at the next layer of this stack.

A bunch of score mods that Claude and I came up with to exercise the actual ops:
``` Py

import math  # used by sine_wave_bias

import torch


def causal_mask(score, b, h, q_idx, kv_idx):
    """Causal attention mask."""
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

def relative_bias(score, b, h, token_q, token_kv):
    """Relative position bias."""
    return score + torch.abs(token_q - token_kv)

def relative_bias_v2(score, b, h, token_q, token_kv):
    """Relative position bias with factor of 2."""
    return score + 2 * torch.abs(token_q - token_kv)

def times_two(score, b, h, q_idx, kv_idx):
    """Simple score modification that doubles the score."""
    return score * 2

def alibi_bias(score, b, h, q_idx, kv_idx):
    """ALiBi (Attention with Linear Biases) - used in some modern models."""
    # Different slopes for different heads
    slope = 2 ** (-8 * (h + 1) / 8)  # Simplified version
    return score - slope * torch.abs(q_idx - kv_idx)

def sliding_window(score, b, h, q_idx, kv_idx, window_size=256):
    """Sliding window attention - only attend to nearby tokens."""
    return torch.where(
        torch.abs(q_idx - kv_idx) <= window_size,
        score,
        float("-inf")
    )

def block_diagonal(score, b, h, q_idx, kv_idx, block_size=64):
    """Block diagonal attention pattern."""
    q_block = q_idx // block_size
    kv_block = kv_idx // block_size
    return torch.where(q_block == kv_block, score, float("-inf"))

def additive_bias(score, b, h, q_idx, kv_idx):
    """Test simple addition with position-based bias."""
    return score + (q_idx + kv_idx) * 0.01

def multiplicative_decay(score, b, h, q_idx, kv_idx):
    """Test multiplication with distance-based decay."""
    distance = torch.abs(q_idx - kv_idx)
    return score * torch.exp(-0.1 * distance)

def sine_wave_bias(score, b, h, q_idx, kv_idx):
    """Test trigonometric functions."""
    return score + 0.1 * torch.sin(2 * math.pi * (q_idx - kv_idx) / 64)

def log_distance_penalty(score, b, h, q_idx, kv_idx):
    """Test logarithmic operations."""
    distance = torch.abs(q_idx - kv_idx).float()
    return score - torch.log(1 + distance)

def alternating_mask(score, b, h, q_idx, kv_idx):
    """Test with alternating pattern - good for branch prediction."""
    return torch.where((q_idx + kv_idx) % 2 == 0, score, float("-inf"))

def head_specific_pattern(score, b, h, q_idx, kv_idx):
    """Different behavior per attention head."""
    even_head = h % 2 == 0
    causal = q_idx >= kv_idx
    return torch.where(even_head & causal, score, float("-inf"))

def sparse_strided(score, b, h, q_idx, kv_idx, stride=4):
    """Sparse attention with strided pattern."""
    return torch.where(
        (kv_idx % stride == 0) | (q_idx == kv_idx),
        score,
        float("-inf")
    )

def causal_with_global(score, b, h, q_idx, kv_idx):
    """Causal mask but first few tokens are globally attended."""
    is_causal = q_idx >= kv_idx
    is_global = kv_idx < 4
    return torch.where(is_causal | is_global, score, float("-inf"))

def dilated_attention(score, b, h, q_idx, kv_idx, dilation_rate=2):
    """Dilated attention pattern - exponentially increasing gaps."""
    distance = torch.abs(q_idx - kv_idx)
    is_attended = (distance == 0) | ((distance > 0) & ((distance & (distance - 1)) == 0))
    return torch.where(is_attended, score, float("-inf"))

```
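
The `dilated_attention` mask above relies on the `distance & (distance - 1)` bit trick to keep only distances that are zero or a power of two. A minimal pure-Python sketch of the same predicate on plain integers follows; the helper name `is_attended` and the 0..32 window are ours, chosen only for illustration:

```python
def is_attended(q_idx: int, kv_idx: int) -> bool:
    """Same predicate as dilated_attention, evaluated on plain Python ints."""
    distance = abs(q_idx - kv_idx)
    # distance & (distance - 1) clears the lowest set bit, so the result is
    # zero exactly when distance has at most one bit set (0 or a power of two).
    return distance == 0 or (distance & (distance - 1)) == 0


# Key positions that query position 16 attends to within a 0..32 window.
attended = [kv for kv in range(33) if is_attended(16, kv)]
print(attended)
```

The attended positions thin out as the distance grows, which is the "exponentially increasing gaps" pattern the docstring describes.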

Example outputs:
```
[Test Suite]
Config: batch=4, heads=32, seq_q=8192, seq_kv=8192, dim=128

[Test 1: none]
[No score_mod, flash='enabled'] Found flash_attncute: True
[No score_mod, flash='disabled'] Found flash_attncute: False
✓ Outputs match between flash enabled/disabled
✓ Output matches eager SDPA (rtol=0.001, atol=0.001)

[Test 2: causal]
[With score_mod, flash='enabled'] Found flash_attncute: True
[With score_mod, flash='disabled'] Found flash_attncute: False
✗ Outputs differ between flash modes: Tensor-likes are not close!

Mismatched elements: 17879 / 134217728 (0.0%)
Greatest absolute difference: 0.0078125 at index (0, 15, 15, 60) (up to 0.001 allowed)
Greatest relative difference: 2.5 at index (3, 22, 153, 126) (up to 0.001 allowed)

[Test 3: rel_bias]
[With score_mod, flash='enabled'] Found flash_attncute: True
[With score_mod, flash='disabled'] Found flash_attncute: False
✗ Outputs differ between flash modes: Tensor-likes are not close!

Mismatched elements: 12836 / 134217728 (0.0%)
Greatest absolute difference: 0.015625 at index (0, 3, 2775, 84) (up to 0.001 allowed)
Greatest relative difference: 11.8125 at index (3, 28, 4095, 76) (up to 0.001 allowed)

[Test 4: rel_bias_v2]
```

These runs use bfloat16, and there are no major differences. The list of pointwise ops here isn't exhaustive, but it gives fairly broad coverage.
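
The greatest absolute differences reported above (0.0078125 and 0.015625) are exact powers of two. A rough sketch of why that points to single-step bfloat16 rounding, assuming the affected outputs have magnitude between 1 and 4 (the log does not state their magnitude) and using a hypothetical helper `bf16_ulp`:

```python
import math


def bf16_ulp(x: float) -> float:
    """Spacing between adjacent bfloat16 values near x (normal range only)."""
    # bfloat16 stores 7 explicit mantissa bits, so values in
    # [2**e, 2**(e + 1)) are spaced 2**(e - 7) apart.
    _, exp = math.frexp(abs(x))  # abs(x) == m * 2**exp with 0.5 <= m < 1
    return 2.0 ** (exp - 1 - 7)


print(bf16_ulp(1.5))  # spacing for values in [1, 2)
print(bf16_ulp(3.0))  # spacing for values in [2, 4)
```

Under that assumption, the two reported differences equal one bfloat16 step in the [1, 2) and [2, 4) ranges respectively, which would explain why they exceed the 0.001 tolerance without indicating a logic error.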

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161117
Approved by: https://github.com/mlazos
2025-08-27 18:39:09 +00:00
ec585ceab4 [inductor] structured-log graph execution order + test (#160448)
Summary:

- Emit a structured trace per compiled graph execution to reconstruct execution order in TLParse.
- Adds debug.log_graph_execution(name) called from `CompiledFxGraph.__call__`, producing an artifact named inductor_graph_execution with payload {"graph": "graph_<id>"}.

Testing:
- Add inline test to verify structure and output
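
As a rough illustration of how a consumer could rebuild the execution order from these artifacts, the sketch below assumes each `inductor_graph_execution` payload is available as one JSON string, in emission order. The sample records and the helper `execution_order` are hypothetical and are not part of TLParse:

```python
import json

# Hypothetical payloads, one per compiled graph execution, in emission order.
records = [
    '{"graph": "graph_0"}',
    '{"graph": "graph_1"}',
    '{"graph": "graph_0"}',
]


def execution_order(lines):
    """Return graph names in the order their executions were logged."""
    return [json.loads(line)["graph"] for line in lines]


order = execution_order(records)
print(order)
```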

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160448
Approved by: https://github.com/xmfan
2025-08-27 18:12:46 +00:00
16ce6a4aad [hop] move insert_deferred_runtime_asserts under subtracer (#161416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161416
Approved by: https://github.com/pianpwk
ghstack dependencies: #160548
2025-08-27 17:43:02 +00:00
3345a7ff8a [VLLM][FLASHINFER UPDATE] (#161537)
The vLLM x torch build fails because the flashinfer build fails. We found that the vLLM team recently changed its pointer to flashinfer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161537
Approved by: https://github.com/huydhn
2025-08-27 17:41:26 +00:00
55e6ea105c Fix running the benchmark jobs twice (#161619)
I made a mistake in https://github.com/pytorch/pytorch/pull/160935 by removing this condition check.  This ran the benchmark job twice for scheduled jobs, e.g. https://github.com/pytorch/pytorch/actions/runs/17266546494.  This was missed during testing because `pull_request` and `workflow_dispatch` were working fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161619
Approved by: https://github.com/anijain2305
2025-08-27 17:18:10 +00:00
a3fa1b8c2a Set USE_NVSHMEM only if USE_DISTRIBUTED is set (#161451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161451
Approved by: https://github.com/eqy
2025-08-27 17:11:19 +00:00
620d52e882 Fix sort doc error (#161539)
Fixes #129298. Updated torch.sort documentation so that the 'stable' parameter is a Keyword Argument. This is how it's implemented in PyTorch.
@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161539
Approved by: https://github.com/soulitzer
2025-08-27 17:01:53 +00:00
69c7b16e6f Revert "Back out "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)" (#161002)"
This reverts commit a03cc53e6f6e2fe67316cb8c74c25f5b953f445b.

Reverted https://github.com/pytorch/pytorch/pull/161002 on behalf of https://github.com/guangyey due to This PR breaks CI TestCudaMallocAsync::test_allocator_settings ([comment](https://github.com/pytorch/pytorch/pull/161002#issuecomment-3228980897))
2025-08-27 16:52:22 +00:00
379ebdaf5e [OrderedDict] Implement OrderedDict.popitem(last=...) (#155153)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155153
Approved by: https://github.com/anijain2305
ghstack dependencies: #160156, #155072, #155152
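
For reference, this is the standard-library behavior of `OrderedDict.popitem(last=...)` in eager Python, shown as a small sketch. It only illustrates the semantics that the commit title names; it does not show the Dynamo implementation:

```python
from collections import OrderedDict

d = OrderedDict(a=1, b=2, c=3)

newest = d.popitem()            # last=True is the default: pops in LIFO order
oldest = d.popitem(last=False)  # last=False pops in FIFO order
remaining = list(d.items())

print(newest, oldest, remaining)
```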
2025-08-27 15:46:40 +00:00
7c8f049d54 [OrderedDict] Implement OrderedDict.move_to_end(key, last=False) (#155152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155152
Approved by: https://github.com/anijain2305
ghstack dependencies: #160156, #155072
2025-08-27 15:46:40 +00:00
e3718c4855 [dict] Implement dict.__ior__ and fix return type in dict.__or__ (#155072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155072
Approved by: https://github.com/anijain2305
ghstack dependencies: #160156
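
For reference, a small sketch of how `dict.__or__` and `dict.__ior__` behave in eager Python 3.9+. It illustrates the operators named in the commit title only; which return type the commit fixes is not stated here, so the type check below is just the plain-dict case:

```python
left = {"a": 1, "b": 2}
right = {"b": 20, "c": 3}

# __or__ builds a new dict; on key collisions the right operand wins.
merged = left | right

# __ior__ updates the left operand in place, so existing aliases see the change.
alias = left
left |= right

print(merged, alias is left, type(merged).__name__)
```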
2025-08-27 15:46:40 +00:00
2d44969bbd Wrap class definitions in set_fullgraph(False) in test_dict/test_ordered_dict (#160156)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160156
Approved by: https://github.com/zou3519
2025-08-27 15:46:40 +00:00
a2af6a9d6b Run WoArm64 CI every 4 hours (#161504)
Since WoArm64 isn’t part of CI yet, this PR schedules the workflow to increase visibility and insights. It will execute every 4 hours and still support manual runs via the `ciflow/win-arm64` tag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161504
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-08-27 15:46:34 +00:00
28af843ee0 Revert "Fix index_add for int64 input + zerodim index (#161511)"
This reverts commit d51486616cb3fe54bc298669a88059be56c1fb22.

Reverted https://github.com/pytorch/pytorch/pull/161511 on behalf of https://github.com/clee2000 due to broke test_indexing.py::TestIndexingCPU::test_index_add_zerodim_index_floating_alpha_cpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/17257089116/job/48971728595) [HUD commit link](d51486616c) on dynamo? ([comment](https://github.com/pytorch/pytorch/pull/161511#issuecomment-3228705842))
2025-08-27 15:38:11 +00:00
378edb047f [Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).

Changes Included

- Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.

Fixes #147282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos
2025-08-27 14:49:20 +00:00
d2db6c86b0 [OpenReg] Add Develop Notes for Integrating New Backend into PyTorch (#158644)
To facilitate the integration of new backends, we plan to publish a new development note that details all the key components, hoping to speed up the development of other accelerators.

This PR is the beginning of this note and covers operator registration; we will gradually improve it and keep it in sync with OpenReg's code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158644
Approved by: https://github.com/albanD
2025-08-27 14:47:25 +00:00
a3c1cbdbc6 [dynamo][higher order ops] Refactor for out spec (#161354)
Preparing for the next PR to add more info in the output spec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161354
Approved by: https://github.com/zou3519
2025-08-27 14:41:18 +00:00
9632f4ea9f [CD] [aarch64] Add CUDA 13.0 sbsa nightly build (#161257)
https://github.com/pytorch/pytorch/issues/159779

CUDA SBSA build for CUDA 13.0
1. Supported archs: sm_80 to sm_120. Including support for Thor (sm_110), SPARK (sm_121), GB300 (sm_103).
"This release adds support of SM110 GPUs for arm64-sbsa on Linux." from 13.0 release notes https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
2. Use -compress-mode=size for binary size reduction: the 13.0 wheel is 2.18 GB, compared with 3.28 GB for 12.9, a saving of 1.1 GB (~33.5% smaller).
3. Refactored the libs_to_copy list with common libs, and version_specific_libs.
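
The wheel size figures in item 2 can be checked with a few lines of arithmetic. The variable names are ours, and the sizes are the ones quoted above:

```python
size_12_9_gb = 3.28  # CUDA 12.9 wheel size quoted above
size_13_0_gb = 2.18  # CUDA 13.0 wheel size quoted above

saved_gb = size_12_9_gb - size_13_0_gb
saved_pct = 100 * saved_gb / size_12_9_gb

print(f"{saved_gb:.2f} GB saved, {saved_pct:.1f}% smaller")
```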

TODO: add the other CUDA archs in the existing support matrix of x86 to SBSA build as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161257
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2025-08-27 14:38:07 +00:00
3d406429b0 [dynamo][vllm] Support typing.get_type_hints (#161362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161362
Approved by: https://github.com/Skylion007, https://github.com/StrongerXi, https://github.com/jansel
2025-08-27 09:55:31 +00:00
9a12bab0d3 Add debug handle to inductor provenance tracking (#161110)
Summary:
Use debug handle on kernel names to distinguish different calls to the same kernel.

Previous kernel name: kernel_name

New kernel name: kernel_name:debug_handle

We add the debug handle to the tlparse artifacts: `inductor_provenance_tracking_node_mappings` and `inductor_provenance_tracking_kernel_stack_traces`.

We also add debug handles in the comments of the generated code so we can map to them in the provenance tracking highlighter tool: https://github.com/pytorch/tlparse/pull/134

Example output code is below. If a kernel doesn't have a debug handle, the `[Provenance debug handles]` comment line will not be written.

```
        # Topologically Sorted Source Nodes: [y, z], Original ATen: [aten.addmm, aten.gelu]
        # [Provenance debug handles] triton_poi_fused_addmm_gelu_2:3
        stream0 = get_raw_stream(0)
        triton_poi_fused_addmm_gelu_2.run(buf4, primals_5, 300, stream=stream0)
```

The debug handles will also be used by downstream profilers such as zoomer.
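
A minimal sketch of how a tool might read the `[Provenance debug handles]` comment back out of generated code. It assumes handles are integers and that several `kernel_name:debug_handle` pairs, if present, are separated by whitespace or commas; only the single-pair form appears in the example above. The helper `parse_debug_handles` is hypothetical:

```python
PREFIX = "# [Provenance debug handles]"


def parse_debug_handles(line: str) -> dict[str, int]:
    """Map kernel name to debug handle for one generated-code comment line."""
    stripped = line.strip()
    if not stripped.startswith(PREFIX):
        return {}
    handles = {}
    # Assumption: pairs are separated by whitespace and/or commas.
    for token in stripped[len(PREFIX):].replace(",", " ").split():
        kernel, sep, handle = token.rpartition(":")
        if sep:  # skip tokens that are not kernel_name:debug_handle
            handles[kernel] = int(handle)
    return handles


example = "        # [Provenance debug handles] triton_poi_fused_addmm_gelu_2:3"
parsed = parse_debug_handles(example)
print(parsed)
```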

Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing
```

Rollback Plan:

Differential Revision: D78994959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161110
Approved by: https://github.com/angelayi
2025-08-27 04:56:11 +00:00
d51486616c Fix index_add for int64 input + zerodim index (#161511)
Fixes #161446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161511
Approved by: https://github.com/malfet
2025-08-27 04:11:10 +00:00
07a4e9fea8 [benchmarks] Skip mobilenetv3_large_100 in CI for accuracy (#161570)
To keep the CI green - https://github.com/pytorch/pytorch/issues/161419

It's unclear if this is a real failure, and debugging it is non-trivial.
Skipping for now to keep the CI green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161570
Approved by: https://github.com/BoyuanFeng, https://github.com/zou3519
2025-08-27 03:44:04 +00:00
be55d7ac9e Revert "[Dynamo] Allow inlining into AO quantization modules (#152934)" (#161567)
This reverts commit 20e2ca3e29ce9eb33eef17db077696222c175764.

Fixes https://github.com/pytorch/pytorch/issues/157434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161567
Approved by: https://github.com/Lucaskabela
2025-08-27 03:33:04 +00:00
8b78ba07b1 [dynamo, nested graph breaks] add nested graph break tests (#144516)
Note: nested graph break tests (and wrapped tests) are xfailed/skipped for now - we will iteratively enable the tests as more of the nested graph break implementation is complete.

Differential Revision: [D81084809](https://our.internmc.facebook.com/intern/diff/D81084809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144516
Approved by: https://github.com/anijain2305
2025-08-27 03:00:56 +00:00
b36a20d368 Ensure large tensor int32 -> int64 indexing is enabled (#157767)
Fixes: https://github.com/pytorch/pytorch/issues/157446

I think that this delta is worth the switch from block-ptrs, especially since they are deprecated.

## Perf Summary

A is nightly and B is this diff, so `negative` means this diff improves perf.
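
To make the sign convention concrete, here is a small sketch that assumes the delta is computed as (A - B) / A on TFLOPs; the commit message does not spell out the formula. The two rows are copied from the full perf table in this commit message, and `pct_delta` is a hypothetical helper:

```python
# (attn_type, shape, TFLOPs A = nightly, TFLOPs B = this diff)
rows = [
    ("causal", "(2, 16, 32768, 16, 32768, 128)",
     432.02638235921006, 450.33161016596273),
    ("sliding_window", "(2, 16, 32768, 16, 32768, 64)",
     276.5638854539581, 262.988360558083),
]


def pct_delta(a: float, b: float) -> float:
    """Assumed convention: (A - B) / A in percent, negative when B is faster."""
    return 100.0 * (a - b) / a


for attn_type, shape, a, b in rows:
    print(f"{attn_type:15s} {shape}: {pct_delta(a, b):+.2f}%")
```

Under this assumed convention the causal row comes out negative (this diff is faster) and the sliding_window row comes out positive (this diff is slower).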

TOP 5 differences
<img width="805" height="754" alt="Screenshot 2025-08-24 at 5 49 49 PM" src="https://github.com/user-attachments/assets/aa359cdf-ee9a-427d-be72-1b9aef6f3115" />

<details>
  <summary><strong>Full perf table (click to expand)</strong></summary>

| attn_type | dtype | shape(B,Hq,M,Hkv,N,D) | TFlops Version A | TFlops Version B |
| --- | --- | --- | --- | --- |
| noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 258.38834144791923 | 258.6353685004612 |
| causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.2192450677751 | 140.12393320464972 |
| alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 122.32683823617003 | 118.51603755647925 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.48556906165314 | 137.24259849208627 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 86.59814488695922 | 84.59431398586257 |
| noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 288.52679758135764 | 292.9174195871856 |
| causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 172.25541683643277 | 172.94326459828508 |
| alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 164.40864610599826 | 165.035129576335 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 176.54876886433945 | 175.08057670028145 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 125.22491679812626 | 121.06201152859151 |
| noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 339.11952481874283 | 339.0132835601695 |
| causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 227.58583240284406 | 228.21824999409597 |
| alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 185.98569659868966 | 182.32850843255093 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 188.9495725191772 | 180.31385312481657 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 106.25789530994302 | 106.55084959448476 |
| noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 357.6430536888533 | 363.30843452247274 |
| causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 262.3241154406613 | 265.73250045488 |
| alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 249.30498953911416 | 249.35928192833785 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 224.74126243851808 | 223.71776504077988 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 168.26977014013707 | 165.47991483333809 |
| noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 382.8178701785897 | 384.34752965862685 |
| causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 308.1449710013853 | 311.0653716044644 |
| alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 251.96365252505072 | 243.92283557225903 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 226.69316232745368 | 215.22769268913356 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 153.34142545296405 | 151.9312673939401 |
| noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 396.0998000753126 | 398.35036286102473 |
| causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 333.5198415274966 | 344.6354466169716 |
| alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 310.5955933379696 | 305.66347819546 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 260.4012412689896 | 259.758666997307 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 234.13034252182635 | 227.61676497283614 |
| noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 396.17615538477196 | 401.1419104525502 |
| causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 359.98648311998414 | 360.8285563463094 |
| alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 291.97720707257736 | 281.41694809965253 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 250.1703628419691 | 238.556760291579 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 199.50782826294306 | 191.52327358439223 |
| noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 411.0632004785396 | 413.6362648405517 |
| causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 382.9404387613185 | 397.74886235657607 |
| alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 357.0998545146633 | 350.5115200772392 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 281.8033924428203 | 281.98601309215843 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 282.56595134222135 | 277.4565795466672 |
| noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 408.89838018149516 | 405.14531386840076 |
| causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 396.07662058160264 | 393.4598228299578 |
| alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 317.8822887267849 | 304.754931401036 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 265.8801304948243 | 254.22961974295112 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 227.87390579965614 | 222.19481980110393 |
| noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 427.36821778477025 | 431.3766620314935 |
| causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 410.67994346825 | 423.4666944003808 |
| alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 381.1968748374038 | 381.77668006420424 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 292.5540046358546 | 296.5439130720502 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 321.04573768858114 | 310.7423616656888 |
| noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 427.46148866769903 | 426.162091037068 |
| causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 419.75580537687347 | 421.88640120274334 |
| alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 337.3208051798903 | 327.4912454675092 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 276.5638854539581 | 262.988360558083 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 250.82791326036886 | 245.07367032501736 |
| noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 435.8055824506086 | 441.8803729460534 |
| causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 432.02638235921006 | 450.33161016596273 |
| alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 402.25525939224883 | 393.8564689669916 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 297.5337286675904 | 297.0131881135074 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 343.8697037899545 | 329.8194073407783 |
| noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 267.58912366821056 | 256.91606054118375 |
| causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 150.81723692609629 | 146.32172267858743 |
| alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 129.51029293209245 | 122.72144394093334 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 147.627656359087 | 141.68956350566188 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 87.55100546003591 | 84.91293287692788 |
| noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 299.5931492743986 | 305.884253766691 |
| causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 179.39026367843837 | 181.64741311605096 |
| alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 173.93547669282367 | 173.23972950980564 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 185.90234171599252 | 182.80844545446686 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 128.08176696266082 | 123.27722685662111 |
| noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 340.50674552770664 | 338.9071088484576 |
| causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 225.4438318650432 | 230.22899884832975 |
| alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 194.15123248528312 | 185.02793973094865 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 200.74289714108176 | 191.76606719670647 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 107.03564946728423 | 106.82432377861258 |
| noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 371.31799283918406 | 379.7555394732925 |
| causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 275.97762744310455 | 276.71106853992995 |
| alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 261.6648679783462 | 259.4127232060398 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 237.03108223577615 | 233.92710216149527 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 172.13926800371152 | 168.74390922407585 |
| noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 381.50199487767276 | 383.9043681999597 |
| causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 307.9748883093411 | 312.2403515462001 |
| alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 251.11319684705438 | 243.17870127827277 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 236.3253127246763 | 223.81250201769552 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 154.55693991756874 | 153.11360584987685 |
| noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 407.11400078586615 | 413.53709886086557 |
| causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 348.1705797722622 | 360.09771155957367 |
| alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 321.8593280850388 | 318.2882327401255 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 270.089032013835 | 268.767323026064 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 238.07324557907788 | 228.09842078362692 |
| noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 399.8172853171901 | 401.0954526332136 |
| causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 363.4387330438581 | 364.13111024232677 |
| alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 294.1752429133857 | 283.7235663368415 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 256.8389394007649 | 246.91771015606483 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 199.3378564292656 | 192.40439590901758 |
| noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 425.5150965556111 | 430.8190098707553 |
| causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 396.00437184073013 | 411.3873625655787 |
| alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 369.92803661607815 | 361.43244467343663 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 293.4277354412933 | 295.2529537595746 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 288.0208673072841 | 281.51896404878863 |
| noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 408.3005367220567 | 408.96116482298913 |
| causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 396.90095962766304 | 396.87385456176486 |
| alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 319.0534576137999 | 302.50950358107764 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 270.3334977708081 | 258.8506349486557 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 227.46824134365394 | 222.23759438128766 |
| noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 438.24247309479694 | 437.7975163205371 |
| causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 428.34012029699227 | 433.3215899950434 |
| alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 386.52672049728875 | 388.26216893354984 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 302.71976814728083 | 302.3574867306459 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 327.39760662780986 | 308.6348428844912 |
| noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 423.31308678262695 | 426.6306972137279 |
| causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 412.6983690923106 | 419.4961977664297 |
| alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 337.41003544742273 | 324.2155049126126 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 278.7755890910794 | 265.9194286636502 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 251.55678254755364 | 244.8843180141462 |
| noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 452.5930781172308 | 457.7117122300742 |
| causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 445.05676260348116 | 463.9304535499636 |
| alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 415.78302138389415 | 406.29229555271456 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 308.0311067300895 | 304.91354721414314 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 351.43943626809335 | 329.4476923070317 |
| noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 295.1801525813241 | 291.36521287398904 |
| causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 183.23250549178067 | 182.35421238887605 |
| alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 151.56832453117747 | 151.3422139154794 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 171.02111935180432 | 160.72516856727913 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 74.05765122783826 | 74.5885345035243 |
| noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 314.3587394591763 | 319.2938677773619 |
| causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 224.57002084153177 | 225.48868542008177 |
| alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.00964804143052 | 215.39576159953486 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.1174237618258 | 214.28437413525663 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 121.08920423648368 | 119.55813661872644 |
| noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 362.2193857281911 | 360.05005804275936 |
| causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 279.8840217430121 | 279.5437918286659 |
| alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 227.76617121021982 | 222.8655938229316 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 215.43141176970562 | 207.71852284994702 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 121.35588364218539 | 121.20636565046884 |
| noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 365.1545280898012 | 373.37585444987326 |
| causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 304.360119952975 | 309.1247297936263 |
| alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 287.2603904544586 | 289.25547903162595 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 257.9852675272418 | 257.59069234098115 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 188.35158496670232 | 184.24683960154857 |
| noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 389.9744911369211 | 388.43466897254166 |
| causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 345.9228295166513 | 342.63034895210126 |
| alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 279.56334658247437 | 271.2724375402088 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 245.66477202810066 | 233.49688207371258 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 170.3270720653187 | 166.23863845657382 |
| noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 400.0041140827554 | 402.11182445396497 |
| causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 363.64641830327434 | 375.9288663364792 |
| alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 341.5776139573363 | 335.1160003213424 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 281.1811770268521 | 280.21438270014005 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 247.78716118997716 | 245.3269825179633 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 403.794126680488 | 405.2353919019577 |
| causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 387.079178426863 | 385.1461762057035 |
| alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 309.7847188173431 | 298.0443968374749 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 262.4721750159666 | 250.81679725428586 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 205.70866004479979 | 202.9620839129557 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 413.380982988662 | 418.40270594263103 |
| causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 398.450064800682 | 409.6794973994029 |
| alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 372.26297458194466 | 364.44415106552196 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 293.0818569905912 | 292.85172400643984 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 296.46717085592087 | 285.76362010612763 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 419.3186786037592 | 426.08801580934437 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 408.1648467766632 | 409.4122254207817 |
| alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 329.24396020457345 | 313.5200995121138 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 274.61257504571876 | 255.7801815432177 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 232.63806001220684 | 230.03020843492314 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 435.0785891054788 | 440.39101804225345 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 424.86925312752817 | 435.18898057396825 |
| alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 393.000417896268 | 395.11543361225256 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 297.7755459218185 | 300.7208114715287 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 331.71570861760534 | 318.07127352552885 |
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 424.58602747137405 | 425.84897078470715 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 422.66607285025725 | 423.5524945535485 |
| alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 344.8625760048626 | 331.6793888458635 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 282.0787281511649 | 263.7895634445868 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 252.7301927385177 | 245.41844170037427 |
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 437.0658069164588 | 442.9101960063628 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 433.13788271434646 | 452.3873572709863 |
| alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 404.0959191546953 | 396.7077863894884 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 300.45502211883206 | 301.3439134717943 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 344.11003202413934 | 330.8897663350314 |
| noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 298.4364205341705 | 291.6793556507056 |
| causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 187.6382133139633 | 191.05409897308772 |
| alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 156.55822078636112 | 154.178925976516 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 173.47765221825162 | 169.30862508068464 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 74.5885345035243 | 74.52689061607104 |
| noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 323.12233826013045 | 328.53889207933514 |
| causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 236.75872140126316 | 235.8378325547398 |
| alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 227.17836523816675 | 226.75357076139966 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 224.07209453308036 | 224.07209453308036 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 122.85572156047981 | 121.11642183704716 |
| noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 361.3123326658092 | 360.71014086458337 |
| causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 281.5287983927017 | 281.94301754758345 |
| alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 232.7456696285686 | 226.50976826432776 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 221.5612361744038 | 214.96188822837055 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 121.38311528944315 | 120.85441868178513 |
| noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 380.2579019244734 | 389.2520157863988 |
| causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 316.95230660496924 | 317.87597790618906 |
| alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 301.07968126657323 | 298.02424098422983 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 267.2240756921594 | 267.16353549228154 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 189.82761622494257 | 186.736450261963 |
| noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 389.88665375406805 | 387.9125133037077 |
| causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 348.70619958684887 | 346.6750499749774 |
| alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 280.5472989906087 | 271.22300822012187 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 250.02397620165968 | 241.22532776331445 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 171.67817496107645 | 166.95679280483972 |
| noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 412.626880230807 | 417.60238657950777 |
| causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 374.8829313933945 | 389.4448546468815 |
| alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 353.20410434172436 | 345.7072490717473 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 292.51045924209586 | 291.66621022138287 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 251.6264062063495 | 248.45110052911542 |
| noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 404.0155784550126 | 401.90546837237514 |
| causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 384.4389015599863 | 386.9684324594344 |
| alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 313.3731284132225 | 298.17074251037894 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 264.19199737284265 | 252.8982463999916 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 207.03696315185684 | 202.86697323136772 |
| noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 428.2436763312506 | 433.45005568619536 |
| causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 411.8516531869893 | 428.2753623461049 |
| alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 384.9095037182509 | 372.90888743000744 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 303.2438915629836 | 302.05095952914337 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 301.8689122735564 | 285.0363190513223 |
| noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 423.13592231504805 | 420.3991500185611 |
| causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 407.44527331585493 | 408.5064370765247 |
| alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 330.50050996167414 | 316.8763979925965 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 274.6833786307413 | 259.86098862141324 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 232.24019584158367 | 226.52040268160232 |
| noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 444.4596314237808 | 455.99558915752266 |
| causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 437.4245561244369 | 455.98275147271966 |
| alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 397.3350686877605 | 397.88875599028063 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 308.53809114394545 | 307.1359822042007 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 331.32379843423774 | 316.85293191675646 |
| noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 422.4622274366379 | 425.0407156418684 |
| causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 420.9547052783101 | 430.33779243510276 |
| alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 345.50265346504085 | 332.094855328957 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 280.81715528243365 | 264.6543640282054 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 252.25635200421783 | 245.46235499490305 |
| noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 452.5524207341139 | 461.7512032176736 |
| causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 445.2316469907137 | 464.4523799578466 |
| alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 416.87264016717023 | 409.17124592157046 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 309.42579489389846 | 307.9734464665731 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 350.50782004300623 | 330.98959545427294 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157767
Approved by: https://github.com/Skylion007
2025-08-27 02:45:20 +00:00
de58505890 Revert "[Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)"
This reverts commit cddcaa19035d6414a351be7c7b16c47d5a0c3466.

Reverted https://github.com/pytorch/pytorch/pull/160677 on behalf of https://github.com/karthickai due to This is breaking tests on Rocm ([comment](https://github.com/pytorch/pytorch/pull/160677#issuecomment-3226541063))
2025-08-27 02:36:42 +00:00
6913529ff8 Move non inductor workflows to Python 3.9 -> 3.10 (#161182)
Related to: https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161182
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/seemethere
2025-08-27 02:32:24 +00:00
4b4cdcfe3a Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387)
- Fix Conv exhaustive.
- Fix AMD config pruning.
- Expand exhaustive test suite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159387
Approved by: https://github.com/coconutruben
2025-08-27 01:54:50 +00:00
68d395d61e [3/N][SymmMem] Expose offset field from handle (#161532)
As titled, so that kernels relying on direct pointers can use base address and `hdl.offset` to access remote memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161532
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471
2025-08-27 00:49:06 +00:00
4ed71d5412 [2/N][SymmMem] Add MemPool allocator and tests (#161471)
(Porting most of #161008)

Hooks the SymmetricMemory allocator into MemPool so that users can create symmetric tensors with regular factories such as `torch.zeros` and `torch.arange`. It also lets our ops have functional variants that create `out` tensors on symmetric memory.

For end users, this PR supports the following Python interface:
```
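import torch
import torch.distributed._symmetric_memory as symm_mem

# `device`, `numel`, and `dtype` are supplied by the caller.
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
    tensor = torch.arange(numel, dtype=dtype, device=device)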
```

Added tests for both use cases above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
2025-08-27 00:49:06 +00:00
8dd5aa9689 [1/N][SymmMem] Add offset to handle, cache on base address (#161470)
For kernels that need peer pointers directly, the rendezvous handle should let the user get the offset of the tensor with respect to the base allocation address. Hence the need to add an `offset` field to the SymmMem handle.

But we don't want to cache every handle separately just because the offsets differ, hence the search and cache logic below:

(i) At rendezvous, the search key is still `x.storage().data_ptr()`, as it is now, but the search has two parts: first a dictionary lookup, as today; if that fails, a search of `allocations_` to see whether the storage pointer falls in one of the segments. This is possible because all segments are recorded during allocation.
(ii) If this segment hasn't been rendezvoused, we rendezvous it, cache it in the `symm_mem_` map with its base address as key.
(iii) We still need to return a handle for the current tensor, with a corresponding offset. This handle will be a shallow copy of the base handle, with the offset adjusted.

Some impl details:
(i.1) If we find a matching allocation, we can immediately use the allocation base address to do a re-search in `symm_mem_`.

(iii.1) To make the handle copy shallow, we move the common information -- base ptrs, base signal pad, etc -- to a structure referenced by both handles. The structure is called `NVSHMEMPeerAllocInfo`. A copy of handle just adds one more `intrusive_ptr` to it. The handle copy constructor accepts an `offset` argument.
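
The search and cache steps (i) to (iii) can be sketched in plain Python. This is only an illustration under stated assumptions, not the C++ implementation: `Registry`, `Handle`, `PeerAllocInfo`, and `rendezvous_count` are hypothetical names, pointers are modeled as integers, and the real rendezvous is replaced by constructing a base handle.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class PeerAllocInfo:
    """Shared per-allocation state (hypothetical stand-in for NVSHMEMPeerAllocInfo)."""
    base_ptr: int
    size: int


@dataclass
class Handle:
    """A lightweight view: shared info plus a per-tensor offset."""
    info: PeerAllocInfo  # shared by reference, never copied
    offset: int = 0


class Registry:
    def __init__(self) -> None:
        # (base_ptr, size) pairs recorded at allocation time.
        self.allocations: List[Tuple[int, int]] = []
        # base_ptr -> base handle, filled lazily at rendezvous.
        self.symm_mem: Dict[int, Handle] = {}
        self.rendezvous_count = 0

    def alloc(self, base_ptr: int, size: int) -> None:
        self.allocations.append((base_ptr, size))

    def _find_segment(self, ptr: int) -> Optional[Tuple[int, int]]:
        for base, size in self.allocations:
            if base <= ptr < base + size:
                return base, size
        return None

    def rendezvous(self, storage_ptr: int) -> Handle:
        # (i) Fast path: plain dictionary lookup on the storage pointer.
        hit = self.symm_mem.get(storage_ptr)
        if hit is not None:
            return hit
        # (i) Slow path: find the segment that contains the pointer.
        seg = self._find_segment(storage_ptr)
        if seg is None:
            raise KeyError(f"{storage_ptr:#x} is not in any recorded allocation")
        base, size = seg
        # (i.1) and (ii): re-search by base address; rendezvous once if absent.
        base_handle = self.symm_mem.get(base)
        if base_handle is None:
            self.rendezvous_count += 1
            base_handle = Handle(PeerAllocInfo(base, size), offset=0)
            self.symm_mem[base] = base_handle
        # (iii) Shallow copy: same shared info, adjusted offset.
        return Handle(base_handle.info, offset=storage_ptr - base)
```

Two tensors carved out of the same segment then share one cached entry and differ only in `offset`, which is the property the shallow-copy design is after.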

Test:
Existing tests should not fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161470
Approved by: https://github.com/ngimel
2025-08-27 00:49:06 +00:00
8ff9485815 [export] Update unflattening dynamo.disable (#161306)
Summary:
Doing inline disabling causes recompiles with the reason "Cache line
invalidated because L['___stack0'] got deallocated"

Test Plan:
CI

Rollback Plan:

Differential Revision: D80816956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161306
Approved by: https://github.com/pianpwk
2025-08-27 00:27:16 +00:00
b074cbaedd [dynamo] allow resume functions to have name in both freevars and varnames (#161544)
fixes https://github.com/pytorch/pytorch/issues/161542

Differential Revision: [D81073109](https://our.internmc.facebook.com/intern/diff/D81073109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161544
Approved by: https://github.com/StrongerXi, https://github.com/anijain2305
2025-08-27 00:25:16 +00:00
80bf883d21 Replace manual cache in _python_dispatch.get_alias_info with functools.cache (#161286)
In addition to being more code, the manual cache was doing an extra dictionary lookup on each cache hit.
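
As a rough illustration of the difference (not the actual `get_alias_info` code), the sketch below contrasts a hand-rolled dictionary cache with `functools.cache`. The function names and the `_parse` helper are hypothetical stand-ins for the real computation.

```python
import functools


def _parse(schema: str) -> tuple:
    # Hypothetical stand-in for the expensive computation being cached.
    return tuple(part.strip() for part in schema.split(","))


_manual_cache = {}


def alias_info_manual(schema: str) -> tuple:
    # Manual pattern: a membership test, then a second lookup to fetch the value.
    if schema not in _manual_cache:
        _manual_cache[schema] = _parse(schema)
    return _manual_cache[schema]


@functools.cache
def alias_info_cached(schema: str) -> tuple:
    # functools.cache handles lookup and insertion internally.
    return _parse(schema)
```

Both versions return the same values; the decorated one is shorter and also exposes `cache_info()` and `cache_clear()`.
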
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161286
Approved by: https://github.com/wconstab
2025-08-27 00:17:51 +00:00
9de9d25f8d [Inductor-FX] Support custom triton kernels (#161474)
# Feature
Add support for custom Triton kernels to the FX backend. This turned out not to require any new features, except for a minor change to handle `tl.constexpr` arguments which are not part of the autotuning config.

# Caveat

This may not cover every possible case. For example, we might need more features for autotuning custom Triton code. This PR entirely skips the [custom codegen](https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/triton_kernel_wrap.py#L1034-L1039) for user-defined grid functions, but there may be edge cases requiring this logic. However, this PR seems to do a reasonable job as many of the grids end up being written into Inductor/Triton metadata and don't require special codegen.

As a follow up, I'm planning to test this against all of AOTI's custom Triton kernel tests.

# Test plan
Added a CI test using a custom Triton kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161474
Approved by: https://github.com/angelayi
2025-08-27 00:15:19 +00:00
dbc903a94a [APS IR] Minfor fix - use GetAttrKey in get_keystr to match with flat args path in unflatten (#161453)
Summary: While passing path info to [_check_input_constraints_for_graph](https://www.internalfb.com/code/fbsource/[6b5b2dc35902a26ce265e3c0ae5189a3faba1d38]/fbcode/caffe2/torch/export/unflatten.py?lines=594), GetAttrKey is used to specify path str. To match with that get_keystr should also use GetAttrKey.

Test Plan:
Existing tests

```
buck run mode/opt caffe2/test:test_export -- -r unflatten
```

```
Ran 413 tests in 204.533s

OK (skipped=1, expected failures=13)
```

Rollback Plan:

Differential Revision: D80984083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161453
Approved by: https://github.com/tugsbayasgalan
2025-08-27 00:05:20 +00:00
1b34e04485 Revert "Update pybind11 submodule to 3.0.1 (#160754)"
This reverts commit 660b0b8128181d11165176ea3f979fa899f24db1.

Reverted https://github.com/pytorch/pytorch/pull/160754 on behalf of https://github.com/atalman due to please see https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226051449 ([comment](https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226078102))
2025-08-26 23:35:22 +00:00
1ce423274d Revert "[cpp_wrapper] Swap to new PyBind11 simple GIL header (#161063)"
This reverts commit 74c4c758afa8c28162f00a456c185552e1159fd3.

Reverted https://github.com/pytorch/pytorch/pull/161063 on behalf of https://github.com/atalman due to sorry broke vllm tests please see https://github.com/pytorch/pytorch/pull/160754#issuecomment-3226051449 ([comment](https://github.com/pytorch/pytorch/pull/161063#issuecomment-3226065212))
2025-08-26 23:31:23 +00:00
4e630f0629 Revert "[Inductor] Update Outer Reduction Heuristic (#159093)"
This reverts commit ca9fe0107e165a4a4147325ff6d34235ebde447f.

Reverted https://github.com/pytorch/pytorch/pull/159093 on behalf of https://github.com/PaulZhang12 due to Addressing internal implications then relanding ([comment](https://github.com/pytorch/pytorch/pull/159093#issuecomment-3225942525))
2025-08-26 22:37:56 +00:00
cddcaa1903 [Inductor] Add DeviceAssert op to enable device-side assertion in torch.compile (#160677)
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).

Changes Included

- Implemented the device_assert op, which overrides has_side_effect to return True so that dead code elimination does not remove it.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.
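
Why `has_side_effect` matters can be shown with a toy dead code elimination pass. This is a minimal sketch and not Inductor's implementation: `Node` and `eliminate_dead_code` are hypothetical, and the graph is a flat list in topological order.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    inputs: List[str] = field(default_factory=list)
    side_effect: bool = False

    def has_side_effect(self) -> bool:
        return self.side_effect


def eliminate_dead_code(nodes: List[Node], outputs: Set[str]) -> List[Node]:
    """Keep nodes that feed an output or that report a side effect."""
    live = set(outputs)
    kept: List[Node] = []
    # Walk in reverse topological order so users are seen before producers.
    for node in reversed(nodes):
        if node.name in live or node.has_side_effect():
            kept.append(node)
            live.update(node.inputs)
    kept.reverse()
    return kept
```

An assert node has no users, so without the side-effect flag it would be dropped together with the condition it checks.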

Fixes #147282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos
2025-08-26 22:33:23 +00:00
1e4dfeeb06 Add early_stop kwarg to torch.utils.checkpoint (#160781)
We already have a context manager "set_checkpoint_early_stop". This PR adds a kwarg that toggles the same setting.

It is also useful to have a kwarg version of the setting in addition to the context manager, because it is annoying to apply a context manager when the AC is being applied via CheckpointWrapper.

Similar to the "debug" kwarg and the corresponding "set_checkpoint_debug_enabled" context manager, the context manager defaults to None and overrides the local setting when non-None.
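
The precedence rule described above can be sketched without PyTorch. The names `set_early_stop` and `resolve_early_stop` are hypothetical, and the default of `True` for the local setting is an assumption made for illustration; the sketch only shows that a non-None context-manager value wins over the per-call kwarg.

```python
import contextlib
from typing import Iterator, Optional

# None means "no override from the context manager".
_global_early_stop: Optional[bool] = None


@contextlib.contextmanager
def set_early_stop(enable: bool) -> Iterator[None]:
    """Temporarily override the per-call setting."""
    global _global_early_stop
    prev = _global_early_stop
    _global_early_stop = enable
    try:
        yield
    finally:
        _global_early_stop = prev


def resolve_early_stop(local_early_stop: bool = True) -> bool:
    """Return the effective setting: the override if set, else the kwarg."""
    if _global_early_stop is not None:
        return _global_early_stop
    return local_early_stop
```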

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160781
Approved by: https://github.com/tianyu-l
2025-08-26 22:32:35 +00:00
4d078cfc4e [fx] Add is_fx_symbolic_tracing flag (#161385)
Fixes https://github.com/pytorch/pytorch/issues/135276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161385
Approved by: https://github.com/pianpwk
2025-08-26 22:26:27 +00:00
da838f65af [ONNX] Drop draft_export in exporter API (#161454)
If the ONNX exporter falls back to draft_export with big models, the export takes forever for users and may spam the printout, which keeps users from seeing their stack trace with strict=False.

We could consider making another API for draft_export as a debugging tool, or combining it with report=True when "model is small"?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161454
Approved by: https://github.com/justinchuby
2025-08-26 22:13:43 +00:00
cde54fe4e9 fix-unpin-memory-tensor-param (#160992)
Fixes #160983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160992
Approved by: https://github.com/ngimel
2025-08-26 21:55:25 +00:00
e06d1d6610 [BE] Improve torch.inference_mode docs and error message (#161164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161164
Approved by: https://github.com/sfc-gh-sbekman, https://github.com/janeyx99
2025-08-26 20:58:56 +00:00
b2db293abc [ROCm] No-fence global reduce (#161180)
This change removes need for fences in global_reduce by converting the stores to reduce_buffer[] into atomics+return. This is crucial for perf in architectures with split caches (e.g. MI300), where fences are inherently costly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161180
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-26 20:43:59 +00:00
6686974ddd Revert "[dynamo, nested graph breaks] add nested graph break tests (#144516)"
This reverts commit 9a756c2d710a0680bac93ab0b42db519ec2dc6cf.

Reverted https://github.com/pytorch/pytorch/pull/144516 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/144516#issuecomment-3225659358))
2025-08-26 20:40:17 +00:00
eqy
3d82256a86 [FP8][cuBLAS][SM100] cuBLAS doesn't support rowwise-scaling on sm110 or sm120 either (#161236)
See also #160693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161236
Approved by: https://github.com/Skylion007
2025-08-26 20:40:11 +00:00
a4fb65701b Revert "[dynamo, nested graph breaks] support very simple nested graph breaks (#159329)"
This reverts commit 8dab6d4c414bf997297804008c3da893e69cd51f.

Reverted https://github.com/pytorch/pytorch/pull/159329 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/159329#issuecomment-3225617445))
2025-08-26 20:24:10 +00:00
6afd766401 Revert "[dynamo, nested graph breaks] support nested graph breaks x context managers (#159678)"
This reverts commit 02fa5bf6d80fa4baa6bb6dd2fa6a16d88852da91.

Reverted https://github.com/pytorch/pytorch/pull/159678 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/159678#issuecomment-3225597425))
2025-08-26 20:16:36 +00:00
a7aa480e55 Revert "[dynamo, nested graph breaks] support nested closures (#159817)"
This reverts commit ef0ef6f93f7ef6d16d71a6997b72185504acd4b6.

Reverted https://github.com/pytorch/pytorch/pull/159817 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/159817#issuecomment-3225586996))
2025-08-26 20:13:33 +00:00
9f6e1b8730 Revert "[ROCm] SDPA fix mem fault when dropout is enabled (#154864)"
This reverts commit 3caddd4daa5b1a167663c07219e065e86247ad76.

Reverted https://github.com/pytorch/pytorch/pull/154864 on behalf of https://github.com/atalman due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154864#issuecomment-3225554119))
2025-08-26 20:03:59 +00:00
caf98fde0d Revert "[dynamo, nested graph breaks] clean up comments and codegen (#160138)"
This reverts commit ac6316caaa74513cbcf3c7f9269bc23cd74749db.

Reverted https://github.com/pytorch/pytorch/pull/160138 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/160138#issuecomment-3225546707))
2025-08-26 20:01:26 +00:00
46576f5a16 Revert "[dynamo, nested graph breaks] prevent excessive recompilations (#159786)"
This reverts commit 67d31f6b281d3b15b205756fc7ebc450cdde1dab.

Reverted https://github.com/pytorch/pytorch/pull/159786 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/159786#issuecomment-3225535752))
2025-08-26 19:54:22 +00:00
77bc959fe1 Add inductor backend to device interface; make minifier_tests more device agnostic (#151314)
Tried to decouple the always cpu <=> c++, cuda <=> triton assumption. Tried to keep it relatively simple by just guarding things more specifically, at the moment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151314
Approved by: https://github.com/eellison
2025-08-26 19:40:37 +00:00
262640fd22 [ROCm][CI] restore test_flex_attention tests (#161519)
Reverts #161450 and targets specific subtests to skip on MI200.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161519
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-26 19:31:30 +00:00
74124d1b46 [reland] [dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#161514)
Summary:
convert_frame.compile_frame used to take a transform callback that captured the frame object, so the frame information was not passed directly into the compile_frame function.

This PR changes the signature of compile_frame so that frame information is passed directly into the function instead of through a callback. This makes it easier to build a fullgraph capture API on top of compile_frame.

Test Plan:
CI

Rollback Plan:

Differential Revision: D81041296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161514
Approved by: https://github.com/tugsbayasgalan
2025-08-26 19:16:05 +00:00
a03cc53e6f Back out "Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)" (#161002)
Summary: reverting this diff since it caused S551328. Please see D80217492 for details.

Test Plan:
NA

Rollback Plan:

Differential Revision: D80553588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161002
Approved by: https://github.com/jingsh, https://github.com/izaitsevfb
2025-08-26 19:04:13 +00:00
00efeabc29 [hop] make materialize_as_graph disable pre-existing dispatch modes (#161220)
For materializing_as_subgraph, we just want to trace a graph. The handling of different modes should register their own logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161220
Approved by: https://github.com/Lucaskabela
2025-08-26 18:52:38 +00:00
d4703fb91c [dtensor] Add propagate_tensor_meta function that skips cache if _are_we_tracing (#161334)
Fixes an issue where the log softmax handler checked the tensor metadata cache without checking for tracing or symints.
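
The idea of skipping the cache while tracing can be sketched as follows. This is an illustration only: `propagate_meta`, `state`, and `calls` are hypothetical names, and the `state["tracing"]` flag stands in for the `_are_we_tracing` check named in the title.

```python
import functools

state = {"tracing": False}  # hypothetical stand-in for the tracing check
calls = {"n": 0}            # counts how often the real computation runs


def _compute_meta(shape: tuple) -> tuple:
    calls["n"] += 1
    return ("meta", shape)


@functools.lru_cache(maxsize=None)
def _compute_meta_cached(shape: tuple) -> tuple:
    return _compute_meta(shape)


def propagate_meta(shape: tuple) -> tuple:
    # While tracing, inputs may be symbolic, so a cached result keyed on
    # them cannot be trusted; compute directly instead.
    if state["tracing"]:
        return _compute_meta(shape)
    return _compute_meta_cached(shape)
```

Routing every handler through one such function keeps the tracing check in a single place rather than repeating it at each call site.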

Probably best to merge this after #160798, but not strictly blocking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161334
Approved by: https://github.com/xmfan
2025-08-26 18:46:58 +00:00
cd87f30295 DOC: Clarify documentation for torch.matmul and fix a typo (#161424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161424
Approved by: https://github.com/AlannaBurke
2025-08-26 18:30:57 +00:00
f0e0a6897e type misc init and tools for dynamo (#161293)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161293
Approved by: https://github.com/anijain2305
2025-08-26 17:38:49 +00:00
d2bd55d8de Typo correction in variable name inital_grad of Class TestFullyShardG… (#161501)
Typo correction in variable name inital_grad of Class TestFullyShardGradientScaler implementation.

Fixes #161480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161501
Approved by: https://github.com/soulitzer
2025-08-26 17:16:42 +00:00
6598f00c18 [dynamo] auto lift unbacked symbol in tensor's storage_offset (#161199)
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

class M(torch.nn.Module):
    def forward(self, idx, x):
        u0 = idx.item()
        x0 = x.select(0, u0)
        def fn():
            return x0.sin()
        return torch.cond(x0.sum() > 0, fn, fn)

m = M()
out = torch.compile(m, fullgraph=True)(torch.tensor(0, dtype=torch.int64, device="cuda"), torch.randn(3, 3, device="cuda"))
print(out)

```

Before the PR, we didn't track the storage_offset symbol of a tensor. After https://github.com/pytorch/pytorch/pull/157605, we create an unbacked_symint for the storage_offset of the result of select. So when we try to lift the free basic symbols of x0 while speculating fn, we find a free symbol that's not bound to a proxy.

This PR tracks the symbols of storage_offset and associates them with a proxy using torch.ops.aten.storage_offset.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161199
Approved by: https://github.com/zou3519
ghstack dependencies: #161198
2025-08-26 17:06:54 +00:00
ba6ce66698 [dynamo] lift backed symint output of item() (#161198)
Before this PR, the following code raised an error:
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

class M(torch.nn.Module):
    def forward(self, idx, x):
        u0 = idx.item()
        x0 = x.select(0, u0)
        def fn():
            return x0.sin()
        return torch.cond(x0.sum() > 0, fn, fn)

m = M()
out = torch.compile(m, fullgraph=True)(torch.tensor(0, dtype=torch.int64), torch.randn(3, 3))
```

The error occurs while speculating fn: Dynamo tries to lift the symbol in x0.storage_offset() but finds that the symbol doesn't have a source associated with it.

What really happens is that when the input tensor is a scalar tensor of int type that resides on CPU, a shortcut creates a normal (backed) symint when .item() is called; see https://github.com/pytorch/pytorch/pull/126245.

Previously, however, we only tracked the unbacked symint outputs of an operation, because we believed every backed symint must have a source associated with it and must already have been lifted as an input at the top level. This invariant no longer holds, so we end up with an error saying the symbol doesn't have a source (only inputs and symbols derived from inputs have a source, and the result of .item() doesn't).

In this PR, we also start tracking the normal symint with the proxy that created it (i.e. in this case the proxy for .item()).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161198
Approved by: https://github.com/zou3519
2025-08-26 17:06:54 +00:00
ca9fe0107e [Inductor] Update Outer Reduction Heuristic (#159093)
Update outer reduction heuristics for significant speedups.

HuggingFace:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 44 51 AM" src="https://github.com/user-attachments/assets/4872a23b-d136-423a-b2e6-187895bccba1" />

Average ~20% speedup on a kernel by kernel basis

TorchBench:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 45 10 AM" src="https://github.com/user-attachments/assets/b8357b6d-6107-4104-b906-292a17d14d48" />

Average ~40% speedup on a kernel by kernel basis

<img width="1705" height="729" alt="Screenshot 2025-08-21 at 5 50 32 PM" src="https://github.com/user-attachments/assets/a9715a2b-9e6c-4b33-ba9f-7870dc561e31" />

Differential Revision: [D80835998](https://our.internmc.facebook.com/intern/diff/D80835998)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159093
Approved by: https://github.com/jansel
2025-08-26 16:12:07 +00:00
f9df4ec2af SDPA skip logic for ROCm (#160522)
Skips some tests for flex and efficient attention if they are not supported by the hardware

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160522
Approved by: https://github.com/drisspg, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-26 15:51:07 +00:00
a72803f1e3 [ez][CI] GIve the linux check job a name that isn't linux-job (#161413)
Reason:
The default name is linux-job, which gets put in the linux category on HUD, but this isn't really a linux related job.  Renaming it like this will make it go into the "other" category on HUD

Other options:
Change the grouping code in test-infra
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161413
Approved by: https://github.com/huydhn, https://github.com/seemethere
2025-08-26 15:18:35 +00:00
10e67f5ec3 forward fix #161102 (#161465)
PR #161102 caused tf32 to be the default precision for flex attention.  This PR forward-fixes the broken logic and restores ROCm MI200 CI flex attention test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161465
Approved by: https://github.com/jeffdaily, https://github.com/eqy

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-26 15:11:54 +00:00
818ba434c7 Revert "Ensure large tensor int32 -> int64 indexing is enabled (#157767)"
This reverts commit fc69c2bc67672c3b2d0c62c1821895f09288f1c0.

Reverted https://github.com/pytorch/pytorch/pull/157767 on behalf of https://github.com/atalman due to internal failure, sorry will revert ([comment](https://github.com/pytorch/pytorch/pull/157767#issuecomment-3224341111))
2025-08-26 14:12:06 +00:00
ae8d319fd4 Update NVSHMEM to 3.3.24 and fix download link (#161321)
https://github.com/pytorch/pytorch/issues/159779

Update NVSHMEM to 3.3.24 for [PyTorch CUDA13 Binary Cannot Be Built with SM_75 with NVSHMEM](https://github.com/pytorch/pytorch/issues/160980)
Re-enabled sm_75 for NVSHMEM
Fixed the NVSHMEM download link for the issue with the 3.3.20 download in issue - [[CD] nvshem-3.3.9 wheels for aarch64 is not manylinux2_28 compliant](https://github.com/pytorch/pytorch/issues/160425)

TODO: Also re-enable the ARM build with NVSHMEM, since it is compatible with manylinux2_28

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161321
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-08-26 13:26:18 +00:00
e795450a35 Revert "[dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#160900)"
This reverts commit 447d34b5f80fb7350f79decd855cb599cab39083.

Reverted https://github.com/pytorch/pytorch/pull/160900 on behalf of https://github.com/atalman due to reverting since can't land existing diff internally, will need to reland it ([comment](https://github.com/pytorch/pytorch/pull/160900#issuecomment-3224029031))
2025-08-26 12:45:59 +00:00
8c506e6310 [easy][test] Add repeat_interleave opinfo that exercises binary search fusion (#161445)
This adds a configuration that would have caught the need for https://github.com/pytorch/pytorch/pull/159961 when https://github.com/pytorch/pytorch/pull/158462 was landed.

Notably:
* the test has output_size kwarg specified
* the input is 1D plus a size-1 dimension (otherwise, if there are non-size-1 dimensions, then the fusion won't occur)

Differential Revision: [D80981715](https://our.internmc.facebook.com/intern/diff/D80981715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161445
Approved by: https://github.com/eellison, https://github.com/v0i0
2025-08-26 12:32:24 +00:00
4a1aca11c2 Revert "[inductor] structured-log graph execution order + test (#160448)"
This reverts commit 995397d47a0e27394ee1010f158e181eb304100a.

Reverted https://github.com/pytorch/pytorch/pull/160448 on behalf of https://github.com/atalman due to internal failure please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/160448#issuecomment-3223939035))
2025-08-26 12:20:37 +00:00
e9d42b3880 [small][muon] Use addmm for Newton–Schulz orthogonalization (#161379)
A performance optimization: use `torch.addmm`, which fuses `matrix multiply + scale + add` into one op.
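
For reference, the arithmetic that `torch.addmm(C, A, B, beta=..., alpha=...)` computes is `beta * C + alpha * (A @ B)`. The following is a minimal plain-Python sketch of that arithmetic only; the helper names `matmul`, `unfused`, and `addmm` are illustrative and are not the optimizer code from this PR.

```python
# Plain-Python reference for the arithmetic behind torch.addmm:
#   addmm(C, A, B, beta, alpha) == beta * C + alpha * (A @ B)
# Matrices are lists of rows; this illustrates semantics, not performance.

def matmul(a, b):
    """Return the matrix product a @ b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def unfused(c, a, b, beta=1.0, alpha=1.0):
    """Three separate steps: multiply, scale, add."""
    product = matmul(a, b)
    scaled = [[alpha * v for v in row] for row in product]
    return [[beta * cv + sv for cv, sv in zip(c_row, s_row)]
            for c_row, s_row in zip(c, scaled)]


def addmm(c, a, b, beta=1.0, alpha=1.0):
    """Same result computed in one pass over the output."""
    cols = list(zip(*b))
    return [[beta * c[i][j] + alpha * sum(x * y for x, y in zip(a[i], cols[j]))
             for j in range(len(cols))]
            for i in range(len(a))]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 0.0], [0.0, 1.0]]

fused_result = addmm(C, A, B, beta=2.0, alpha=0.5)
assert fused_result == unfused(C, A, B, beta=2.0, alpha=0.5)
print(fused_result)
```

In the unfused form, the product and the scaled product are materialized as separate intermediates; the fused form writes each output element once.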

**Benchmark**
When training a QWEN-like 0.5B model, the average `optimizer.step()` latency dropped from ~44.5 ms (matmul) to ~27.4 ms (addmm), a **1.62×** speedup.

matmul
<img width="1403" height="600" alt="Screenshot 2025-08-24 at 3 15 37 PM" src="https://github.com/user-attachments/assets/a77a68d4-da3c-473a-97f0-e6ef0a3b46d9" />

addmm
<img width="1426" height="602" alt="Screenshot 2025-08-24 at 3 13 42 PM" src="https://github.com/user-attachments/assets/e493af36-44d3-4026-9f7c-fd0f9cdbc7e5" />

**Testing**
End-to-end training:
We used a training script that pre-trains a QWEN-like model on the `openwebtext-100k` dataset. We trained for one epoch, and the resulting loss curves are consistent between normal matmul and addmm.
<img width="1035" height="434" alt="Screenshot 2025-08-24 at 2 56 21 PM" src="https://github.com/user-attachments/assets/b96b13e3-0a01-4908-853c-d917b41f3d75" />

Unit test:

```python
    # dummy model and data
    model0 = Linear(10, 10, bias=False)
    model1 = copy.deepcopy(model0)
    inputs = torch.randn(8, 10)
    targets = torch.randn(8, 10)
    loss = MSELoss()

    lr = 1e-3
    wd = 0.1
    momentum = 0.95

    opt_ref_muon = Muon(
        params=model0.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
        nesterov=nesterov,
        adjust_lr_fn="original",
    )

    opt_exp_muon = Muon(
        params=model1.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
        nesterov=nesterov,
        adjust_lr_fn="original",
        use_addmm=True,
    )

    out_ref = model0(inputs)
    loss_ref = loss(out_ref, targets)
    opt_ref_muon.zero_grad()
    loss_ref.backward()
    opt_ref_muon.step()

    out_exp = model1(inputs)
    loss_exp = loss(out_exp, targets)
    opt_exp_muon.zero_grad()
    loss_exp.backward()
    opt_exp_muon.step()

    for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
        torch.testing.assert_close(p_ref, p_exp)
```

The test shows a numeric difference, which is expected at bf16 precision:
```
Mismatched elements: 96 / 100 (96.0%)
Greatest absolute difference: 8.985400199890137e-05 at index (1, 9) (up to 1e-06 allowed)
Greatest relative difference: 0.007370449136942625 at index (0, 6) (up to 1e-05 allowed)
```

~~Introduced a flag that allows users to opt in, as there are numerical differences relative to the original implementation.~~
Update: since `addmm` fuses the math ops, there are fewer intermediate roundings, so the result is more numerically accurate than the original form. Based on this, we opt to make `addmm` the default and only option.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161379
Approved by: https://github.com/janeyx99
2025-08-26 09:17:28 +00:00
8cfc119491 [pytorch] Simplify codes using std::all_of() for _check_tensors_share_device_and_dtype() (#161411)
Summary: These two nested loops of checks could be simplified with `std::all_of()` to make it more compact.
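
The shape of this refactoring can be sketched in Python, where `all()` plays the role of `std::all_of()`. The real function is C++ and its exact checks are not shown in this message, so `TensorInfo` and both helper functions below are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorInfo:
    """Hypothetical stand-in for a tensor: only the fields being compared."""
    device: str
    dtype: str


def share_device_and_dtype_loops(tensor_lists):
    """Nested-loop form: return False at the first mismatch."""
    first = tensor_lists[0][0]
    for tensors in tensor_lists:
        for t in tensors:
            if t.device != first.device or t.dtype != first.dtype:
                return False
    return True


def share_device_and_dtype_all(tensor_lists):
    """Compact form: one predicate applied to every tensor."""
    first = tensor_lists[0][0]
    return all(
        t.device == first.device and t.dtype == first.dtype
        for tensors in tensor_lists
        for t in tensors
    )


a = TensorInfo("cuda:0", "float32")
b = TensorInfo("cpu", "float32")
print(share_device_and_dtype_all([[a, a], [a]]))
print(share_device_and_dtype_all([[a], [a, b]]))
```

Both forms short-circuit on the first mismatch, so the rewrite changes readability rather than behavior.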

Test Plan:
OSS CI & tests

Rollback Plan:

Differential Revision: D80946082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161411
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-08-26 08:56:24 +00:00
e7e270a33a [pytorch] Merge two nested if statement checks into one (#161387)
Summary: This reduces the code indentation level by one.

Test Plan:
OSS CI & tests

Rollback Plan:

Differential Revision: D80915357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161387
Approved by: https://github.com/janeyx99
2025-08-26 08:45:36 +00:00
6aef9f3a69 [Inductor][Tritonparse] Call jit_post_compile_hook within Inductor Triton Kernel compile path (#161443)
Summary: Since Inductor skips JIT compilation for Triton kernels, we need to manually invoke `knobs.runtime.jit_post_compile_hook` if one exists. Here, we do this to enable Tritonparse to extract launch metadata from Inductor-launched kernels. Whether Inductor runs the hook is controlled by a new `TORCHINDUCTOR_RUN_JIT_POST_COMPILE_HOOK=1` config variable.
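
The gating logic can be sketched roughly as follows. Only the environment variable name comes from this message; the function name `maybe_run_post_compile_hook`, the hook's keyword arguments, and the off-by-default behavior are assumptions for illustration, not the Inductor or Triton API.

```python
import os

ENV_FLAG = "TORCHINDUCTOR_RUN_JIT_POST_COMPILE_HOOK"


def maybe_run_post_compile_hook(hook, kernel_name, launch_metadata, environ=None):
    """Invoke `hook` only if one is registered and the flag is set to "1".

    Returns True when the hook ran. The keyword arguments passed to the hook
    are hypothetical; the real hook signature is defined by Triton.
    """
    environ = os.environ if environ is None else environ
    enabled = environ.get(ENV_FLAG, "0") == "1"  # assumed: disabled by default
    if not enabled or hook is None:
        return False
    hook(kernel_name=kernel_name, launch_metadata=launch_metadata)
    return True


calls = []
ran = maybe_run_post_compile_hook(
    lambda **kwargs: calls.append(kwargs),
    "triton_kernel_0",
    {"grid": (1,)},
    environ={ENV_FLAG: "1"},
)
print(ran, calls)
```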

Reviewed By: davidberard98

Differential Revision: D80624932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161443
Approved by: https://github.com/FindHao
2025-08-26 06:24:42 +00:00
7376111d59 [BE] fix compute_global_tensor_shape test (#161441)
Fixes #161154

**Test**
`pytest  test/distributed/tensor/test_utils.py -s -k test_compute_global_tensor_shape_1D`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161441
Approved by: https://github.com/kwen2501
2025-08-26 03:22:29 +00:00
92ab184824 Revert "[Inductor] Prune configs that require more shared memory than the hardware limit (#161040)"
This reverts commit b2e06e0194c3fa8f7578a1b48751cc027394fb67.

Reverted https://github.com/pytorch/pytorch/pull/161040 on behalf of https://github.com/jeffdaily due to still failing on rocm, see https://hud.pytorch.org/failure?name=rocm%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(default%2C%203%2C%206%2C%20linux.rocm.gpu.2)&jobName=undefined&failureCaptures=inductor%2Ftest_triton_heuristics.py%3A%3ATestTritonHeuristics%3A%3Atest_prune_configs_over_shared_memory_limit_do_pruning_True ([comment](https://github.com/pytorch/pytorch/pull/161040#issuecomment-3222430129))
2025-08-26 03:15:32 +00:00
8c442e4fd3 Fix LBFGS warning convert a tensor with requires_grad=True to a scalar (#160389)
Fixes #160197

## Test Result

```python
In [1]: import warnings
   ...: warnings.simplefilter('error')
   ...: import torch
   ...: print(torch.__version__)
   ...: a, b = torch.rand((2, 32, 32))
   ...: a.requires_grad_()
   ...: optimizer = torch.optim.LBFGS([a])
   ...: loss_fn = lambda x, y: (x-y).pow(2).mean()
   ...:
   ...: def closure():
   ...:     optimizer.zero_grad()
   ...:     loss = loss_fn(a, b)
   ...:     loss.backward()
   ...:     return loss
   ...:
   ...: for i in range(100):
   ...:     optimizer.step(closure)
   ...:     print(i, loss_fn(a, b))
   ...:
2.9.0a0+gitf33f3f8
0 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
1 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
2 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
3 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
4 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
5 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
6 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
7 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
8 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
9 tensor(5.8066e-11, grad_fn=<MeanBackward0>)
10 tensor(5.8066e-11, grad_fn=<MeanBackward0>)

...

```

```bash
pytest test/test_optim.py -vv

...
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_NAdam_cuda_float32 PASSED [2.7192s]                                                                                                                                           [ 99%]
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_RAdam_cuda_float32 PASSED [2.5370s]                                                                                                                                           [ 99%]
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_RMSprop_cuda_float32 PASSED [2.0190s]                                                                                                                                         [ 99%]
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_Rprop_cuda_float32 PASSED [1.8554s]                                                                                                                                           [ 99%]
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_SGD_cuda_float32 PASSED [2.0433s]                                                                                                                                             [ 99%]
test/test_optim.py::TestOptimRenewedCUDA::test_tensor_lr_num_dim_2_SparseAdam_cuda_float32 PASSED [1.1788s]                                                                                                                                      [100%]

================== 1471 passed, 242 skipped in 2440.52s (0:40:40) ============
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160389
Approved by: https://github.com/janeyx99

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-08-26 03:07:47 +00:00
e34b6a0103 Add meta for add.Scalar (#161332)
Fixes https://github.com/pytorch/pytorch/issues/161076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161332
Approved by: https://github.com/Skylion007
2025-08-26 02:26:51 +00:00
f795e92802 space added between type and checking for typechecking (#161352)
space added between type and checking for "typechecking"

Fixes #161282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161352
Approved by: https://github.com/malfet
2025-08-26 02:07:33 +00:00
becd6cd744 Increase timeout value when pushing to ghcr.io (#161444)
This is timing out a lot in trunk now: https://github.com/pytorch/pytorch/actions/runs/17165552358/job/48705069047. The benchmark image is the largest one we have on CI, so it's probably over the 30-minute limit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161444
Approved by: https://github.com/atalman
2025-08-26 01:51:16 +00:00
ec21cafd85 [OpenReg] Refactor and Optimize the OpenReg for Preparation of Docs (#159640)
As the title states.

**Changes:**

- Fix a bug where abs_stub could not be triggered
- Refactor registration to prepare for documentation
- Add meta and fallback for openreg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159640
Approved by: https://github.com/albanD
2025-08-26 01:44:21 +00:00
908b0ccb1f Revert "Increase timeout value when pushing to ghcr.io (#161444)"
This reverts commit b9e9e92817fd7d1a778f074105603efb07e05004.

Reverted https://github.com/pytorch/pytorch/pull/161444 on behalf of https://github.com/huydhn due to Reland this to generate a different hash value for the benchmark Docker image ([comment](https://github.com/pytorch/pytorch/pull/161444#issuecomment-3222257119))
2025-08-26 01:41:59 +00:00
85adf80cf1 Disable inductor/test_flex_attention.py (#161450)
Currently inductor/test_flex_attention.py is causing rocm pytorch mi250 shard 1 to go over the timeout limit. This PR is for disabling that test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161450
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-26 01:28:51 +00:00
74c4c758af [cpp_wrapper] Swap to new PyBind11 simple GIL header (#161063)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161063
Approved by: https://github.com/Skylion007
ghstack dependencies: #160754
2025-08-26 01:21:18 +00:00
660b0b8128 Update pybind11 submodule to 3.0.1 (#160754)
Upgrade to PyBind11 v3. This allows us to strip out our own (possibly broken?) handling of the C++ ABI when building extensions, in favor of the more-complete PyBind11 internal handling.

Fixes a few test failures due to https://github.com/pybind/pybind11/issues/5774, which effectively makes the `__qualname__` attribute of functions platform-dependent.

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160754
Approved by: https://github.com/Skylion007
2025-08-26 01:21:18 +00:00
089ad1d88b [1/n][export] Refactor PT2 Archive weight saving and loading (#160394)
Summary:

We split the refactoring into two parts because of forward-compatibility concerns:
First, we land the deserialization (loading) part.
Then, we land the serialization (saving) part.

Save weights and constants as individual files in the PT2 archive. Each weight/constant is saved as raw bytes unless it is a custom object (TorchBind object) or a non-fake tensor subclass; in these two special cases we still save it using pickle.

The metadata of each saved tensor, along with its file name, is saved as `PayloadMeta`.
The mapping from FQN to `PayloadMeta` is saved as `PayloadConfig` under `WEIGHTS_CONFIG_FORMAT` and `CONSTANTS_CONFIG_FORMAT`.

This changes the serialization on the Python side when calling `torch.export.save()`.

For deserialization in Python (`torch.export.load()`), we make it BC-safe by allowing legacy-format weights/constants to be loaded.

For deserialization in C++ (`torch/nativert/ModelRunner.cpp`), we make this a BC-breaking change, as the OSS ModelRunner API is not currently being used.

The file structure

```
├── archive_format
├── archive_version
├── byteorder
├── .data
│   ├── serialization_id
│   └── version
├── data
│   ├── sample_inputs
│   │   └── model.pt
│   ├── constants
│   │   ├── tensor_0
│   │   ├── tensor_1
│   │   └── model_constants_config.json
│   └── weights
│       ├── weight_0
│       ├── weight_1
│       ├── weight_2
│       ├── weight_3
│       └── model_weights_config.json
└── models
    └── model.json
```
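
To make the layout above concrete, here is a minimal sketch of how per-weight raw-byte files and a JSON config mapping each FQN to its metadata could be assembled. The class `PayloadMetaSketch`, its fields, the `"config"` key, and `build_weight_files` are hypothetical; the real `PayloadMeta` and `PayloadConfig` schemas may differ.

```python
import json
import struct
from dataclasses import asdict, dataclass


@dataclass
class PayloadMetaSketch:
    """Hypothetical per-tensor metadata; the real PayloadMeta fields may differ."""
    path_name: str
    dtype: str
    sizes: list


def build_weight_files(weights):
    """Map {fqn: (dtype, sizes, raw_bytes)} to {archive_path: file_bytes}."""
    files = {}
    config = {}
    for index, (fqn, (dtype, sizes, raw)) in enumerate(weights.items()):
        name = f"weight_{index}"
        files[f"data/weights/{name}"] = raw  # raw bytes, no pickle
        config[fqn] = asdict(PayloadMetaSketch(name, dtype, sizes))
    config_json = json.dumps({"config": config}, indent=2)
    files["data/weights/model_weights_config.json"] = config_json.encode("utf-8")
    return files


files = build_weight_files(
    {"linear.weight": ("float32", [2], struct.pack("<2f", 1.0, 2.0))}
)
print(sorted(files))
```

A loader would read the config file first and then use each entry's file name, dtype, and sizes to reinterpret the matching raw-byte file.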

Test Plan:
CI

Rollback Plan:

Differential Revision: D80035490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160394
Approved by: https://github.com/SherlockNoMad
2025-08-26 01:15:42 +00:00
67d31f6b28 [dynamo, nested graph breaks] prevent excessive recompilations (#159786)
Nested continuation function code objects are now unique w.r.t. stack trace below (and including) the current code object.

Without this change, e.g. in the added test, `f3` would be recompiled on the second graph break.

Followup: we can skip guards on continuation functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159786
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971, #159281, #144516, #159329, #159678, #159817, #160138
2025-08-26 00:58:38 +00:00
ac6316caaa [dynamo, nested graph breaks] clean up comments and codegen (#160138)
Fix comments to reflect that we no longer codegen cells to be sent to resume function as inputs - they are instead codegen'd after the unsupported instruction in order to build resume functions that are closures.

Also simplify some codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160138
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971, #159281, #144516, #159329, #159678, #159817
2025-08-26 00:58:38 +00:00
ef0ef6f93f [dynamo, nested graph breaks] support nested closures (#159817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159817
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971, #159281, #144516, #159329, #159678
2025-08-26 00:58:28 +00:00
02fa5bf6d8 [dynamo, nested graph breaks] support nested graph breaks x context managers (#159678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159678
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971, #159281, #144516, #159329
2025-08-26 00:58:18 +00:00
8dab6d4c41 [dynamo, nested graph breaks] support very simple nested graph breaks (#159329)
e.g. this graph breaks once now:
```python
import torch

torch._dynamo.config.nested_graph_breaks = True

def inner(x):
    x = x + 1
    torch._dynamo.graph_break()
    return x + 2

@torch.compile(backend="eager")
def outer(x):
    return inner(x)

print(outer(torch.ones(3)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159329
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971, #159281, #144516
2025-08-26 00:58:07 +00:00
9a756c2d71 [dynamo, nested graph breaks] add nested graph break tests (#144516)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144516
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971, #159281
2025-08-26 00:57:58 +00:00
504a6445a4 [dynamo, nested graph breaks] use CALL_FUNCTION_EX when calling resume function (#159281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159281
Approved by: https://github.com/anijain2305
ghstack dependencies: #157971
2025-08-26 00:57:48 +00:00
2df9b437e3 [dynamo, nested graph breaks] implement new resume frame stack/locals/cell layout convention (#157971)
The comments/conventions are not exactly correct here, as the implementation at this PR is partial. They will be fixed in #160138.

No tests added, since there shouldn't be any overall semantic changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157971
Approved by: https://github.com/anijain2305
2025-08-26 00:57:39 +00:00
4e19c1906a Get Inductor periodic CI green (#161297)
I'll file hi-pri issues for the things that need looking into.

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161297
Approved by: https://github.com/angelayi
2025-08-26 00:49:49 +00:00
332fa5b388 [Inductor][Triton] Fix SCALING_ROWWISE misclassification for scalar scales (#160450)
Summary: In `tuned_scaled_mm()`, we unsqueeze any scalar scale from [] -> [1, 1]. Later, when determining how to set the `SCALING_ROWWISE` kernel attribute, we check whether the scale has 2 dimensions. However, since we previously unsqueezed any scalar scales, this check always evaluates to True.
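
The pitfall can be reproduced with shapes alone. This is a sketch using plain tuples for shapes; the helper names are hypothetical, and recording the dimensionality before unsqueezing is only one possible fix, not necessarily the check used in this PR.

```python
def unsqueeze_scalar_scale(shape):
    """Mimic the [] -> [1, 1] reshaping applied to scalar scales."""
    return (1, 1) if len(shape) == 0 else tuple(shape)


def is_rowwise_buggy(shape_after_unsqueeze):
    """Pitfall: after unsqueezing, every scale has 2 dimensions."""
    return len(shape_after_unsqueeze) == 2


def is_rowwise_fixed(original_shape):
    """One possible fix (assumed): decide from the shape before unsqueezing."""
    return len(original_shape) == 2


scalar_scale = ()          # tensor-wise scaling: a 0-dim scale
rowwise_scale = (128, 1)   # row-wise scaling: one scale per row

for shape in (scalar_scale, rowwise_scale):
    after = unsqueeze_scalar_scale(shape)
    print(shape, after, is_rowwise_buggy(after), is_rowwise_fixed(shape))
```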

Test Plan:
Run the following tests in test/inductor/test_fp8.py:
test_tensorwise_scaling_tma_template
test_rowwise_scaling_tma_template

Rollback Plan:

Differential Revision: D80108117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160450
Approved by: https://github.com/eellison
2025-08-26 00:24:55 +00:00
b9e9e92817 Increase timeout value when pushing to ghcr.io (#161444)
This is timing out a lot in trunk now: https://github.com/pytorch/pytorch/actions/runs/17165552358/job/48705069047. The benchmark image is the largest one we have on CI, so it's probably over the 30-minute limit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161444
Approved by: https://github.com/atalman
2025-08-25 23:52:59 +00:00
e6aa7287f8 [pytorch] Leverage unordered_map.try_emplace() to simplify code (#161388)
Summary: Because [`unordered_map.try_emplace()`](https://en.cppreference.com/w/cpp/container/unordered_map/try_emplace.html) does not invoke the value's constructor if the key already exists, this matches the previous behavior of checking for the key's existence first and then instantiating the value.

Test Plan:
OSS CI & tests

Rollback Plan:

Differential Revision: D80916349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161388
Approved by: https://github.com/janeyx99
2025-08-25 23:33:59 +00:00
94b9569c4a Forward fix periodic vision build (#161408)
Trying to forward fix https://github.com/pytorch/pytorch/issues/161358 (use the SM 80 architecture by default)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161408
Approved by: https://github.com/zou3519, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-08-25 23:28:22 +00:00
2cf7ac2fb7 Issue 160495 inductor complex float (#160736)
Avoiding the call to `tensor.view(tensor.real.dtype)` when `tensor.ndim == 0` fixes the issue. A reshape is called in that case. Fixes #160495

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160736
Approved by: https://github.com/ngimel
2025-08-25 23:23:13 +00:00
447d34b5f8 [dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#160900)
`convert_frame.compile_frame` used to take a callback transform function that captured the frame object, so the frame information was not passed directly into `compile_frame`.

This PR changes the signature of `compile_frame` so that the frame information is passed directly into the function, without a callback. This makes it easier to build a fullgraph capture API on top of `compile_frame`.
@exported-using-ghexport

Differential Revision: [D80469801](https://our.internmc.facebook.com/intern/diff/D80469801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160900
Approved by: https://github.com/tugsbayasgalan, https://github.com/anijain2305
2025-08-25 23:16:21 +00:00
b2e06e0194 [Inductor] Prune configs that require more shared memory than the hardware limit (#161040)
Summary:
This diff removes configs that require more shared memory than the hardware limit, which causes the following compilation error:
```
No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
```
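
A pruning pass of this kind can be sketched as below. The shared-memory estimate (one A tile and one B tile buffered per pipeline stage) and the names `MMConfigSketch`, `estimated_shared_memory`, and `prune_configs` are assumptions for illustration, not Inductor's actual heuristic; the limit is the value from the error message above.

```python
from dataclasses import dataclass

HARDWARE_LIMIT_BYTES = 232448  # limit quoted in the error message above


@dataclass(frozen=True)
class MMConfigSketch:
    """Hypothetical matmul tile config."""
    block_m: int
    block_n: int
    block_k: int
    num_stages: int


def estimated_shared_memory(cfg, dtype_bytes=2):
    """Assumed estimate: each stage buffers one A tile and one B tile."""
    tile_elems = cfg.block_m * cfg.block_k + cfg.block_k * cfg.block_n
    return cfg.num_stages * tile_elems * dtype_bytes


def prune_configs(configs, limit=HARDWARE_LIMIT_BYTES):
    """Keep only configs whose estimate fits within the limit."""
    return [c for c in configs if estimated_shared_memory(c) <= limit]


large = MMConfigSketch(128, 128, 128, num_stages=5)
small = MMConfigSketch(64, 64, 32, num_stages=3)
kept = prune_configs([large, small])
print(estimated_shared_memory(large), estimated_shared_memory(small), kept)
```

Filtering before autotuning means an oversized config is dropped up front instead of failing at kernel compilation time.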

Test Plan:
```
buck2 test mode/dev-nosan fbcode//caffe2/test/inductor:max_autotune -- test_max_autotune_prune_choices -v 1,stderr
```

Rollback Plan:

Differential Revision: D80594562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161040
Approved by: https://github.com/eellison
2025-08-25 23:09:09 +00:00
fc69c2bc67 Ensure large tensor int32 -> int64 indexing is enabled (#157767)
Fixes: https://github.com/pytorch/pytorch/issues/157446

I think that this delta is worth the switch from block-ptrs, especially since they are deprecated

## Perf Summary

A is nightly and B is this diff, so `negative` means this diff improves perf
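
As a reading aid, the sign convention can be sketched as follows, assuming the difference is computed as `(A - B) / A` on TFlops; that formula and the helper name `relative_difference_percent` are assumptions, while the two sample rows are taken from the full table in this message.

```python
def relative_difference_percent(tflops_a, tflops_b):
    """Assumed convention: (A - B) / A, so negative means B is faster."""
    return (tflops_a - tflops_b) / tflops_a * 100.0


# (attn_type, shape, TFlops Version A, TFlops Version B)
rows = [
    ("noop", (2, 16, 1024, 16, 1024, 64), 258.38834144791923, 258.6353685004612),
    ("causal", (2, 16, 1024, 16, 1024, 64), 142.2192450677751, 140.12393320464972),
]

deltas = {}
for attn_type, shape, a, b in rows:
    deltas[attn_type] = relative_difference_percent(a, b)
    verdict = "improves" if deltas[attn_type] < 0 else "regresses"
    print(f"{attn_type} {shape}: {deltas[attn_type]:+.2f}% ({verdict})")
```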

TOP 5 differences
<img width="805" height="754" alt="Screenshot 2025-08-24 at 5 49 49 PM" src="https://github.com/user-attachments/assets/aa359cdf-ee9a-427d-be72-1b9aef6f3115" />

<details>
  <summary><strong>Full perf table (click to expand)</strong></summary>

| attn_type | dtype | shape(B,Hq,M,Hkv,N,D) | TFlops Version A | TFlops Version B |
| --- | --- | --- | --- | --- |
| noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 258.38834144791923 | 258.6353685004612 |
| causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.2192450677751 | 140.12393320464972 |
| alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 122.32683823617003 | 118.51603755647925 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.48556906165314 | 137.24259849208627 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 86.59814488695922 | 84.59431398586257 |
| noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 288.52679758135764 | 292.9174195871856 |
| causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 172.25541683643277 | 172.94326459828508 |
| alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 164.40864610599826 | 165.035129576335 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 176.54876886433945 | 175.08057670028145 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 125.22491679812626 | 121.06201152859151 |
| noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 339.11952481874283 | 339.0132835601695 |
| causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 227.58583240284406 | 228.21824999409597 |
| alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 185.98569659868966 | 182.32850843255093 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 188.9495725191772 | 180.31385312481657 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 106.25789530994302 | 106.55084959448476 |
| noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 357.6430536888533 | 363.30843452247274 |
| causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 262.3241154406613 | 265.73250045488 |
| alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 249.30498953911416 | 249.35928192833785 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 224.74126243851808 | 223.71776504077988 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 168.26977014013707 | 165.47991483333809 |
| noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 382.8178701785897 | 384.34752965862685 |
| causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 308.1449710013853 | 311.0653716044644 |
| alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 251.96365252505072 | 243.92283557225903 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 226.69316232745368 | 215.22769268913356 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 153.34142545296405 | 151.9312673939401 |
| noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 396.0998000753126 | 398.35036286102473 |
| causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 333.5198415274966 | 344.6354466169716 |
| alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 310.5955933379696 | 305.66347819546 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 260.4012412689896 | 259.758666997307 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 234.13034252182635 | 227.61676497283614 |
| noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 396.17615538477196 | 401.1419104525502 |
| causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 359.98648311998414 | 360.8285563463094 |
| alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 291.97720707257736 | 281.41694809965253 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 250.1703628419691 | 238.556760291579 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 199.50782826294306 | 191.52327358439223 |
| noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 411.0632004785396 | 413.6362648405517 |
| causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 382.9404387613185 | 397.74886235657607 |
| alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 357.0998545146633 | 350.5115200772392 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 281.8033924428203 | 281.98601309215843 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 282.56595134222135 | 277.4565795466672 |
| noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 408.89838018149516 | 405.14531386840076 |
| causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 396.07662058160264 | 393.4598228299578 |
| alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 317.8822887267849 | 304.754931401036 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 265.8801304948243 | 254.22961974295112 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 227.87390579965614 | 222.19481980110393 |
| noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 427.36821778477025 | 431.3766620314935 |
| causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 410.67994346825 | 423.4666944003808 |
| alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 381.1968748374038 | 381.77668006420424 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 292.5540046358546 | 296.5439130720502 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 321.04573768858114 | 310.7423616656888 |
| noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 427.46148866769903 | 426.162091037068 |
| causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 419.75580537687347 | 421.88640120274334 |
| alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 337.3208051798903 | 327.4912454675092 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 276.5638854539581 | 262.988360558083 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 250.82791326036886 | 245.07367032501736 |
| noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 435.8055824506086 | 441.8803729460534 |
| causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 432.02638235921006 | 450.33161016596273 |
| alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 402.25525939224883 | 393.8564689669916 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 297.5337286675904 | 297.0131881135074 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 343.8697037899545 | 329.8194073407783 |
| noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 267.58912366821056 | 256.91606054118375 |
| causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 150.81723692609629 | 146.32172267858743 |
| alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 129.51029293209245 | 122.72144394093334 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 147.627656359087 | 141.68956350566188 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 87.55100546003591 | 84.91293287692788 |
| noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 299.5931492743986 | 305.884253766691 |
| causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 179.39026367843837 | 181.64741311605096 |
| alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 173.93547669282367 | 173.23972950980564 |
| sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 185.90234171599252 | 182.80844545446686 |
| document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 128.08176696266082 | 123.27722685662111 |
| noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 340.50674552770664 | 338.9071088484576 |
| causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 225.4438318650432 | 230.22899884832975 |
| alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 194.15123248528312 | 185.02793973094865 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 200.74289714108176 | 191.76606719670647 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 107.03564946728423 | 106.82432377861258 |
| noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 371.31799283918406 | 379.7555394732925 |
| causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 275.97762744310455 | 276.71106853992995 |
| alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 261.6648679783462 | 259.4127232060398 |
| sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 237.03108223577615 | 233.92710216149527 |
| document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 172.13926800371152 | 168.74390922407585 |
| noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 381.50199487767276 | 383.9043681999597 |
| causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 307.9748883093411 | 312.2403515462001 |
| alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 251.11319684705438 | 243.17870127827277 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 236.3253127246763 | 223.81250201769552 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 154.55693991756874 | 153.11360584987685 |
| noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 407.11400078586615 | 413.53709886086557 |
| causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 348.1705797722622 | 360.09771155957367 |
| alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 321.8593280850388 | 318.2882327401255 |
| sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 270.089032013835 | 268.767323026064 |
| document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 238.07324557907788 | 228.09842078362692 |
| noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 399.8172853171901 | 401.0954526332136 |
| causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 363.4387330438581 | 364.13111024232677 |
| alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 294.1752429133857 | 283.7235663368415 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 256.8389394007649 | 246.91771015606483 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 199.3378564292656 | 192.40439590901758 |
| noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 425.5150965556111 | 430.8190098707553 |
| causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 396.00437184073013 | 411.3873625655787 |
| alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 369.92803661607815 | 361.43244467343663 |
| sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 293.4277354412933 | 295.2529537595746 |
| document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 288.0208673072841 | 281.51896404878863 |
| noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 408.3005367220567 | 408.96116482298913 |
| causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 396.90095962766304 | 396.87385456176486 |
| alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 319.0534576137999 | 302.50950358107764 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 270.3334977708081 | 258.8506349486557 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 227.46824134365394 | 222.23759438128766 |
| noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 438.24247309479694 | 437.7975163205371 |
| causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 428.34012029699227 | 433.3215899950434 |
| alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 386.52672049728875 | 388.26216893354984 |
| sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 302.71976814728083 | 302.3574867306459 |
| document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 327.39760662780986 | 308.6348428844912 |
| noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 423.31308678262695 | 426.6306972137279 |
| causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 412.6983690923106 | 419.4961977664297 |
| alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 337.41003544742273 | 324.2155049126126 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 278.7755890910794 | 265.9194286636502 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 251.55678254755364 | 244.8843180141462 |
| noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 452.5930781172308 | 457.7117122300742 |
| causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 445.05676260348116 | 463.9304535499636 |
| alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 415.78302138389415 | 406.29229555271456 |
| sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 308.0311067300895 | 304.91354721414314 |
| document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 351.43943626809335 | 329.4476923070317 |
| noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 295.1801525813241 | 291.36521287398904 |
| causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 183.23250549178067 | 182.35421238887605 |
| alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 151.56832453117747 | 151.3422139154794 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 171.02111935180432 | 160.72516856727913 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 74.05765122783826 | 74.5885345035243 |
| noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 314.3587394591763 | 319.2938677773619 |
| causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 224.57002084153177 | 225.48868542008177 |
| alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.00964804143052 | 215.39576159953486 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.1174237618258 | 214.28437413525663 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 121.08920423648368 | 119.55813661872644 |
| noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 362.2193857281911 | 360.05005804275936 |
| causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 279.8840217430121 | 279.5437918286659 |
| alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 227.76617121021982 | 222.8655938229316 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 215.43141176970562 | 207.71852284994702 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 121.35588364218539 | 121.20636565046884 |
| noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 365.1545280898012 | 373.37585444987326 |
| causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 304.360119952975 | 309.1247297936263 |
| alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 287.2603904544586 | 289.25547903162595 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 257.9852675272418 | 257.59069234098115 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 188.35158496670232 | 184.24683960154857 |
| noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 389.9744911369211 | 388.43466897254166 |
| causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 345.9228295166513 | 342.63034895210126 |
| alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 279.56334658247437 | 271.2724375402088 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 245.66477202810066 | 233.49688207371258 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 170.3270720653187 | 166.23863845657382 |
| noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 400.0041140827554 | 402.11182445396497 |
| causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 363.64641830327434 | 375.9288663364792 |
| alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 341.5776139573363 | 335.1160003213424 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 281.1811770268521 | 280.21438270014005 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 247.78716118997716 | 245.3269825179633 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 403.794126680488 | 405.2353919019577 |
| causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 387.079178426863 | 385.1461762057035 |
| alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 309.7847188173431 | 298.0443968374749 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 262.4721750159666 | 250.81679725428586 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 205.70866004479979 | 202.9620839129557 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 413.380982988662 | 418.40270594263103 |
| causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 398.450064800682 | 409.6794973994029 |
| alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 372.26297458194466 | 364.44415106552196 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 293.0818569905912 | 292.85172400643984 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 296.46717085592087 | 285.76362010612763 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 419.3186786037592 | 426.08801580934437 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 408.1648467766632 | 409.4122254207817 |
| alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 329.24396020457345 | 313.5200995121138 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 274.61257504571876 | 255.7801815432177 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 232.63806001220684 | 230.03020843492314 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 435.0785891054788 | 440.39101804225345 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 424.86925312752817 | 435.18898057396825 |
| alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 393.000417896268 | 395.11543361225256 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 297.7755459218185 | 300.7208114715287 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 331.71570861760534 | 318.07127352552885 |
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 424.58602747137405 | 425.84897078470715 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 422.66607285025725 | 423.5524945535485 |
| alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 344.8625760048626 | 331.6793888458635 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 282.0787281511649 | 263.7895634445868 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 252.7301927385177 | 245.41844170037427 |
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 437.0658069164588 | 442.9101960063628 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 433.13788271434646 | 452.3873572709863 |
| alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 404.0959191546953 | 396.7077863894884 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 300.45502211883206 | 301.3439134717943 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 344.11003202413934 | 330.8897663350314 |
| noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 298.4364205341705 | 291.6793556507056 |
| causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 187.6382133139633 | 191.05409897308772 |
| alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 156.55822078636112 | 154.178925976516 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 173.47765221825162 | 169.30862508068464 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 74.5885345035243 | 74.52689061607104 |
| noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 323.12233826013045 | 328.53889207933514 |
| causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 236.75872140126316 | 235.8378325547398 |
| alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 227.17836523816675 | 226.75357076139966 |
| sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 224.07209453308036 | 224.07209453308036 |
| document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 122.85572156047981 | 121.11642183704716 |
| noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 361.3123326658092 | 360.71014086458337 |
| causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 281.5287983927017 | 281.94301754758345 |
| alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 232.7456696285686 | 226.50976826432776 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 221.5612361744038 | 214.96188822837055 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 121.38311528944315 | 120.85441868178513 |
| noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 380.2579019244734 | 389.2520157863988 |
| causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 316.95230660496924 | 317.87597790618906 |
| alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 301.07968126657323 | 298.02424098422983 |
| sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 267.2240756921594 | 267.16353549228154 |
| document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 189.82761622494257 | 186.736450261963 |
| noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 389.88665375406805 | 387.9125133037077 |
| causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 348.70619958684887 | 346.6750499749774 |
| alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 280.5472989906087 | 271.22300822012187 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 250.02397620165968 | 241.22532776331445 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 171.67817496107645 | 166.95679280483972 |
| noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 412.626880230807 | 417.60238657950777 |
| causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 374.8829313933945 | 389.4448546468815 |
| alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 353.20410434172436 | 345.7072490717473 |
| sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 292.51045924209586 | 291.66621022138287 |
| document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 251.6264062063495 | 248.45110052911542 |
| noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 404.0155784550126 | 401.90546837237514 |
| causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 384.4389015599863 | 386.9684324594344 |
| alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 313.3731284132225 | 298.17074251037894 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 264.19199737284265 | 252.8982463999916 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 207.03696315185684 | 202.86697323136772 |
| noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 428.2436763312506 | 433.45005568619536 |
| causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 411.8516531869893 | 428.2753623461049 |
| alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 384.9095037182509 | 372.90888743000744 |
| sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 303.2438915629836 | 302.05095952914337 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 301.8689122735564 | 285.0363190513223 |
| noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 423.13592231504805 | 420.3991500185611 |
| causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 407.44527331585493 | 408.5064370765247 |
| alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 330.50050996167414 | 316.8763979925965 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 274.6833786307413 | 259.86098862141324 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 232.24019584158367 | 226.52040268160232 |
| noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 444.4596314237808 | 455.99558915752266 |
| causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 437.4245561244369 | 455.98275147271966 |
| alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 397.3350686877605 | 397.88875599028063 |
| sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 308.53809114394545 | 307.1359822042007 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 331.32379843423774 | 316.85293191675646 |
| noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 422.4622274366379 | 425.0407156418684 |
| causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 420.9547052783101 | 430.33779243510276 |
| alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 345.50265346504085 | 332.094855328957 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 280.81715528243365 | 264.6543640282054 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 252.25635200421783 | 245.46235499490305 |
| noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 452.5524207341139 | 461.7512032176736 |
| causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 445.2316469907137 | 464.4523799578466 |
| alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 416.87264016717023 | 409.17124592157046 |
| sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 309.42579489389846 | 307.9734464665731 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 350.50782004300623 | 330.98959545427294 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157767
Approved by: https://github.com/Skylion007
2025-08-25 22:51:00 +00:00
adecb0c9e8 [Cutlass-EVT] Fix buffer size issues (#161335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161335
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #161398
2025-08-25 22:08:30 +00:00
d57c79e609 [Cutlass] Fix regression from f7ad69f (#161398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161398
Approved by: https://github.com/henrylhtsang
2025-08-25 22:08:30 +00:00
1a566c4909 Remove Python 3.9 nightly builds (#161427)
Please see https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161427
Approved by: https://github.com/huydhn
2025-08-25 22:05:40 +00:00
37a34022b5 [Pattern Matcher] improve error msg (#161423)
Updates pattern matcher error message

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161423
Approved by: https://github.com/mengluy0125, https://github.com/masnesral
2025-08-25 21:48:54 +00:00
763053dc53 Always run OIDC auth on B200 to be able to upload artifacts to S3 (#161436)
Reported by @drisspg: in its current form, the OIDC auth step did not run when the preceding test step failed. It needs to run unconditionally so that artifacts can still be uploaded to S3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161436
Approved by: https://github.com/nWEIdia, https://github.com/drisspg
2025-08-25 21:05:20 +00:00
cf94cadbee [CUDAGraph] Add getter for cuda graph exec (#161294)
This is far simpler than #155164 since we never destroy the cudaGraphExec_t.

The request comes from TRT-LLM specifically. The motivation is that some power users would like to mutate specific kernel parameters via APIs like `cudaGraphExec*SetParams` after a cuda graph has been instantiated. For example, a common request has been to be able to change the sequence length of attention kernels, after having captured a graph for the largest possible sequence length. It turns out that the host overhead you eliminate via cuda graphs in LLM inference ends up causing an increase in computation time when you size your kernels to the maximum possible sequence length (which I believe is done in both TRT-LLM and vLLM). Attention is the most problematic kernel because its computation time is quadratic in the sequence length, rather than linear.

This can work if your attention kernel handles arbitrary shapes (not all attention implementations do; many of them specialize with templates), and you have a persistent kernel that allocates only as many blocks as you have SMs (so you don't have to figure out how many blocks to allocate for a specific sequence length). Using a conditional SWITCH node is a better generic approach to this problem, but that requires more infrastructure work.

Note that this requires knowing the exact location, in your kernel's parameter buffer, of the value you want to mutate. It won't work with arbitrary stream capture code whose kernels you don't know beforehand. So I expect this code path to be rarely used.

Testing:

```
pytest -s -k raw_graph_exec test/test_cuda.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161294
Approved by: https://github.com/ngimel, https://github.com/BoyuanFeng, https://github.com/eellison, https://github.com/eqy
2025-08-25 20:57:37 +00:00
995397d47a [inductor] structured-log graph execution order + test (#160448)
Summary:

- Emit a structured trace per compiled graph execution to reconstruct execution order in TLParse.
- Adds debug.log_graph_execution(name) called from `CompiledFxGraph.__call__`, producing an artifact named inductor_graph_execution with payload {"graph": "graph_<id>"}.

Testing:
- Add inline test to verify structure and output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160448
Approved by: https://github.com/xmfan
2025-08-25 20:12:18 +00:00
ffa1ce7650 Fix the parity of original and exported module parameters (#160600)
## Problem
Fixes a parameter-name mismatch during `torch.export` in strict mode (see the "How to reproduce the issue?" section below).

When two attributes map to the same tensor, strict mode will:
1. Build a standard param/buffer table to standardize the names. The bug happens [here](f861dc1826/torch/export/_trace.py (L356)): when two parameters have the same `id(param)`, the later name overwrites the earlier one.
2. [Update](f861dc1826/torch/export/_trace.py (L1481)) the exported signature with the standardized FQNs (problematic).
3. When `exported_program.module()` is called, call [_unlift_exported_program_lifted_states](f861dc1826/torch/export/exported_program.py (L1297)) to recover the attributes from the exported signature, where the parameter names are defined and standardized.

As a result, `named_parameters()` of the resulting module reports the overwritten name instead of the original name.

## How to reproduce the issue?
Reproducer shared by @taotaohuang001.

torch version: 2.8.0
```python
import torch
from torch import nn

# ---- Toy model with embedding weight sharing (aliasing) ----
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding_layers = nn.ModuleDict()
        tbl = nn.Embedding(100, 8)
        self.embedding_layers["ActorId"] = tbl
        # Alias: reuse the SAME module instance for another feature
        self.embedding_layers["RootActorId"] = self.embedding_layers["ActorId"]
        self.proj = nn.Linear(16, 1)

    def forward(self, feats: dict[str, torch.Tensor]):
        e1 = self.embedding_layers["ActorId"](feats["ActorId"])
        e2 = self.embedding_layers["RootActorId"](feats["RootActorId"])
        return self.proj(torch.cat([e1, e2], dim=-1))

torch.manual_seed(0)

m = Toy().eval()

# Show pre-export parameter names (canonicalized; shared weight appears once)
print("PRE-EXPORT named_parameters:")
print([name for name, _ in m.named_parameters()])

# Sanity: the two feature names point to the same weight object
w1 = m.embedding_layers["ActorId"].weight
w2 = m.embedding_layers["RootActorId"].weight
print("PRE-EXPORT alias -> same object:", w1 is w2, "| same storage:", w1.data_ptr() == w2.data_ptr())

# Example inputs (dict structure will be captured by export)
ex_in = {
    "ActorId":     torch.randint(0, 100, (4,)),
    "RootActorId": torch.randint(0, 100, (4,)),
}

# ---- Export (in memory) and materialize the runnable module ----
ep = torch.export.export(m, (ex_in,), strict=True)
gm = ep.module()  # GraphModule with new (canonical) parameter names

print("\nPOST-EXPORT named_parameters (GraphModule):")
post_names = [name for name, _ in gm.named_parameters()]
print(post_names)

# Prove alias persists after export: run fwd/bwd and check a single grad tensor exists
out = gm(ex_in).sum()
out.backward()

# Find the embedding weight in the exported module by shape (100, 8)
emb_names = [name for name, p in gm.named_parameters() if p.shape == torch.Size([100, 8])]
print("\nEmbedding param (post-export) canonical name:", emb_names[0] if emb_names else "<not found>")

# Show that only one grad exists for the shared table
for name, p in gm.named_parameters():
    if p.grad is not None and p.shape == torch.Size([100, 8]):
        print("Grad present on shared embedding weight:", name, "| grad shape:", tuple(p.grad.shape))
        break

```

The parameter names differ before and after export:
```
PRE-EXPORT named_parameters:
['embedding_layers.ActorId.weight', 'proj.weight', 'proj.bias']
PRE-EXPORT alias -> same object: True | same storage: True

POST-EXPORT named_parameters (GraphModule):
['embedding_layers.RootActorId.weight', 'proj.weight', 'proj.bias']

Embedding param (post-export) canonical name: embedding_layers.RootActorId.weight
Grad present on shared embedding weight: embedding_layers.RootActorId.weight | grad shape: (100, 8)

```
## Solution
The fix makes sure that a later-named parameter does not overwrite an entry in `param_buffer_table` when the original model's `named_parameters()` already maps a name to that parameter.
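
The first-name-wins rule can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in `torch/export/_trace.py`: `build_param_table` and `FakeParam` are hypothetical names, and only the idea of a table keyed by `id(param)` comes from the description above.

```python
class FakeParam:
    """Hypothetical stand-in for a parameter tensor; only object identity matters."""


def build_param_table(named_params):
    """Map id(param) -> name, keeping the first name seen for each object."""
    table = {}
    for name, param in named_params:
        # setdefault leaves an existing entry untouched, so a later alias
        # cannot overwrite the name that was registered first.
        table.setdefault(id(param), name)
    return table


shared = FakeParam()
proj = FakeParam()
named = [
    ("embedding_layers.ActorId.weight", shared),
    ("embedding_layers.RootActorId.weight", shared),  # alias of the same object
    ("proj.weight", proj),
]
table = build_param_table(named)
print(table[id(shared)])  # embedding_layers.ActorId.weight
```

With a plain `table[id(param)] = name` assignment instead of `setdefault`, the alias would win and the lookup would return `embedding_layers.RootActorId.weight`, which matches the behavior shown in the reproducer output.
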
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160600
Approved by: https://github.com/angelayi
2025-08-25 19:40:06 +00:00
3e210f90c2 Revert "[dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#160900)"
This reverts commit 1113e7de30da95973c1eac7921601f9a0e94f2db.

Reverted https://github.com/pytorch/pytorch/pull/160900 on behalf of https://github.com/atalman due to executorch failure ([comment](https://github.com/pytorch/pytorch/pull/160900#issuecomment-3221372096))
2025-08-25 18:56:18 +00:00
660b5656a4 Inline is_read_only_alias_match in _correct_storage_aliasing (#161285)
Drives down the overhead of return_and_correct_storage_aliasing slightly. Hopefully you'll agree it doesn't compromise readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161285
Approved by: https://github.com/wconstab
ghstack dependencies: #161231, #161234, #161235, #161240, #161284
2025-08-25 18:35:21 +00:00
0e0bb4f1fd Remove unnecessary len() call in _correct_storage_aliasing.is_read_only_alias_match (#161284)
Containers are truthy iff they're non-empty.
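
As a small illustration of that rule (a sketch, not the code touched by this commit; both helper names are hypothetical), the explicit `len()` check and the truthiness check agree for built-in containers:

```python
def has_items_explicit(items):
    # Spelled-out form: one extra call to len() and a comparison.
    return len(items) > 0


def has_items_truthy(items):
    # Built-in containers are truthy exactly when they are non-empty.
    return bool(items)


samples = ([], [0], (), (None,), {}, {"k": 1}, set(), {1}, "", "x")
for sample in samples:
    assert has_items_explicit(sample) == has_items_truthy(sample)
```

Note that `[0]` and `(None,)` are truthy even though their elements are falsy: only emptiness matters. The equivalence holds for built-in containers; a user-defined class that overrides `__bool__` may behave differently.
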
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161284
Approved by: https://github.com/Skylion007, https://github.com/wconstab
ghstack dependencies: #161231, #161234, #161235, #161240
2025-08-25 18:35:21 +00:00
b048f0e189 Improve efficiency of _python_dispatch.return_and_correct_aliasing (#161240)
get_write_alias() call count reduction explained briefly in code comment.

We don't need to check write_aliases against None in the final outs_to_return calculation because we just did that check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161240
Approved by: https://github.com/wconstab
ghstack dependencies: #161231, #161234, #161235
2025-08-25 18:35:21 +00:00
c35538d3c5 Minor cleanup of DeviceMesh.__eq__ (#161235)
`self is other` means the same thing as `id(self) == id(other)`, but it's one operator instead of 3.
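
A minimal sketch of the pattern, using a hypothetical `Mesh` class rather than the real `DeviceMesh` implementation:

```python
class Mesh:
    """Hypothetical stand-in used only to show the identity fast path."""

    def __init__(self, shape):
        self.shape = shape

    def __eq__(self, other):
        # Identity fast path: equivalent to id(self) == id(other),
        # but a single operator with no function calls.
        if self is other:
            return True
        if not isinstance(other, Mesh):
            return False
        return self.shape == other.shape

    def __hash__(self):
        return hash(self.shape)


a = Mesh((2, 4))
b = Mesh((2, 4))
assert (a is b) == (id(a) == id(b))  # both False: distinct live objects
assert (a is a) == (id(a) == id(a))  # both True
assert a == b                        # equal by value, not by identity
```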

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161235
Approved by: https://github.com/wconstab, https://github.com/zpcore, https://github.com/fduwjj
ghstack dependencies: #161231, #161234
2025-08-25 18:35:21 +00:00
cfafd98c53 Use comparison key in OpSchema to avoid duplicate work between __hash__ and __eq__ (#161234)
The performance cost of `dict` lookups keyed by `OpSchema` is a
significant minority of DTensor overhead. With this change we shave a
net ~1% off the total running time of the benchmark from #160580, as
measured by using cProfile and comparing cumulative time spent in
propagate + OpSchema's `__post_init__`. (`__post_init__` grew from
2.5% to 6.4% (+3.9%) and propagate shrank from 12.5% to 7.8% (-4.7%)).
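
The general technique can be sketched as follows. This is a simplified illustration, not the actual `OpSchema` code: the `Schema` class, its fields, and the `_comparison_key` attribute name are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # __eq__ and __hash__ are written by hand below
class Schema:
    op: str
    args: tuple
    _comparison_key: tuple = field(init=False, repr=False)

    def __post_init__(self):
        # Build the key once at construction time. This makes __post_init__
        # more expensive, but every later hash or equality check reuses it.
        self._comparison_key = (self.op, self.args)

    def __hash__(self):
        return hash(self._comparison_key)

    def __eq__(self, other):
        if not isinstance(other, Schema):
            return NotImplemented
        return self._comparison_key == other._comparison_key


cache = {Schema("add", (1, 2)): "cached result"}
print(cache[Schema("add", (1, 2))])  # cached result
```

Because `__hash__` and `__eq__` read the same key, they cannot disagree about which fields participate, and a `dict` lookup no longer rebuilds the compared data on each call.
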

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161234
Approved by: https://github.com/wconstab
ghstack dependencies: #161231
2025-08-25 18:35:21 +00:00
5d6434b132 Fix OpSchema equality check (#161231)
`__eq__` didn't compare lists of DTensorSpec, but `__hash__` did (and
it looks like attention was paid to hash, so I made comparison follow
suit).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161231
Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/zpcore
2025-08-25 18:35:21 +00:00
2f0de0ff93 [Inductor] Update Intel Triton for PyTorch 2.9. (#161050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161050
Approved by: https://github.com/anmyachev, https://github.com/EikanWang, https://github.com/jansel
2025-08-25 17:18:19 +00:00
c081481bbe [aoti-fx] Output OpOverload fallbacks (#161195)
Updates the inductor-wrapper-fxir code to use the kernel.op_overload when generating extern kernel calls. This way we can keep the IR consistent with using ATen ops.

TODO: we're also inserting torch.empty_strided calls -- need to turn this into aten too

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161195
Approved by: https://github.com/blaine-rister
2025-08-25 17:03:05 +00:00
df571ae7ad Revert "Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387)"
This reverts commit 3ea6cc8c2d443d6104159d50e8328c144f6caa39.

Reverted https://github.com/pytorch/pytorch/pull/159387 on behalf of https://github.com/jeffdaily due to breaks ROCm, AttributeError: 'torch._C._CudaDeviceProperties' object has no attribute 'shared_memory_per_block_optin' ([comment](https://github.com/pytorch/pytorch/pull/159387#issuecomment-3220989480))
2025-08-25 16:50:03 +00:00
9e1c954134 [dynamo] Pass requires_grad to nn.Parameter construction (#161364)
Fixes https://github.com/pytorch/pytorch/issues/161191

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161364
Approved by: https://github.com/Skylion007, https://github.com/StrongerXi
2025-08-25 16:49:28 +00:00
83283ce7f5 docstring_linter: Fix #151692 and other issues (#156596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156596
Approved by: https://github.com/eellison
2025-08-25 16:04:14 +00:00
ab8d60f4c8 [ROCm] Unroll loads in global_reduce (#161181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161181
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-25 15:45:49 +00:00
af3265d20f [BE][CI] fix pkg=<pin> to pkg==<pin> in pip requirement specs (#160811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160811
Approved by: https://github.com/seemethere
2025-08-25 15:31:21 +00:00
f391afe9bf [cuDNN][convolution] remove redundant conv3d 64bit test (#161177)
turns out it's the same as
```
    @onlyCUDA
    @largeTensorTest("40GB")
    @largeTensorTest("24GB", "cpu")
    @tf32_on_and_off(0.005)
    def test_conv3d_64bit_indexing(self, device):
        x = torch.rand(1, 32, 512, 512, 256)
        m = torch.nn.Conv3d(32, 1, kernel_size=1, padding=0, stride=1, bias=False)
        yref = m(x)
        y = m.to(device=device)(x.to(device=device))
        self.assertEqual(yref, y)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161177
Approved by: https://github.com/Skylion007
2025-08-25 15:01:05 +00:00
1113e7de30 [dynamo] Refactor convert_frame.compile_frame to be self contained function. [5/n] (#160900)
convert_frame.compile_frame used to take a transform callback that captured the frame object, so the frame information was never passed into compile_frame directly.

This PR changes the signature of compile_frame so that frame information is passed in directly, without a callback. This makes it easier to build a fullgraph capture API on top of compile_frame.
@exported-using-ghexport

Differential Revision: [D80469801](https://our.internmc.facebook.com/intern/diff/D80469801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160900
Approved by: https://github.com/tugsbayasgalan, https://github.com/anijain2305
2025-08-25 14:53:54 +00:00
40c0e700a4 Revert "[AMD] Fix AMD User Defined Kernel Autotune (#160671)"
This reverts commit 431846a6323c6f1d02da49e311ac694324f386f4.

Reverted https://github.com/pytorch/pytorch/pull/160671 on behalf of https://github.com/atalman due to new test is failing: inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_rocm_triton_autotuning_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/17172795679/job/48725235301) [HUD commit link](431846a632) ([comment](https://github.com/pytorch/pytorch/pull/160671#issuecomment-3220442141))
2025-08-25 14:07:48 +00:00
510825e5fe Optimize dynamo typing (#147499)
Optimize dynamo methods type annotation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147499
Approved by: https://github.com/anijain2305
2025-08-25 13:20:45 +00:00
ab7787fb82 Revert "[inductor] Windows inductor use intel-openmp. (#160258)"
This reverts commit 41673110cd7c5960824cc74a6fcaeda1a8bc7a23.

Reverted https://github.com/pytorch/pytorch/pull/160258 on behalf of https://github.com/malfet due to Reverting to fix https://github.com/pytorch/pytorch/issues/160898 and https://github.com/pytorch/pytorch/issues/160962 ([comment](https://github.com/pytorch/pytorch/pull/160258#issuecomment-3220158145))
2025-08-25 12:57:47 +00:00
1eccfb157a Revert "[BE] Remove intel-openmp dependency in setup.py (#160976)"
This reverts commit e4839470470168648dee5997f57347bb8541ea2b.

Reverted https://github.com/pytorch/pytorch/pull/160976 on behalf of https://github.com/malfet due to This PR is doing something strange ([comment](https://github.com/pytorch/pytorch/pull/160976#issuecomment-3220120462))
2025-08-25 12:46:12 +00:00
4651aaac47 Fix typo: 'complext' (#160335)
minor fix for a typo: `complext` to `complex`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160335
Approved by: https://github.com/Skylion007
2025-08-25 10:37:59 +00:00
037c43d3b2 [tgif] fix getattr_recursive with ModuleList (#161204)
Summary: This change updates `getattr_recursive`  to handle qualnames with ModuleList that contain digit indices, for example, `op_instances.1.value_model.feature_weights`
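A minimal sketch of the idea, assuming digit segments are resolved by indexing when the object has no attribute of that name. It uses plain lists and `SimpleNamespace` as stand-ins for `ModuleList` and modules, and is not the implementation from this PR.

```python
from types import SimpleNamespace


def getattr_recursive(obj, qualname):
    """Resolve a dotted qualname, indexing into containers for digit segments."""
    for part in qualname.split("."):
        if part.isdigit() and not hasattr(obj, part):
            # e.g. "1" in "op_instances.1.value_model" selects the second entry
            obj = obj[int(part)]
        else:
            obj = getattr(obj, part)
    return obj


# Stand-in object tree; a plain list plays the role of a ModuleList.
root = SimpleNamespace(
    op_instances=[
        SimpleNamespace(value_model=SimpleNamespace(feature_weights="w0")),
        SimpleNamespace(value_model=SimpleNamespace(feature_weights="w1")),
    ]
)
found = getattr_recursive(root, "op_instances.1.value_model.feature_weights")
```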

Test Plan:
TBA

Rollback Plan:

Reviewed By: jiayisuse

Differential Revision: D80503985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161204
Approved by: https://github.com/jiayisuse
2025-08-25 10:08:47 +00:00
eb5549a431 xpu: fix cpp_extension compatibility with oneapi dpc++ 2025.2 compiler (#161012)
The Intel oneAPI DPC++ compiler has changed (fixed) how it parses the `-fsycl-host-compiler-options` option with respect to arguments containing escaped quotes. This commit adds branches that depend on the compiler version.

Fixes: https://github.com/intel/torch-xpu-ops/issues/1938

CC: @chuanqi129 @EikanWang @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161012
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-25 09:29:53 +00:00
56ebed627a [OpenReg] Add OSX/Windows Support for OpenReg (#159441)
As the title stated.

**Changes:**

- Abstract platform-specific APIs
- Add OSX/Windows support
- Set default symbol visibility to "hidden"

Co-authored-by: @can-gaa-hou

Original PR:https://github.com/pytorch/pytorch/pull/159029
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159441
Approved by: https://github.com/albanD

Co-authored-by: jiahaochen666 <jiahaochen535@gmail.com>
2025-08-25 08:03:27 +00:00
80df27a612 port distributed pipeline test files for Intel GPU (#159033)
This PR ports all distributed pipeline test files to Intel GPU.
We enable Intel GPU with the following methods, keeping the original code style as much as possible:

1. instantiate_device_type_tests()
2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend
3. use "requires_accelerator_dist_backend()" to replace requires_nccl()
4. use "get_default_backend_for_device()" to get backend
5. enable XPU for some test paths

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159033
Approved by: https://github.com/guangyey, https://github.com/kwen2501
2025-08-25 05:24:27 +00:00
e3d68dfae2 [DTensor] Make default RNG semantics match user-passed generator (#160482)
Previously, DTensor kept its own copy of the generator state after the
first time a random operator was called on a DTensor. This copy would
evolve independently from the generator outside of DTensor.

After adding support for users to pass a specific generator into
random operators (e.g. `uniform_(..., generator=)`), it was determined
(in discussion on #159991) to change the semantics so that any random
operations performed on DTensor would evolve the state of the publicly
visible generators (either the default one or user-passed one).

The upsides are (1) it is now possible to call torch.manual_seed() at
any point in the program and have a consistent effect on DTensor, (2)
DTensor ops have an observable effect on the generator.  The downside is
that users are now responsible for seeding their generator before using
DTensor, ensuring all ranks use the same seed.

Fixes #159991

confirmed docs rendered OK

<img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160482
Approved by: https://github.com/wanchaol
2025-08-25 04:21:19 +00:00
726dce3c94 [nccl symm mem] don't use arg for mempool, correctly use symmetric registration in hooks (#161238)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161238
Approved by: https://github.com/kwen2501, https://github.com/syed-ahmed
2025-08-25 03:09:32 +00:00
74280d0913 [muon] Introduce Muon optimizer to PyTorch (#160213)
A single-device version of Muon. The algorithm follows Keller Jordan's [Muon blogpost](https://kellerjordan.github.io/posts/muon/), and optionally incorporates [Moonshot's](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf) learning rate adjustment strategy.

This implementation maintains a minimalist API and is consistent with other optimizer conventions. The PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the [PyTorch examples](https://github.com/pytorch/examples) directory.

**Usage**

```python
    model = MyModelForCausalLM()
    # filter out your params manually
    muon_params = [...]
    adamw_params = [...]
    muon = Muon(
        params=muon_params,
        lr=lr,
        weight_decay=wd,
    )
    adamw = AdamW(
        params=adamw_params,
        lr=lr,
        weight_decay=wd,
    )

    # in training loop
    loss = model(input)
    loss.backward()
    muon.step()
    adamw.step()
    muon.zero_grad()
    adamw.zero_grad()
```

~~**Additional usage**~~
~~Users are also able to pass in self-defined `msign` function for orthogonalization, and learning rate adjustment function. Interface defined below:~~

```python
~~AdjustLrFn: TypeAlias = Callable[[float, torch.Size], float]~~
~~MsignFn: TypeAlias = Callable[[Tensor, BaseMsignFnConfig], Tensor]~~
```

As discussed with the team and in the comments, we prefer a simpler and cleaner interface, so we removed the callback interface and canonicalized the original NS algorithm for Muon. The only configs available to users are `ns_steps`, `coefficients`, and `eps`, configurable through kwargs.

By default, we use 5-step Newton-Schulz, with coefficients proposed by [Keller](https://kellerjordan.github.io/posts/muon/). We use LR adjustment proposed by [Moonshot](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf), which grafts learning rate from AdamW.
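For readers unfamiliar with the iteration, here is a dependency-free sketch of a 5-step quintic Newton-Schulz orthogonalization, assuming the coefficients and update rule from Keller Jordan's post. It works on small lists of rows for illustration only and is not the PyTorch implementation; the function names are made up for this example.

```python
import math

# Quintic coefficients as given in Keller Jordan's Muon post.
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D matrix given as a list of rows."""
    if not g or not isinstance(g[0], list):
        raise ValueError("expected a 2D matrix")
    tall = len(g) > len(g[0])
    # Work on the wide orientation so that X X^T is the smaller Gram matrix.
    x = transpose(g) if tall else [row[:] for row in g]
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        a = matmul(x, transpose(x))  # A = X X^T
        aa = matmul(a, a)            # A @ A
        poly = [
            [B_COEF * p + C_COEF * q for p, q in zip(ra, raa)]
            for ra, raa in zip(a, aa)
        ]
        px = matmul(poly, x)
        # X <- a X + (b A + c A A) X
        x = [[A_COEF * u + w for u, w in zip(rx, rp)] for rx, rp in zip(x, px)]
    return transpose(x) if tall else x


out = newton_schulz([[3.0, 0.0], [0.0, 1.0]])
```

With these coefficients the singular values are pushed into a band around 1 rather than exactly to 1, which is why the result is only approximately orthogonal.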

**Testing**

~~1. Unit tests: the newly introduced Muon is covered in `test/test_optim.py`. We updated the test cases to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.~~

As discussed, in order not to complicate the codebase, we prefer not to include a reference implementation in PyTorch. We also updated the interface so we don't need to test the FQN-based filtering. Muon is covered by the existing `test_optim.py` unit test.

2. End-to-end test: we added a training script that pre-trains a QWEN-like model on the `openwebtext-100k` dataset. We trained for one epoch and compared the resulting loss curve against the Moonshot implementation to confirm behavioral consistency.
<img width="1102" height="472" alt="Screenshot 2025-07-29 at 1 04 12 AM" src="https://github.com/user-attachments/assets/ceab0733-497d-4070-8032-02ae7995c64c" />

**Numerics**
We compare our implementation against an existing implementation to confirm numerical consistency.

As discussed, our implementation closely follows the algorithm described in [Keller's post](https://kellerjordan.github.io/posts/muon/), while incorporating the learning rate adjustment from [Moonlight](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf). This captures a key insight that allows users to reuse hyper-parameters tuned for `adamW`, making Muon a drop-in swap.
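A rough sketch of the learning rate adjustment, assuming the `0.2 * sqrt(max(rows, cols))` scaling described in the Moonlight report. The function name `adjust_lr_match_rms` is hypothetical and does not refer to a PyTorch API.

```python
import math


def adjust_lr_match_rms(lr, shape):
    """Scale lr for a 2D parameter (assumed Moonlight rule: 0.2 * sqrt(max dim))."""
    if len(shape) != 2:
        raise ValueError("Muon expects 2D parameters")
    rows, cols = shape
    return lr * 0.2 * math.sqrt(max(rows, cols))


# For the 10x10 Linear layer used in the numerics comparison in this description.
adjusted = adjust_lr_match_rms(1e-3, (10, 10))
```

Under this assumption the per-parameter step size depends on the matrix shape, so a reference implementation that skips the adjustment will differ numerically.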

As expected, the numerical difference mainly comes from `adjust_lr`: a maximum of ~5% relative difference in the example unit test setup below.

```python
    # dummy model and data
    model0 = Linear(10, 10, bias=False)
    model1 = copy.deepcopy(model0)
    inputs = torch.randn(8, 10)
    targets = torch.randn(8, 10)
    loss = MSELoss()

    lr = 1e-3
    wd = 0.1
    momentum = 0.95

    opt_ref_muon = KellySingleDeviceMuon(
        params=model0.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
    )

    opt_exp_muon = Muon(
        params=model1.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
    )

    out_ref = model0(inputs)
    loss_ref = loss(out_ref, targets)
    opt_ref_muon.zero_grad()
    loss_ref.backward()
    opt_ref_muon.step()

    out_exp = model1(inputs)
    loss_exp = loss(out_exp, targets)
    opt_exp_muon.zero_grad()
    loss_exp.backward()
    opt_exp_muon.step()

    for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
        torch.testing.assert_close(p_ref, p_exp)
```

As explained above, including `adjust_lr` is preferable. This is validated by e2e training runs on a qwen-2-like 0.5b model, where the curves show that training with `adjust_lr` converges more effectively than without.
<img width="1179" height="464" alt="Screenshot 2025-08-18 at 10 12 33 AM" src="https://github.com/user-attachments/assets/e797d3da-c2f0-4187-b99e-5d48b7437c3c" />

**Performance**
Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP:

- adamw_ddp finishes in 13.12 min
- pytorch_muon_ddp finishes in 13.45 min

Muon runs ~20s slower compared to AdamW. Assuming no other changes, Muon is *2.5%* slower than AdamW.
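The arithmetic behind these two figures, using only the wall-clock numbers reported above:

```python
adamw_min = 13.12  # adamw_ddp wall time in minutes
muon_min = 13.45   # pytorch_muon_ddp wall time in minutes

extra_seconds = (muon_min - adamw_min) * 60
slowdown_pct = (muon_min - adamw_min) / adamw_min * 100

print(f"extra wall time: {extra_seconds:.1f} s")
print(f"relative slowdown: {slowdown_pct:.1f} %")
```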

AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms
<img width="726" height="590" alt="Screenshot 2025-07-29 at 1 56 14 AM" src="https://github.com/user-attachments/assets/ebcd7e1c-d129-4b20-9396-39f568edf03d" />

Muon: Optimizer.step() takes ~54 ms, step time ~960 ms
<img width="751" height="597" alt="Screenshot 2025-07-29 at 2 02 20 AM" src="https://github.com/user-attachments/assets/72f5b904-ebd5-4502-a6ff-d3e9e5a6da81" />

**Note**
We restrict the implementation to accept only 2D parameters.

An alternative approach is to allow parameters with more than two dimensions and apply orthogonalization over the last two dimensions. We opt not to go with this approach as it can be error-prone. For example, with a kernel shaped `[in_channel, height, width, out_channel]`, applying orthogonalization to the last two dimensions is not meaningful.

Since Muon is designed to orthogonalize 2D matrices, preserving this assumption keeps the implementation clean and sound.

**Next Steps**

1. Add `MuP`
2. Open-source optimized triton kernel for symmetric matmul. A preliminary benchmark found 1.23x - 1.48x speedup on small - large (n = 256 -> 16384) matrices.
3. Open-source unsharded Muon co-designed with FSDP2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160213
Approved by: https://github.com/janeyx99
2025-08-24 08:03:04 +00:00
1de4540449 Use -compress-mode=size for CUDA 13 build for binary size reduction (#161316)
https://github.com/pytorch/pytorch/issues/159779

CUDA 13 added support for the nvcc --compress-mode flag across all drivers of CUDA 13.X toolkits, making it possible to use --compress-mode=size for a significant size reduction (~71% less for CUDA Math APIs, for example). https://developer.nvidia.com/blog/whats-new-and-important-in-cuda-toolkit-13-0/

Why this is added for CUDA 13 only, quoting @ptrblck: Any usage of --compress-mode=size/balance will drop the support of older CUDA drivers and will bump the min. driver requirement to CUDA 12.4. https://github.com/pytorch/pytorch/pull/157791#issuecomment-3058027353

The default for CUDA 13 will be --compress-mode=balance, which gives smaller binaries than the LZ4 speed mode used in previous CUDA versions.
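The version gating described above can be sketched as follows. The helper `nvcc_compress_flags` is a hypothetical illustration of the rule "only for CUDA 13 and newer", not the build logic changed in this PR.

```python
def nvcc_compress_flags(cuda_version: str) -> list:
    """Return extra nvcc flags for a toolkit version string such as '13.0'."""
    major = int(cuda_version.split(".")[0])
    # Only opt in for CUDA 13+, where the driver requirement is already met.
    return ["--compress-mode=size"] if major >= 13 else []


flags_13 = nvcc_compress_flags("13.0")
flags_12 = nvcc_compress_flags("12.8")
```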

Related - https://github.com/pytorch/pytorch/pull/157791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161316
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2025-08-24 03:28:29 +00:00
3e5b021f21 [ATen][CPU][Sparse] Use Third-Party Eigen for sparse add and addmm (#155357)
This pull request adds the following ops for sparse matrices using Eigen library:
```python
    add(a_csr, b_csr)
    add(a_csc, b_csc)

    addmm(c_csr, a_csr, b_csr)
    addmm(c_csr, a_csr, b_csc)
    addmm(c_csr, a_csc, b_csc)
    addmm(c_csr, a_csc, b_csr)

    addmm(c_csc, a_csr, b_csr)
    addmm(c_csc, a_csr, b_csc)
    addmm(c_csc, a_csc, b_csc)
    addmm(c_csc, a_csc, b_csr)
```

Currently, the operations for sparse matrices on CPU are available through MKL only. Because MKL does not exist on `aarch64`, these ops are unavailable on any machine with an ARM-based CPU, including Apple Silicon, AWS Graviton and NVIDIA Grace. This PR addresses this issue by using Eigen as a backend for the above ops.

This is a refactored version of my previous PR #101814. The main difference from the old one is that this does not enable Eigen by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155357
Approved by: https://github.com/pearu, https://github.com/eqy

Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
2025-08-23 19:03:55 +00:00
4acdbb8311 [MPS] Fix index_copy for strided indices (#161333)
By passing strides to strided variant of the tensor

Fixes https://github.com/pytorch/pytorch/issues/160993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161333
Approved by: https://github.com/huydhn, https://github.com/wdvr
ghstack dependencies: #161206, #161267
2025-08-23 14:38:57 +00:00
f912c93344 Revert "Move non inductor workflows to Python 3.9 -> 3.10 (#161182)"
This reverts commit e20f6d798606f3245686e950c43635bbe526232d.

Reverted https://github.com/pytorch/pytorch/pull/161182 on behalf of https://github.com/zou3519 due to broke dynamo_wrapped tests, those are a bit finicky to fix (there is probably more than one failure!) ([comment](https://github.com/pytorch/pytorch/pull/161182#issuecomment-3216953097))
2025-08-23 13:00:42 +00:00
33346b5814 Support NUMA Binding for Callable Entrypoints, Take 2 (#161183)
# Context
In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize `numactl` CLI. See #160006 for a full description of the challenges.

Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.

# This PR
On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](https://github.com/pytorch/pytorch/issues/160006#issuecomment-3162018836)

Specifically, for each local rank `i`:
1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally.

There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that can work for *both* `str` and `Callable`, and it's pretty simple.

This required a bit of refactoring:
1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`.
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
4. Use the context manager for both `str` and `Callable` paths
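
The set-spawn-restore pattern above can be sketched as below, assuming Linux (where `os.sched_setaffinity` is available) and relying on child processes inheriting the parent's CPU affinity. The names `temporary_cpu_affinity` and `spawn_with_bindings` are illustrative and are not the helpers added in this PR.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_cpu_affinity(cpus):
    """Bind the calling process to `cpus`, then restore the previous affinity."""
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)
    try:
        yield
    finally:
        os.sched_setaffinity(0, original)


def spawn_with_bindings(cpus_per_rank, spawn):
    """Call spawn(rank) for each local rank while the parent is bound to its CPUs."""
    children = []
    for rank, cpus in enumerate(cpus_per_rank):
        with temporary_cpu_affinity(cpus):
            # A subprocess started here inherits the affinity set above.
            children.append(spawn(rank))
    return children
```

Because the binding is applied in the parent rather than by wrapping a command line, the same code path serves both `str` and `Callable` entrypoints: `spawn` can be any function that starts the subprocess for that rank.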

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TLDR tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161183
Approved by: https://github.com/d4l3k
2025-08-23 07:23:22 +00:00
431846a632 [AMD] Fix AMD User Defined Kernel Autotune (#160671)
Summary: AMD-specific kwargs need to be removed from the guard; otherwise a KeyError will be raised when executing the kernel.

Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
can succeed after this change.

Rollback Plan:

Differential Revision: D80285441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160671
Approved by: https://github.com/muchulee8
2025-08-23 07:23:09 +00:00
cd31be28ec Reland D80238201: [Torch.Export] Add flat arg paths in error message (#160919)
Summary:
[The diff was reverted due to CLA error, in the process of retrieving account]
Previous error message
```
RuntimeError: Expected input at *args.<unknown location>.shape[0] to be equal to 4096, but got 7680. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC)
```
New error message
```
RuntimeError: Expected input at *args.[0].supervision_input.weight.shape[0] to be equal to 4096, but got 7680. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC)
```
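
To show where a path such as `[0].supervision_input.weight` can come from, here is a small sketch that walks nested inputs and records a path for every leaf. It uses plain tuples and dicts as stand-ins; the helper names are hypothetical and this is not the torch.export implementation.

```python
def flat_arg_paths(value, prefix=""):
    """Yield (path, leaf) pairs for nested tuples, lists and dicts."""
    if isinstance(value, (list, tuple)):
        for i, item in enumerate(value):
            yield from flat_arg_paths(item, f"{prefix}[{i}]")
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from flat_arg_paths(item, f"{prefix}.{key}" if prefix else str(key))
    else:
        yield prefix, value


def shape_mismatch_message(path, dim, expected, actual):
    """Format a message in the style of the new error shown above."""
    return (
        f"Expected input at *args.{path}.shape[{dim}] "
        f"to be equal to {expected}, but got {actual}."
    )


args = ({"supervision_input": {"weight": "W"}}, 3)
paths = dict(flat_arg_paths(args))
message = shape_mismatch_message("[0].supervision_input.weight", 0, 4096, 7680)
```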

Test Plan:
```
buck test mode/opt apf/rec/ir/tests:ir_export_deserialize_test
```
https://www.internalfb.com/intern/testinfra/testrun/4785074906254375

```
buck run mode/opt caffe2/test:test_export -- -r unflatten
```

```
Ran 413 tests in 208.414s

OK (skipped=1, expected failures=13)
```

Rollback Plan:

Differential Revision: D80487367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160919
Approved by: https://github.com/angelayi
2025-08-23 07:20:58 +00:00
710514a2a5 Revert "Enable output padding when only outermost dim is dynamic (#159404)"
This reverts commit f15ada5c6fad97a7dcbfa4673f067b6942dda640.

Reverted https://github.com/pytorch/pytorch/pull/159404 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/159404#issuecomment-3216517032))
2025-08-23 07:17:30 +00:00
22df59efc0 [inductor] add MSVC language pack check. (#161298)
Check MSVC's language pack: https://github.com/pytorch/pytorch/issues/157673#issuecomment-3051682766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161298
Approved by: https://github.com/angelayi
2025-08-23 07:06:48 +00:00
3a4140bf8e [FlexAttention] fixing learnable bias assertion error in inductor (#161170)
Users encountered unexpected behaviour when using FlexAttention with learnable biases, including assertion errors (#157677)

We traced the root cause to the registration of subgraph buffers: this caused inconsistencies in the naming and ultimately incorrect retrieval later on. This problem only arose if the model was compiled as a whole (i.e., using @torch.compile), since only then would there be naming conflicts.

In this PR, we register the buffers with the base graph to solve this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161170
Approved by: https://github.com/drisspg
2025-08-23 06:24:22 +00:00
6443ea337d enable more tests (#161192)
Enable more vLLM tests against PyTorch main, and add a schedule to run the tests every 12 hours.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161192
Approved by: https://github.com/huydhn
2025-08-23 06:01:22 +00:00
36ac916929 [ONNX] Fix lower opset version support in dynamo=True (#161056)
After we switched to constructing the registry with the specified opset version in dynamo=True, support for opset<18 was broken because there would be no torchlib ops registered for these opsets. I updated the registry creation logic to always use opset 18 if the requested opset is lower, and use the version converter (as designed) to target those opsets.
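
The selection rule can be sketched as follows. `plan_export` and `MIN_TORCHLIB_OPSET` are hypothetical names used only to illustrate the logic described above; they are not part of the exporter.

```python
MIN_TORCHLIB_OPSET = 18  # lowest opset with torchlib ops registered, per the description


def plan_export(requested_opset):
    """Return (registry_opset, needs_version_conversion) for a requested opset."""
    registry_opset = max(requested_opset, MIN_TORCHLIB_OPSET)
    # Lower targets are built at opset 18 and then down-converted afterwards.
    needs_conversion = registry_opset != requested_opset
    return registry_opset, needs_conversion


low = plan_export(17)
high = plan_export(20)
```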

This requires onnxscript>=0.4 (https://github.com/pytorch/pytorch/pull/161312)

Fixes https://github.com/onnx/onnx/issues/7235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161056
Approved by: https://github.com/titaiwangms
2025-08-23 05:04:36 +00:00
7131bfab89 [vllm hash update] update the pinned vllm hash (#161227)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161227
Approved by: https://github.com/pytorchbot
2025-08-23 04:25:16 +00:00
ac8d9418ae [audio hash update] update the pinned audio hash (#161331)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161331
Approved by: https://github.com/pytorchbot
2025-08-23 04:21:03 +00:00
38a492d40d [ONNX] Remove unused _onnx_supported_ops (#161322)
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161322
Approved by: https://github.com/titaiwangms
2025-08-23 02:42:25 +00:00
394728bab2 [MPS] Update avg_pool3d kernel to use opmath_t (#161071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161071
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #161011
2025-08-23 02:36:22 +00:00
121afd6a8f [MPS] Update avg_pool2d to use Metal kernel when ceil_mode=True (#161011)
Fixes #160743

The MPS impl of `avg_pool2d` seems to only give incorrect results when `ceil_mode=True`. I wrote a performance measurement script (0ee6e58643/avg_pool_mps/perf_2d.py) which tests a bunch of different cases and also marks the cases where MPS and CPU results do not match.

I found that if I update `avg_pool2d` to use the new Metal kernel in all cases, that fixes all the mismatches, but it also decreases performance for some of the `ceil_mode=False` cases. So I opted to only run the new Metal kernel when  `ceil_mode=True`, which does not significantly decrease performance in any of the cases tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161011
Approved by: https://github.com/malfet
2025-08-23 02:36:22 +00:00
d228a776e9 [Inductor-FX] Support Tensorbox outputs (#161245)
# Problem
The FX converter previously supported graph outputs which were `StorageBox`, but not `TensorBox`. The latter seems to show up in certain cases when the output is a slice/view of the input.

# Fix
This PR generalizes the code to handle `MutableBox` instead of `StorageBox` specifically.

# Test
Added a CI test exposing the issue. The test case was found by intentionally breaking `TensorBox(ReinterpretView` support in https://github.com/pytorch/pytorch/pull/161258.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161245
Approved by: https://github.com/angelayi
2025-08-23 02:04:13 +00:00
cee72119b2 [Test] Adding a testcase for constant_pad_nd (#161259)
Fixes #161066

This PR adds a simple testcase for constant_pad_nd on MPS as mentioned in https://github.com/pytorch/pytorch/pull/161149#issuecomment-3211701274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161259
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-23 01:00:50 +00:00
47d267364c Revert "[SymmMem] Support rendezvous on slice of a tensor (#160825)"
This reverts commit 9d9cc9897ac44a1a8df38211b03d8342a8af48c3.

Reverted https://github.com/pytorch/pytorch/pull/160825 on behalf of https://github.com/kwen2501 due to Change of course; use storage_ptr as key ([comment](https://github.com/pytorch/pytorch/pull/160825#issuecomment-3215951048))
2025-08-22 23:41:55 +00:00
0d9da384ef Bump onnxscript to 0.4.0 in CI (#161312)
Use onnxscript apis for torch 2.9.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161312
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2025-08-22 23:23:08 +00:00
f521e82a4e Update pyrefly config for better codenav (#161200)
This fixes behavior in codenav by switching from `replace_imports_with_any` to `ignore-missing-imports`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161200
Approved by: https://github.com/aorenste, https://github.com/albanD
2025-08-22 23:05:07 +00:00
bcfe1b2d71 Add initial bc-linter configuration (#161319)
Preparation for https://github.com/pytorch/test-infra/pull/7016

Currently merging this PR is a noop change for PyTorch repo (bc-linter is not looking at the config yet).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161319
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi
2025-08-22 22:54:25 +00:00
419a2dbf5f [ONNX] Remove enable_fake_mode and exporter_legacy (#161222)
Remove enable_fake_mode and exporter_legacy entirely. Even though this is bc breaking, `enable_fake_mode` is no longer compatible with the latest version of transformers, and so it is no longer useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161222
Approved by: https://github.com/titaiwangms
2025-08-22 22:15:27 +00:00
3373b074f5 [Profiler] Add GC Events to Python Stack Tracer (#161209)
Summary:
Adds Python garbage collection events to Kineto traces and profiler FunctionEvents. We create a custom C++ callback in profiler_python.cpp, then define a Python function from C++ and register that callback for all Python garbage collection. We don't worry about thread safety in this case because we only do init/teardown on the main thread while holding the GIL.

Currently this is hidden behind an experimental config because Python tracing tends to be unstable, especially when adding any new feature. If this is found not to add too much overhead, we can turn it on by default. NOTE: To enable this you need both with_stack=True and the experimental config on!
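
For context, the Python-level hook for observing collections is `gc.callbacks`. The sketch below records start/stop events with it; it only illustrates the mechanism and is not the C++ registration done in profiler_python.cpp.

```python
import gc
import time

events = []


def record_gc(phase, info):
    # phase is "start" or "stop"; info carries the generation being collected.
    events.append((phase, info["generation"], time.perf_counter_ns()))


gc.callbacks.append(record_gc)
try:
    gc.collect()  # force a collection so that at least one start/stop pair is recorded
finally:
    gc.callbacks.remove(record_gc)

phases = [phase for phase, _, _ in events]
```

Pairing each "start" with the following "stop" gives the duration of a collection, which is the kind of span a trace viewer would display.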

Test Plan:
Ran trace with GC induced and saw it on trace

Also added a test

Rollback Plan:

Differential Revision: D80491146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161209
Approved by: https://github.com/ngimel
2025-08-22 22:11:25 +00:00
c8bb0e4720 [MPS] Fix index_copy for scalars (#161267)
By `squeezing the input` when copying into scalar tensor from a 1d one
And enable `test_index_copy_scalars_mps`

Fixes https://github.com/pytorch/pytorch/issues/160737
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161267
Approved by: https://github.com/manuelcandales, https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #161206
2025-08-22 21:45:34 +00:00
4c36c8a994 [dynamo] Support method calls on complex ConstantVariables (#161122)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161122
Approved by: https://github.com/mlazos, https://github.com/guilhermeleobas
2025-08-22 21:40:03 +00:00
9d882fd9ff [benchmark] Add torchscript jit.trace to benchmark option (#161223)
For comparing NativeRT and TorchScript, we add `torchscript-jit-trace` as an option in the benchmark. With this option, we can trace a model and run inference with the traced module using the TorchScript interpreter.

```
python ./benchmarks/dynamo/huggingface.py --performance --inference --torchscript-jit-trace

python ./benchmarks/dynamo/timm_models.py --performance --inference --torchscript-jit-trace

python ./benchmarks/dynamo/torchbench.py --performance --inference --torchscript-jit-trace
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161223
Approved by: https://github.com/huydhn
2025-08-22 21:38:28 +00:00
2835cc5e91 [cuDNN] head dim > 128 works on H100 again in cuDNN SDPA? (#161210)
reference: https://github.com/pytorch/torchtitan/pull/1610

9.10 only for now, we would want to hold off on upgrading to either cuDNN frontend 1.14+/cuDNN 9.11+ due to some head-dim > 128 handling issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161210
Approved by: https://github.com/Skylion007
2025-08-22 21:21:53 +00:00
3f1a97a99c Revert "[dynamic shapes] unbacked-safe slicing (#157944)"
This reverts commit 44549c7146bd6c4166f97e856037babe1b7f4f49.

Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/pianpwk due to this PR & internal diff landed out of sync, just reverted internal with D80720654, will revert this & reland as codev ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3215610135))
2025-08-22 20:48:46 +00:00
981ac533c6 Revert "Close some sources of fake tensor leakages (#159923)"
This reverts commit 5afa4187dfe1e99278f8e372ec09102d5b937572.

Reverted https://github.com/pytorch/pytorch/pull/159923 on behalf of https://github.com/zou3519 due to broke aoti test in inductor periodic ([comment](https://github.com/pytorch/pytorch/pull/159923#issuecomment-3215580688))
2025-08-22 20:42:50 +00:00
3ea6cc8c2d Fix conv exhaustive autotuning and expand Exhaustive test coverage (#159387)
Conv exhaustive autotuning currently throws an error, and I think it's worth adding tests for the other ops too in order to prevent regressions in exhaustive mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159387
Approved by: https://github.com/coconutruben
2025-08-22 20:06:09 +00:00
2c0650a00a Revert "[BE][inductor] tl.dot(..., allow_tf32=...) -> tl.dot(..., input_precision=...) (#160711)"
This reverts commit 8dbe7f99bd707ee28ae12ecb9cab54e1785bf13e.

Reverted https://github.com/pytorch/pytorch/pull/160711 on behalf of https://github.com/davidberard98 due to internal failure - T235384144 - I'll revert while I investigate. ([comment](https://github.com/pytorch/pytorch/pull/160711#issuecomment-3215343200))
2025-08-22 19:10:35 +00:00
eba1ad09e4 Revert "[SymmMem] Support rendezvous on view of a tensor (#160925)"
This reverts commit 9d7cecdd6c44c5421d341bcc359be4097ea9a2f5.

Reverted https://github.com/pytorch/pytorch/pull/160925 on behalf of https://github.com/kwen2501 due to Change of course: use storage ptr as symm mem keys as in the old days and force no_split in MemPool ([comment](https://github.com/pytorch/pytorch/pull/160925#issuecomment-3215315717))
2025-08-22 18:59:25 +00:00
a43480d19c [CD] Enable triton xpu Windows build for Python 3.14 (#161255)
Follow #159869
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161255
Approved by: https://github.com/atalman
2025-08-22 18:39:31 +00:00
17b0263e86 [inductor] fix march=native pass to Windows CC. (#161264)
fix march=native pass to Windows CC.

<img width="593" height="218" alt="image" src="https://github.com/user-attachments/assets/1caedffa-d9be-43d9-9ce2-590c055980cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161264
Approved by: https://github.com/angelayi
2025-08-22 18:38:51 +00:00
97200c9711 [inductor] Add get page_size support for Windows. (#161273)
`resource` does not work on Windows because it is a Unix-specific module, as seen in https://docs.python.org/2/library/resource.html

Use Windows system API to get page_size.
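
A minimal sketch of the idea, not the code from this PR: the helper name `get_page_size` and the `SystemInfo` class are hypothetical, and the Windows branch assumes the documented Win32 `GetSystemInfo` call and `SYSTEM_INFO` field layout.

```python
import ctypes
import sys


def get_page_size() -> int:
    """Return the OS memory page size in bytes (hypothetical helper)."""
    if sys.platform == "win32":
        # `resource` does not exist on Windows, so ask kernel32 directly.
        class SystemInfo(ctypes.Structure):
            # Field order mirrors the Win32 SYSTEM_INFO struct.
            _fields_ = [
                ("wProcessorArchitecture", ctypes.c_uint16),
                ("wReserved", ctypes.c_uint16),
                ("dwPageSize", ctypes.c_uint32),
                ("lpMinimumApplicationAddress", ctypes.c_void_p),
                ("lpMaximumApplicationAddress", ctypes.c_void_p),
                ("dwActiveProcessorMask", ctypes.c_size_t),
                ("dwNumberOfProcessors", ctypes.c_uint32),
                ("dwProcessorType", ctypes.c_uint32),
                ("dwAllocationGranularity", ctypes.c_uint32),
                ("wProcessorLevel", ctypes.c_uint16),
                ("wProcessorRevision", ctypes.c_uint16),
            ]

        info = SystemInfo()
        ctypes.windll.kernel32.GetSystemInfo(ctypes.byref(info))
        return int(info.dwPageSize)

    # Unix-only module, imported lazily so the file still loads on Windows.
    import resource

    return resource.getpagesize()


page_size = get_page_size()
```

Importing `resource` lazily keeps the module importable on Windows, which is the failure mode described above.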

Tested locally:
<img width="467" height="433" alt="image" src="https://github.com/user-attachments/assets/47a39060-3aea-46c3-bd8e-35a39413c51f" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161273
Approved by: https://github.com/angelayi
2025-08-22 18:36:14 +00:00
1d458e2947 Revert "[Inductor] Update Outer Reduction Heuristic (#159093)"
This reverts commit f085f299584b06a2a7d8855eda2a411313e782ad.

Reverted https://github.com/pytorch/pytorch/pull/159093 on behalf of https://github.com/seemethere due to this fails internal tests, see D80630416 for more info ([comment](https://github.com/pytorch/pytorch/pull/159093#issuecomment-3215263317))
2025-08-22 18:35:36 +00:00
266784ec6a remove old while_loop_schema_gen test (#161202)
Fixes https://github.com/pytorch/pytorch/issues/141202.

This test is flaky for unknown reasons, and we now have a new way of creating schemas for HOPs, so delete the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161202
Approved by: https://github.com/zou3519
2025-08-22 18:22:29 +00:00
25df65afd8 [ROCm] revamp HIPCachingAllocatorMasqueradingAsCUDA (#161221)
HIPAllocatorMasqueradingAsCUDA and HIPCachingAllocatorMasqueradingAsCUDA are now proper complete wrappers of HIPAllocator and HIPCachingAllocator, respectively. HIPAllocatorMasqueradingAsCUDA now subclasses HIPAllocator instead of Allocator. This fixes usability of hipify replacing c10::cuda::CUDACachingAllocator::get(), where callers expect a CUDAAllocator to be returned but were getting a very thin Allocator shim instead.

This also fixes using cudagraph trees with torch compile. The hip:0 device was not being replaced by the cuda:0 device in all methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161221
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-22 18:13:12 +00:00
e20f6d7986 Move non inductor workflows to Python 3.9 -> 3.10 (#161182)
Related to: https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161182
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-08-22 16:48:43 +00:00
c2390087c3 [MPS] Fix index_select for scalar_types (#161206)
By copy-n-pasting logic from `index_select_out_cpu` (and `_cuda`), where essentially the resizing is done inside the op,  which also fixes faulty logic for scalars
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161206
Approved by: https://github.com/manuelcandales
2025-08-22 16:45:35 +00:00
f09458c2e1 Enable test/test_numpy_interop.py config in mypy (#158556)
## Test Result

```bash
lintrunner --take MYPY test/test_numpy_interop.py

Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file.
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158556
Approved by: https://github.com/soulitzer
2025-08-22 16:18:58 +00:00
7fcdd8d6af Use ROCm MI325 runners for trunk.yml (#161184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161184
Approved by: https://github.com/jeffdaily
2025-08-22 16:18:55 +00:00
c7a77470c5 Revert "[DTensor] Make default RNG semantics match user-passed generator (#160482)"
This reverts commit d1faf2ef0476eb60b42c057baee9af0f48ae849a.

Reverted https://github.com/pytorch/pytorch/pull/160482 on behalf of https://github.com/jeffdaily due to failing cuda and rocm jobs ([comment](https://github.com/pytorch/pytorch/pull/160482#issuecomment-3214694297))
2025-08-22 15:04:28 +00:00
ce467df5d1 rm platform args xplat/langtech/mobile/BUCK (#161018)
Differential Revision: D80460691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161018
Approved by: https://github.com/drisspg
2025-08-22 14:47:36 +00:00
db44de4c0d [inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113)
1. Applying @eellison idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for estimate_peak_memory:
```
    """
    Alternative version of estimate_peak_memory, that respects the fact,
    that every SchedulerNode has multiple phases:
    1. alloc ( outputs )
    2. run_kernel
    3. dealloc last_use buffers
    estimate_peak_memory collapses memory into one value: size_alloc - size_free
    While peak memory happens after alloc.

    Duplicating the code to not migrate all callsites at once,
    In future usages of estimate_peak_memory will migrate to this version.
    """
```

- Applying this in `reorder_communication_preserving_peak_memory` pass.

2. Buffers can change their deallocation point during reordering if the candidate and the group to swap are both users of the f_input_buf and the group contains last_use_snode.

- Addressing this by tracking the last_use_snode for each buffer and recomputing current memory with respect to the change in size_free (after reordering, group_node is no longer the last user of the buffer, so its size_free -= buffer_size, while candidate becomes the last user, so candidate.size_free += buffer_size).

3. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` for ablation, to limit the number of collectives to reorder.
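
The difference between the two estimators described in item 1 can be sketched as follows. This is an illustration under simplifying assumptions, not the inductor implementation: the `Node` class and both function names are hypothetical, and each node is reduced to two byte counts.

```python
from dataclasses import dataclass


@dataclass
class Node:
    size_alloc: int  # bytes allocated for outputs before the kernel runs
    size_free: int   # bytes of last-use buffers released after the kernel


def peak_collapsed(nodes):
    """Peak when each node contributes only size_alloc - size_free."""
    cur = peak = 0
    for n in nodes:
        cur += n.size_alloc - n.size_free
        peak = max(peak, cur)
    return peak


def peak_alloc_free(nodes):
    """Peak sampled after the alloc phase and before the dealloc phase."""
    cur = peak = 0
    for n in nodes:
        cur += n.size_alloc      # phase 1: alloc outputs
        peak = max(peak, cur)    # phase 2: kernel runs with everything live
        cur -= n.size_free       # phase 3: dealloc last-use buffers
    return peak


nodes = [Node(10, 0), Node(8, 10), Node(4, 8)]
collapsed = peak_collapsed(nodes)
alloc_free = peak_alloc_free(nodes)
```

In this toy schedule the collapsed estimate misses the moment when the second node's outputs and its inputs are live at the same time, which is why sampling after the alloc phase gives a higher and more realistic peak.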

State after this PR:

Iterative recomputation of memory estimations matches full memory estimations.

Active memory does not regress much, but reserved memory regresses significantly.

Investigation and a fix for "reserved" memory will come in follow-up PRs.

BASELINE (bucketing AG and RS): active: 32GiB reserved: 34GiB
```
[rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step:  1  loss: 12.2722  grad_norm:  4.2192  active_memory: 24.66GiB(25.96%)  reserved_memory: 25.38GiB(26.72%)  tps: 99  tflops: 5.71  mfu: 0.58%
[rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step:  2  loss: 13.1738  grad_norm: 50.5566  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 4,448  tflops: 257.63  mfu: 26.05%
[rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step:  3  loss: 15.6866  grad_norm: 80.0862  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,900  tflops: 341.72  mfu: 34.55%
[rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step:  4  loss: 13.4853  grad_norm:  7.8538  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,881  tflops: 340.57  mfu: 34.44%
[rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step:  5  loss: 16.1191  grad_norm: 53.2481  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,867  tflops: 339.77  mfu: 34.35%
```
REORDER: active: 32GiB reserved: 36GiB
```
[rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step:  1  loss: 12.2490  grad_norm:  4.1944  active_memory: 24.66GiB(25.96%)  reserved_memory: 26.81GiB(28.22%)  tps: 85  tflops: 4.90  mfu: 0.50%
[rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step:  2  loss: 13.1427  grad_norm: 39.5942  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 3,205  tflops: 185.61  mfu: 18.77%
[rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step:  3  loss: 14.6084  grad_norm: 51.0743  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,688  tflops: 329.44  mfu: 33.31%
[rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step:  4  loss: 13.6181  grad_norm:  8.1122  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,744  tflops: 332.68  mfu: 33.64%
[rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step:  5  loss: 15.8913  grad_norm: 59.8510  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,046  tflops: 292.22  mfu: 29.55%
```

REORDER + SINK_WAITS_ITERATIVE: active: 35GiB reserved: 41GiB
```
[rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step:  1  loss: 12.2646  grad_norm:  4.1282  active_memory: 27.60GiB(29.05%)  reserved_memory: 32.49GiB(34.20%)  tps: 173  tflops: 10.00  mfu: 1.01%
[rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step:  2  loss: 13.2353  grad_norm: 42.4234  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,152  tflops: 356.26  mfu: 36.02%
[rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step:  3  loss: 13.8205  grad_norm: 24.0156  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,169  tflops: 357.29  mfu: 36.13%
[rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step:  4  loss: 13.1033  grad_norm:  9.1167  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,183  tflops: 358.10  mfu: 36.21%
[rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step:  5  loss: 16.3530  grad_norm: 51.8118  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,130  tflops: 355.03  mfu: 35.90%
```

Differential Revision: [D80718143](https://our.internmc.facebook.com/intern/diff/D80718143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113
Approved by: https://github.com/wconstab, https://github.com/eellison

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-08-22 14:19:57 +00:00
639b8cc51d Revert "cd: Add no-cache for test binaries (#149218)"
This reverts commit 523bffd38856dc9fca36bddded64f74822a6e1a2.

Reverted https://github.com/pytorch/pytorch/pull/149218 on behalf of https://github.com/atalman due to Lets not use no-cache flags on test binaries ([comment](https://github.com/pytorch/pytorch/pull/149218#issuecomment-3214338844))
2025-08-22 13:14:23 +00:00
49ff884b1e Add CUDA 13.0 x86 builds (#160956)
https://github.com/pytorch/pytorch/issues/159779

CUDA 13.0.0
NVSHMEM 3.3.20
CUDNN 9.12.0.46

Adding x86 linux builds for CUDA 13.
Adding libtorch docker.
Package naming changed for CUDA 13 (removed postfix -cu13 for some packages).

Preparation checklist:
1. Update index https://download.pytorch.org/whl/nightly/cu130 with pypi packages
2. Update packaging name based on https://pypi.org/project/cuda-toolkit/ metadata

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160956
Approved by: https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-08-22 11:31:09 +00:00
a68f63e331 Add Windows CUDA 13 build and magma script (#161073)
Add magma build 13.0 for Windows
Add cuda_install.bat 13.0 for Windows build
https://github.com/pytorch/pytorch/issues/159779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161073
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-08-22 11:24:25 +00:00
774b4befa1 [BE] [dynamo] Simplify two methods in ConstDictVariable (#159361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159361
Approved by: https://github.com/anijain2305
2025-08-22 11:11:30 +00:00
2beffb3311 Refactoring TensorImpl by using constexpr and std::is_same_v (#161043)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161043
Approved by: https://github.com/Skylion007
2025-08-22 10:49:49 +00:00
9b4adc4db7 [fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL (#158568)
Adds support for FlightRecorder in ProcessGroupXCCL.

See https://github.com/intel/torch-xpu-ops/pull/1867 for XCCL implementation and more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158568
Approved by: https://github.com/guangyey, https://github.com/fduwjj
2025-08-22 09:03:35 +00:00
9e491f753e [dynamo] Remove extra if statement in builder _wrap (#161215)
Removes a redundant if statement. Does not impact logic so no test changes needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161215
Approved by: https://github.com/StrongerXi
2025-08-22 08:56:06 +00:00
373e25c2eb Disable background threads for XPU host allocator (#161242)
# Motivation
https://github.com/pytorch/pytorch/pull/160505 enables background threads for the XPU host allocator. However, this hangs on Windows during program exit. Disable it until we narrow down the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161242
Approved by: https://github.com/EikanWang
2025-08-22 08:40:13 +00:00
595987d28d [bucketing] allow convert_element_type after fsdp reduce_scatter (#161159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161159
Approved by: https://github.com/eellison
2025-08-22 06:41:50 +00:00
c4670e40c9 [inductor] remove Windows unsupported build options. (#161197)
Changes:
1. Math-related build options are not supported by MSVC, so skip them on Windows.
2. Move all math-related build options into the `_get_ffast_math_flags` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161197
Approved by: https://github.com/jansel
2025-08-22 06:23:43 +00:00
9b3ebd25ac [inductor] Enable max compatible to msvc for oneAPI headers. (#161196)
Enable maximum MSVC compatibility for oneAPI headers.

The key context is `The /permissive- option is compatible with almost all of the header files from the latest Windows Kits` from https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161196
Approved by: https://github.com/jansel
2025-08-22 06:23:26 +00:00
f8bd85827d Optimzie zero_grad description (#161239)
Optimize [zero_grad doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html) format and description.

## Test Result

### Before

<img width="996" height="534" alt="image" src="https://github.com/user-attachments/assets/e1db973c-57e8-4525-90e7-0500cde2263d" />

### After

<img width="890" height="496" alt="image" src="https://github.com/user-attachments/assets/5579c4fb-a857-4030-9303-34770083d1a5" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161239
Approved by: https://github.com/janeyx99
2025-08-22 06:18:25 +00:00
bc7eaa0d8a [BE] Remove the default TORCH_CUDA_ARCH_LIST in CI Docker image (#161137)
It doesn't make sense for this to default to Maxwell, which is too old. Every other place in CI/CD needs to overwrite this value. IMO, it makes more sense to not set it at all and let CI/CD jobs set it for their own use cases instead. This is partly responsible for the build failure in https://github.com/pytorch/pytorch/issues/160988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161137
Approved by: https://github.com/msaroufim
2025-08-22 06:03:11 +00:00
0dea191ff7 [VLLM TEST]setup test workflow (#160583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160583
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-08-22 05:38:39 +00:00
8aad3a60ce [dynamo] propagate tensor metadata on Tensor.__setitem__(tensor) (#161036)
Fixes silent incorrectness for autograd function tracing, where we rely on FakeTensor metadata (requires_grad) to determine whether to HOP or not: 5ee464db5c/torch/_dynamo/variables/misc.py (L671)

Stared at this with @anijain2305 yesterday, `Tensor.__setitem__` can update tensor metadata, and we can just run the fake prop and extract the output metadata from the updated FakeTensor.

FIXES https://github.com/pytorch/pytorch/issues/160901

It should also be the root cause behind the issue in https://github.com/pytorch/torchtitan/pull/1604 @bdhirsh  @ruisizhang123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161036
Approved by: https://github.com/anijain2305
ghstack dependencies: #160805
2025-08-22 04:43:22 +00:00
c7fb031706 [audio hash update] update the pinned audio hash (#161226)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161226
Approved by: https://github.com/pytorchbot
2025-08-22 04:22:08 +00:00
c60dea5261 [export] Allow tempfile._TemporaryFileWrapper in package_pt2 (#161203)
Summary:
We use tempfile.NamedTemporaryFile to create a temporary pt2 file in `test_nativert.py`

However, it is not recognized as an allowed file format and a warning will be thrown.

Test Plan:
CI

Rollback Plan:

Differential Revision: D80740916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161203
Approved by: https://github.com/angelayi
2025-08-22 04:10:35 +00:00
bf8431ba06 [inductor][cpu] Fix double-offset issue in GEMM_TEMPLATE (#159233)
Fixes #158076

Basically, the gemm template generates code like
```
cpp_CppMicroGemmRef_micro_gemm<static_cast<bool>(false), static_cast<bool>(false)>(
            &(X[static_cast<int64_t>(k_start + 196LL*m_start + 38416LL*ks_b_index)]),
            &(W[static_cast<int64_t>(200704000LL + n_start + 80LL*k_start + 15680LL*ks_b_index)]),
            &(local_acc_buf[static_cast<int64_t>(Nr*nci + ((-1LL)*Nr*nc))]),
            static_cast<int64_t>(m_end + ((-1LL)*m_start)),
            static_cast<int64_t>(Nr),
            static_cast<int64_t>(k_end + ((-1LL)*k_start)),
            static_cast<int64_t>(196LL),
            static_cast<int64_t>(80LL),
            static_cast<int64_t>(Nc_blocks*Nr)
        );
```

However, when the input tensor W has a storage offset, this results in a double offset issue. That is, the resulting pointer is `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access.

The storage offset of `W` is introduced by [this patch](https://github.com/pytorch/pytorch/pull/136421/files), but I think it's a reasonable fix. So `cpp_gemm_template.py` should handle input matrices with storage offsets properly.

I think a good way to fix this issue is to create a new matrix that has no storage offset.
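
The double-offset pattern and the proposed fix can be illustrated with a toy model in plain Python. This is a sketch, not the template code: a list stands in for the storage of `W`, and all function names here are hypothetical.

```python
storage = list(range(10))  # stand-in for W.storage()
storage_offset = 3         # W is a view that starts at storage[3]


def read_via_view(index):
    # Correct: the storage offset is applied exactly once.
    return storage[storage_offset + index]


def read_double_offset(index):
    # Bug pattern: the base pointer already points at the view start,
    # and the index expression adds the same offset again.
    base = storage_offset
    return storage[base + storage_offset + index]


def materialize_view(length):
    # Fix sketch: copy the view into a fresh buffer with no storage offset,
    # so plain indexing is correct again.
    return storage[storage_offset:storage_offset + length]


clean = materialize_view(5)
```

With a fresh buffer the index expression needs no offset term at all, so there is nothing left to apply twice. In the toy model, `read_double_offset(5)` indexes past the end of the list, mirroring the out-of-bounds access described above.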

When `should_block_weights` is true, `block_weight()` creates a clean new matrix, so that branch is not affected by this issue.

BTW I've also examined the FX IRs generated by `torch.compile()`, as well as the generated python module, and they are correct.

The newly-added test in `test_cpu_select_algorithm.py` can reproduce the issue. With this patch, the crash is fixed. It also resolves the crash reported in #158076.

I ran the CPU tests in `test_cpu_select_algorithm.py`, but many of them are skipped due to MKL and AMX. I'd appreciate it if someone could help verify the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159233
Approved by: https://github.com/leslie-fang-intel, https://github.com/swolchok
2025-08-22 03:47:28 +00:00
2fdd4f918c Log exception_stack_trace to dynamo_compile (#161096)
Note: Adding a unit test for this is tricky, as an error in that specific unit test would cause test_utils.py to crash altogether.

Tested as follows:
1. Added x = 1/0 after guarded_code = compile_inner(code, one_graph, hooks, transform) in convert_frame.py
2. Printed exception_stack_trace and got: ['Traceback (most recent call last):\n  File "/data/users/jovian/pytorch/torch/_dynamo/convert_frame.py", line 1207, in _compile\n    x = 1/0\n        ~^~\nZeroDivisionError: division by zero\n']

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161096
Approved by: https://github.com/c00w
2025-08-22 03:29:15 +00:00
31a41daff4 [ROCm][Windows] Include native_transformers srcs to fix link errors. (#160373)
Following up on https://github.com/pytorch/pytorch/pull/152951#discussion_r2267714825, this removes a few lines added in that pull request, fixing link errors like
```
[7019/7028] Linking CXX shared library bin\torch_hip.dll
FAILED: [code=4294967295] bin/torch_hip.dll lib/torch_hip.lib
C:\Windows\system32\cmd.exe /C "cd . && D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\cmake\data\bin\cmake.exe -E vs_link_dll --msvc-ver=1942 --intdir=caffe2\CMakeFiles\torch_hip.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100261~1.0\x64\rc.exe --mt=C:\PROGRA~2\MICROS~2\2022\BUILDT~1\VC\Tools\Llvm\x64\bin\llvm-mt.exe --manifests  -- D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp  /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO && cd ."
LINK: command "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /MANIFEST:EMBED,ID=2" failed (exit code 1) with the following output:
lld-link: error: undefined symbol: __declspec(dllimport) class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::native::transform_bias_rescale_qkv_cuda(class at::Tensor const &, class at::Tensor const &, __int64)
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_CUDA___transform_bias_rescale_qkv(class 0xE9BF7323::Tensor const &, class 0xE9BF7323::Tensor const &, __int64))
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterNestedTensorCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_NestedTensorCUDA___transform_bias_rescale_qkv(class 0xEFEB5304::Tensor const &, class 0xEFEB5304::Tensor const &, __int64))
```

The `native_transformers_hip_hip` and `native_transformers_hip_cpp` sources are okay to define (and are required) even if accelerated versions of these operations are not available.

I've tested downstream builds of torch with ROCm on native Windows via https://github.com/ROCm/TheRock both with and without aotriton and these changes were needed for the build to succeed in both cases. I have _not_ tested Linux, WSL, or with the HIP SDK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160373
Approved by: https://github.com/alugorey, https://github.com/jeffdaily
2025-08-22 01:43:25 +00:00
cc791d5857 Quick fix to headers in stable/tensor_inl.h (#161168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161168
Approved by: https://github.com/mikaylagawarecki, https://github.com/Skylion007
2025-08-22 01:27:44 +00:00
be2e6b3158 [export] Remove unused Model, tensor_paths, constant_paths (#161185)
Summary:
Removed `Model`, it's not being used anywhere so it's safe.

Removed `tensor_paths` and `constant_paths` fields in `ExportedProgram`
- BC: when the current deserializer loads a previously serialized EP (which comes with empty `tensor_paths` and `constant_paths`), it will just ignore those two fields
- FC: when the old deserializer loads a newly serialized EP (which doesn't come with `tensor_paths` and `constant_paths`), it will also ignore those two fields in `_dict_to_dataclass()`

Differential Revision: D80725094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161185
Approved by: https://github.com/SherlockNoMad
2025-08-22 01:07:01 +00:00
a85711d565 Avoid making node a successor/predecessor of itself (#161205)
This fixes an assertion we were running into in memory planning about the graph not being acyclic. The repro is very long, so it is hard to turn into a local test, but this change fixes the repro I am looking at.
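
A minimal sketch of the guard, assuming a dependency map from each node to its inputs. The function `build_edges` and the dictionary layout are hypothetical and only illustrate why a self-edge would break an acyclicity check.

```python
from collections import defaultdict


def build_edges(deps):
    """Build successor/predecessor sets; deps maps node -> nodes it reads."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for node, inputs in deps.items():
        for inp in inputs:
            if inp == node:
                # A node that reads its own output would form a 1-cycle,
                # so it is never recorded as its own successor/predecessor.
                continue
            succ[inp].add(node)
            pred[node].add(inp)
    return succ, pred


succ, pred = build_edges({"a": [], "b": ["a", "b"]})
```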

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161205
Approved by: https://github.com/IvanKobzarev, https://github.com/bdhirsh
2025-08-22 00:30:29 +00:00
ff4f5dd8ed [nativert] oss layout planner tests (#160942)
Summary: att - changed one of the tests to get rid of torcharrow dep.

Test Plan:
```
buck2 test //caffe2/test/cpp/nativert:layout_planner_tests
Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Rollback Plan:

Reviewed By: SherlockNoMad

Differential Revision: D80108549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160942
Approved by: https://github.com/georgiaphillips, https://github.com/henryoier
2025-08-22 00:26:25 +00:00
46429be723 [DCP][HF] Add option to parallelize reads in HF Storage Reader (#160205)
Parallelize reading of data, controlled by the thread_count argument to HFStorageReader.
Test plan: ensure existing tests pass and run a job successfully with these changes
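
The general pattern of gating parallel reads behind a thread count can be sketched as below. This is not the HFStorageReader code: `read_all` and `read_one` are hypothetical names, and the sketch only assumes that each read is independent of the others.

```python
from concurrent.futures import ThreadPoolExecutor


def read_all(paths, read_one, thread_count=1):
    """Read every path, optionally fanning out across worker threads."""
    if thread_count <= 1:
        # Sequential path, matching the behavior before parallelization.
        return [read_one(p) for p in paths]
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(read_one, paths))


results = read_all(["a", "b", "c"], str.upper, thread_count=2)
```

Defaulting `thread_count` to 1 in the sketch keeps the sequential path available as a fallback.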

Differential Revision: [D79478188](https://our.internmc.facebook.com/intern/diff/D79478188/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160205
Approved by: https://github.com/meetv18
2025-08-21 23:58:02 +00:00
f5bf5147ad Bump uv from 0.8.4 to 0.8.6 in /.ci/lumen_cli (#161212)
Bumps [uv](https://github.com/astral-sh/uv) from 0.8.4 to 0.8.6.
- [Release notes](https://github.com/astral-sh/uv/releases)
- [Changelog](https://github.com/astral-sh/uv/blob/main/CHANGELOG.md)
- [Commits](https://github.com/astral-sh/uv/compare/0.8.4...0.8.6)

---
updated-dependencies:
- dependency-name: uv
  dependency-version: 0.8.6
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-08-21 15:54:34 -07:00
fc0683b1e7 Revert "[ATen][CPU][Sparse] Use Third-Party Eigen for sparse add and addmm (#155357)"
This reverts commit ce048de608180fa88335e5821070472539968b54.

Reverted https://github.com/pytorch/pytorch/pull/155357 on behalf of https://github.com/seemethere due to This is causing buck builds to fail since we didn't add the definition of AT_USE_EIGEN_SPARSE in the buckbuild.bzl file, will follow-up and re-land this. ([comment](https://github.com/pytorch/pytorch/pull/155357#issuecomment-3212270510))
2025-08-21 22:38:40 +00:00
cb57953215 [BE] Enable test_index_put_accumulate_duplicate_indices on MPS (#161201)
By changing dtype to float if device is MPS

Note: for some reason test runs much longer on MPS than on CPU
```
% python ../test/test_indexing.py -v -k test_index_put_accumulate_duplicate_indices_mps
test_index_put_accumulate_duplicate_indices_mps (__main__.TestIndexingMPS.test_index_put_accumulate_duplicate_indices_mps) ... ok

----------------------------------------------------------------------
Ran 1 test in 9.139s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161201
Approved by: https://github.com/dcci
2025-08-21 22:05:42 +00:00
f085f29958 [Inductor] Update Outer Reduction Heuristic (#159093)
Update outer reduction heuristics for significant speedups.

HuggingFace:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 44 51 AM" src="https://github.com/user-attachments/assets/4872a23b-d136-423a-b2e6-187895bccba1" />

Average ~20% speedup on a kernel by kernel basis

TorchBench:
<img width="572" height="705" alt="Screenshot 2025-08-20 at 12 45 10 AM" src="https://github.com/user-attachments/assets/b8357b6d-6107-4104-b906-292a17d14d48" />

Average ~40% speedup on a kernel by kernel basis

<img width="1705" height="729" alt="Screenshot 2025-08-21 at 5 50 32 PM" src="https://github.com/user-attachments/assets/a9715a2b-9e6c-4b33-ba9f-7870dc561e31" />

Differential Revision: [D80630416](https://our.internmc.facebook.com/intern/diff/D80630416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159093
Approved by: https://github.com/jansel
2025-08-21 22:02:49 +00:00
d1faf2ef04 [DTensor] Make default RNG semantics match user-passed generator (#160482)
Previously, DTensor kept its own copy of the generator state after the
first time a random operator was called on a DTensor. This copy would
evolve independently from the generator outside of DTensor.

After adding support for users to pass a specific generator into
random operators (e.g. `uniform_(..., generator=)`), it was determined
(in discussion on #159991) to change the semantics so that any random
operations performed on DTensor would evolve the state of the publicly
visible generators (either the default one or user-passed one).

The upsides are (1) it is now possible to call torch.manual_seed() at
any point in the program and have a consistent effect on DTensor, (2)
DTensor ops have an observable effect on the generator.  The downside is
that users are now responsible for seeding their generator before using
DTensor, ensuring all ranks use the same seed.

Fixes #159991

confirmed docs rendered OK

<img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160482
Approved by: https://github.com/wanchaol
2025-08-21 22:02:16 +00:00
cc2b65a91a [VLLM]setup test cli logics (#160361)
setup vllm test logics.
1.  install wheels generated from previous build stage
2. generate and install vllm test pkg list on run time based on the torch wheels in the instance
3. run test based on the pre-defined test plan

notice the test-plan format is temporary for some basic vllm testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160361
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-08-21 21:59:41 +00:00
67fc16c744 Add profiler analysis flag to combine multiple profiles into one (#161145)
Combine multiple profiles into one:
```
python profile_analysis.py --combine <file1> <file2> ... <out>
```
This only works well if they have different pids, like from different programs in a distributed run.
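
A rough sketch of what combining could look like, assuming the profiles are Chrome-trace-style dictionaries with a `traceEvents` list. The function `combine_traces` and the sample events are hypothetical and do not reproduce the logic of `profile_analysis.py`.

```python
def combine_traces(traces):
    """Concatenate the event lists of several trace dictionaries."""
    combined = {"traceEvents": []}
    for trace in traces:
        combined["traceEvents"].extend(trace.get("traceEvents", []))
    return combined


# Two made-up single-event profiles from different processes.
rank0 = {"traceEvents": [
    {"name": "matmul", "ph": "X", "pid": 100, "tid": 1, "ts": 0, "dur": 5},
]}
rank1 = {"traceEvents": [
    {"name": "matmul", "ph": "X", "pid": 200, "tid": 1, "ts": 0, "dur": 7},
]}
merged = combine_traces([rank0, rank1])
```

Because events are simply concatenated in this sketch, two inputs that shared a pid would have their events interleaved under one process row in a viewer, which is consistent with the note above about needing distinct pids.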

<img width="1521" height="465" alt="combining_multiple_profiles" src="https://github.com/user-attachments/assets/aba7112b-e9a9-4075-b82b-a4e4408384da" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161145
Approved by: https://github.com/xmfan
2025-08-21 21:36:58 +00:00
fb241d0a44 [dcp][hf] Fix multi-rank consolidation for no files to process case (#160660)
Summary: In the consolidate_safetensors_files_on_every_rank method, where multiple ranks combine sharded safetensors files, some ranks have no work to do when the world size exceeds the number of safetensors files to consolidate. My earlier testing did not catch this case: there was an extra barrier call, which caused issues for the ranks that had no work to do. Those ranks should wait at the end, as the ranks with work do.
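The barrier behaviour described above can be sketched with threads standing in for ranks. This is an illustration only: the round-robin file assignment and the names `consolidate` and `run` are assumptions, not the actual DCP code.

```python
import threading


def consolidate(rank, world_size, files, barrier, done):
    # Assumed round-robin assignment: rank r handles files r, r + world_size, ...
    my_files = files[rank::world_size]
    for name in my_files:
        done.append((rank, name))  # stand-in for the real consolidation work
    # Every rank, with or without work, reaches the same single barrier.
    # An extra barrier on only one code path would leave some ranks waiting
    # for a rank that never arrives.
    barrier.wait(timeout=5)


def run(world_size, files):
    barrier = threading.Barrier(world_size)
    done = []
    threads = [
        threading.Thread(target=consolidate, args=(r, world_size, files, barrier, done))
        for r in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done)
```

With four ranks and two files, ranks 2 and 3 get an empty slice and go straight to the barrier, which is the case the fix targets.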

Test Plan:
tested this case on a job e2e
added a unit test

Rollback Plan:

Differential Revision: D80273616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160660
Approved by: https://github.com/sibuachu
2025-08-21 21:18:03 +00:00
d2b8c0d431 forward fix of #152198 (#161166)
torch._inductor.virtualized.OpsValue instances do not have a shape attribute, which breaks the fp8 test on ROCm. Add the OpsValue class to the todo list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161166
Approved by: https://github.com/jeffdaily
2025-08-21 21:09:48 +00:00
e25ee0290e Fix constant_pad_nd_mps bug when pad is empty (#161149)
Fixes #161066

There is a size check here, which causes the error.
8ce81bcee1/aten/src/ATen/native/mps/operations/Pad.mm (L39-L40)

If the argument `pad` is empty, it will return the cloned tensor on CPU.

8ce81bcee1/aten/src/ATen/native/PadNd.cpp (L43-L64)

Therefore, this PR fixes the empty padding argument error by checking the size first and returning a cloned tensor immediately if the padding size is 0.
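The early-return pattern described above can be sketched in plain Python. This is an analogy on lists, not the MPS code in Pad.mm; `constant_pad_1d` and its arguments are hypothetical.

```python
def constant_pad_1d(values, pad, fill=0):
    """Pad a 1-D sequence; an empty `pad` returns a copy instead of failing a size check."""
    if len(pad) == 0:
        # Check the size first and return a clone immediately.
        return list(values)
    if len(pad) != 2:
        raise ValueError("pad must be empty or (left, right)")
    left, right = pad
    if left < 0 or right < 0:
        raise ValueError("negative padding is not handled in this sketch")
    return [fill] * left + list(values) + [fill] * right
```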

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161149
Approved by: https://github.com/malfet
2025-08-21 20:45:26 +00:00
5805c4210b [invoke_subgraph][inductor] Thread graphsafe rng input states for hops (#160713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160713
Approved by: https://github.com/eellison
2025-08-21 20:41:29 +00:00
db38c44ad6 [inductor] add libraries_dirs for level_zero (#161146)
Changes:
1. Append to `include_dirs` instead of setting it.
2. Append `libraries_dirs` for level_zero.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161146
Approved by: https://github.com/angelayi
2025-08-21 19:55:12 +00:00
1e3fe78a10 [inductor] disable min/max macro on Windows. (#161133)
Disable min/max macro on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161133
Approved by: https://github.com/angelayi
2025-08-21 19:52:56 +00:00
a445b41e4f [pytorch] Simplify PyTorch foreach_* API restrictions check (#161039)
Summary: C++'s polymorphism and component reuse help us reduce the amount of boilerplate code here.

Test Plan:
CI & tests

Rollback Plan:

Differential Revision: D80594353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161039
Approved by: https://github.com/janeyx99
2025-08-21 19:50:02 +00:00
801851086d [pytorch] Invoke vector.reserve() consistently for non-inplace foreach operations (#161128)
Summary:
The `reserve()` method is used to pre-allocate memory for the result vector before adding elements to it. This is an optimization that makes sense for several reasons:

1. Performance improvement: By pre-allocating memory for the exact number of elements needed, it avoids multiple reallocations and memory copies that would occur as the vector grows dynamically.

2. Memory efficiency: It requests capacity for the known final number of elements up front, rather than leaving behind the over-allocation that geometric growth can cause, which is efficient when we know the final size in advance.

3. Reduced overhead: Each reallocation typically involves:
- Allocating a new, larger block of memory
- Copying all existing elements to the new location
- Destroying the old elements
- Deallocating the old memory block

4. Consistent performance: Without reservation, vector growth typically follows a geometric progression (like 1, 2, 4, 8, 16...), which can lead to unpredictable performance spikes when reallocation occurs.

Test Plan:
OSS CI & tests

Rollback Plan:

Differential Revision: D80674453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161128
Approved by: https://github.com/Skylion007
2025-08-21 19:43:11 +00:00
958f9ca88e [nativert] oss static kernel tests (#161087)
Summary: As the title says; should be a no-op.

Test Plan:
buck2 test //caffe2/test/cpp/nativert:static_kernel_ops_tests
Tests finished: Pass 24. Fail 0. Fatal 0. Skip 0. Build failure 0

Rollback Plan:

Differential Revision: D80216488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161087
Approved by: https://github.com/georgiaphillips, https://github.com/henryoier
2025-08-21 19:42:21 +00:00
9668210302 Allow bypasses for Precompile when guards, etc. cannot be serialized (#160902)
This adds a new function `bypass_package` and `CompilePackage.bypass_current_entry()`. This allows us to safely bypass if there are models with unserializable or incompatible parts. When we encounter something incompatible, we'll raise a bypass and ignore that particular code in DynamoCodeEntry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160902
Approved by: https://github.com/zhxchen17
2025-08-21 18:20:42 +00:00
3f5a8e2003 Fix torchaudio build when TORCH_CUDA_ARCH_LIST is not set (#161084)
Fixes https://github.com/pytorch/pytorch/issues/160988.  The root cause can be found in the same issue.  This fix ensures that when old-wheel reuse is on and the `torchaudio` wheel is missing, the inductor test job can still rebuild the wheel it needs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161084
Approved by: https://github.com/malfet, https://github.com/zou3519
2025-08-21 17:38:32 +00:00
3dacaf0e1e [aoti-fx] Add meta["val"] metadata (#161019)
Summary: Added a `_set_node_metadata_hook` which automatically adds node.meta["val"] to every new node that gets created under this context.

Test Plan:
`buck2 test //mtia/host_runtime/afg/tests:test_dynamic_shapes_advanced_ops`
https://www.internalfb.com/buck2/866439a2-2ba6-42d1-8e43-508d60456e2e

`buck2 test //mtia/host_runtime/afg/tests:test_dynamic_shapes_basic_ops`
https://www.internalfb.com/intern/testinfra/testrun/11540474149662857

Rollback Plan:

Differential Revision: D80579336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161019
Approved by: https://github.com/blaine-rister
2025-08-21 16:45:41 +00:00
a6401cb5aa Revert "flip the list-as-tuple behavior for short lists (#160794)"
This reverts commit febfc3ec03004116dfd6d504e6853ff02a1dd6e0.

Reverted https://github.com/pytorch/pytorch/pull/160794 on behalf of https://github.com/seemethere due to This is failing internal tests, see D80671241 ([comment](https://github.com/pytorch/pytorch/pull/160794#issuecomment-3211314867))
2025-08-21 16:33:30 +00:00
7006fd0c88 Revert "[inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113)"
This reverts commit 517d38d3406abbba35d0694bff259a698cad3ec9.

Reverted https://github.com/pytorch/pytorch/pull/160113 on behalf of https://github.com/IvanKobzarev due to Segment tree starts failing on trunk even ciflows/trunk passed on PR ([comment](https://github.com/pytorch/pytorch/pull/160113#issuecomment-3211286092))
2025-08-21 16:22:44 +00:00
517d38d340 [inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113)
1. Applying @eellison's idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for estimate_peak_memory:
```
    """
    Alternative version of estimate_peak_memory, that respects the fact,
    that every SchedulerNode has multiple phases:
    1. alloc ( outputs )
    2. run_kernel
    3. dealloc last_use buffers
    estimate_peak_memory collapses memory into one value: size_alloc - size_free
    While peak memory happens after alloc.

    Duplicating the code to not migrate all callsites at once,
    In future usages of estimate_peak_memory will migrate to this version.
    """
```

- Applying this in `reorder_communication_preserving_peak_memory` pass.

2. A buffer's deallocation point can change during reordering if the candidate and the group being swapped are both users of f_input_buf and the group contains last_use_snode.

- Addressing this by tracking the last_use_snode for each buffer and recomputing current memory to reflect the change in size_free: after reordering, group_node is no longer the last user of the buffer, so its size_free -= buffer_size, while the candidate becomes the last user, so candidate.size_free += buffer_size.

3. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` for ablation, to limit the number of collectives to reorder.
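The alloc / run_kernel / dealloc phases from the docstring above can be illustrated with a small sketch. It assumes each node is reduced to a `(size_alloc, size_free)` pair; the function names are hypothetical and this is not the inductor implementation.

```python
def peak_collapsed(nodes):
    """Collapse each node into one net value: size_alloc - size_free."""
    cur = peak = 0
    for size_alloc, size_free in nodes:
        cur += size_alloc - size_free
        peak = max(peak, cur)
    return peak


def peak_alloc_free(nodes):
    """Measure the peak right after alloc, before last-use buffers are freed."""
    cur = peak = 0
    for size_alloc, size_free in nodes:
        cur += size_alloc        # phase 1: alloc outputs
        peak = max(peak, cur)    # phase 2: kernel runs with inputs and outputs live
        cur -= size_free         # phase 3: dealloc last_use buffers
    return peak


# Three nodes; the second allocates 6 while its 10-unit input is still live.
nodes = [(10, 0), (6, 10), (4, 6)]
```

On this input the collapsed estimate is 10, while the phase-aware estimate is 16, because the second node's output and its input are live at the same time.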

State after this PR:

Iterative recomputation of memory estimations matches full memory estimations.

Active memory does not regress much, but reserved memory regresses significantly.

Investigation and a fix for "reserved" memory will follow in later PRs.

BASELINE (bucketing AG and RS): active: 32GiB reserved: 34GiB
```
[rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step:  1  loss: 12.2722  grad_norm:  4.2192  active_memory: 24.66GiB(25.96%)  reserved_memory: 25.38GiB(26.72%)  tps: 99  tflops: 5.71  mfu: 0.58%
[rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step:  2  loss: 13.1738  grad_norm: 50.5566  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 4,448  tflops: 257.63  mfu: 26.05%
[rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step:  3  loss: 15.6866  grad_norm: 80.0862  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,900  tflops: 341.72  mfu: 34.55%
[rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step:  4  loss: 13.4853  grad_norm:  7.8538  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,881  tflops: 340.57  mfu: 34.44%
[rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step:  5  loss: 16.1191  grad_norm: 53.2481  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,867  tflops: 339.77  mfu: 34.35%
```
REORDER: active: 32GiB reserved: 36GiB
```
[rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step:  1  loss: 12.2490  grad_norm:  4.1944  active_memory: 24.66GiB(25.96%)  reserved_memory: 26.81GiB(28.22%)  tps: 85  tflops: 4.90  mfu: 0.50%
[rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step:  2  loss: 13.1427  grad_norm: 39.5942  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 3,205  tflops: 185.61  mfu: 18.77%
[rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step:  3  loss: 14.6084  grad_norm: 51.0743  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,688  tflops: 329.44  mfu: 33.31%
[rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step:  4  loss: 13.6181  grad_norm:  8.1122  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,744  tflops: 332.68  mfu: 33.64%
[rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step:  5  loss: 15.8913  grad_norm: 59.8510  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,046  tflops: 292.22  mfu: 29.55%
```

REORDER + SINK_WAITS_ITERATIVE: active: 35GiB reserved: 41GiB
```
[rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step:  1  loss: 12.2646  grad_norm:  4.1282  active_memory: 27.60GiB(29.05%)  reserved_memory: 32.49GiB(34.20%)  tps: 173  tflops: 10.00  mfu: 1.01%
[rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step:  2  loss: 13.2353  grad_norm: 42.4234  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,152  tflops: 356.26  mfu: 36.02%
[rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step:  3  loss: 13.8205  grad_norm: 24.0156  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,169  tflops: 357.29  mfu: 36.13%
[rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step:  4  loss: 13.1033  grad_norm:  9.1167  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,183  tflops: 358.10  mfu: 36.21%
[rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step:  5  loss: 16.3530  grad_norm: 51.8118  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,130  tflops: 355.03  mfu: 35.90%
```

Differential Revision: [D79886535](https://our.internmc.facebook.com/intern/diff/D79886535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113
Approved by: https://github.com/wconstab, https://github.com/eellison

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-08-21 15:45:06 +00:00
3caddd4daa [ROCm] SDPA fix mem fault when dropout is enabled (#154864)
Fixes an issue that caused a device-side memory access fault due to incorrect tensor lifetime management.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154864
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-21 14:23:13 +00:00
18271148d3 [dist] expose unsafe_get_ptr for dist.ProcessGroupNCCL.NCCLConfig (#161136)
Expose the pointer so that we can create the `ncclConfig_t` object from PyTorch and use it elsewhere. This is useful for controlling the NCCL communicator parameters across multiple NCCL communicators.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161136
Approved by: https://github.com/kwen2501
2025-08-21 10:47:03 +00:00
a941d7ffe5 [Quant][CPU] Avoid NaN in fp8 output of qlinear and qconv (#160957)
**Summary**
When the output dtype is fp8, oneDNN does not ensure that intermediate results are within the range [-448, 448] before converting to fp8. So, we may get NaN in the output, which is a disaster for inference. This PR fixes this issue by clamping the intermediate results with oneDNN's post-op clip.
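The effect of clamping before the conversion can be shown with a toy model. The assumption here is that any value outside [-448, 448] becomes NaN on conversion; real fp8 rounding is not modelled, and `to_fp8_toy` and `clip` are hypothetical stand-ins for the oneDNN conversion and its post-op clip.

```python
import math

FP8_MAX = 448.0  # range bound quoted in the summary above


def to_fp8_toy(x):
    """Toy stand-in for a non-saturating fp8 cast: out-of-range values become NaN."""
    return x if abs(x) <= FP8_MAX else math.nan


def clip(x, lo=-FP8_MAX, hi=FP8_MAX):
    """Clamp an intermediate result into the representable range."""
    return max(lo, min(hi, x))


acc = [1.5, 500.0, -1000.0]                       # intermediate results
unclamped = [to_fp8_toy(v) for v in acc]          # second and third are NaN
clamped = [to_fp8_toy(clip(v)) for v in acc]      # saturates instead of NaN
```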

**Test plan**
```
pytest -sv test/quantization/core/test_quantized_op.py -k "q and fp8"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160957
Approved by: https://github.com/Valentine233, https://github.com/CaoE
2025-08-21 08:36:21 +00:00
acb00d3ccf Revert "Fix torchaudio build when TORCH_CUDA_ARCH_LIST is not set (#161084)"
This reverts commit cfdaaaaa26d7f34427ba941569eca46f02f79f3e.

Reverted https://github.com/pytorch/pytorch/pull/161084 on behalf of https://github.com/huydhn due to My mistake in not checking for nvidia-smi availability ([comment](https://github.com/pytorch/pytorch/pull/161084#issuecomment-3209498435))
2025-08-21 08:17:04 +00:00
bd5857a1d6 Revert "[inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113)"
This reverts commit 9d18bf01b1661d227f6af41ac07a1e9ef20a9e1a.

Reverted https://github.com/pytorch/pytorch/pull/160113 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but lots of failures showing up after this lands ([comment](https://github.com/pytorch/pytorch/pull/160113#issuecomment-3209487237))
2025-08-21 08:13:33 +00:00
23b033452f [Inductor][CPP] Fix layout for local buf in outer loop fusion (#160857)
Fixes #159154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160857
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-08-21 06:00:04 +00:00
2f50ae7d20 [nativert] make runtime const folding aware of run_const_graph (#160760)
Summary: It is possible to have foldable nodes whose inputs are themselves folded by run_const_graph.

Test Plan:
CI

Rollback Plan:

Differential Revision: D80355542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160760
Approved by: https://github.com/SherlockNoMad
2025-08-21 05:22:03 +00:00
9d18bf01b1 [inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113)
1. Applying @eellison's idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for estimate_peak_memory:
```
    """
    Alternative version of estimate_peak_memory, that respects the fact,
    that every SchedulerNode has multiple phases:
    1. alloc ( outputs )
    2. run_kernel
    3. dealloc last_use buffers
    estimate_peak_memory collapses memory into one value: size_alloc - size_free
    While peak memory happens after alloc.

    Duplicating the code to not migrate all callsites at once,
    In future usages of estimate_peak_memory will migrate to this version.
    """
```

- Applying this in `reorder_communication_preserving_peak_memory` pass.

2. A buffer's deallocation point can change during reordering if the candidate and the group being swapped are both users of f_input_buf and the group contains last_use_snode.

- Addressing this by tracking the last_use_snode for each buffer and recomputing current memory to reflect the change in size_free: after reordering, group_node is no longer the last user of the buffer, so its size_free -= buffer_size, while the candidate becomes the last user, so candidate.size_free += buffer_size.

3. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` for ablation, to limit the number of collectives to reorder.

State after this PR:

Iterative recomputation of memory estimations matches full memory estimations.

Active memory does not regress much, but reserved memory regresses significantly.

Investigation and a fix for "reserved" memory will follow in later PRs.

BASELINE (bucketing AG and RS): active: 32GiB reserved: 34GiB
```
[rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step:  1  loss: 12.2722  grad_norm:  4.2192  active_memory: 24.66GiB(25.96%)  reserved_memory: 25.38GiB(26.72%)  tps: 99  tflops: 5.71  mfu: 0.58%
[rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step:  2  loss: 13.1738  grad_norm: 50.5566  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 4,448  tflops: 257.63  mfu: 26.05%
[rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step:  3  loss: 15.6866  grad_norm: 80.0862  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,900  tflops: 341.72  mfu: 34.55%
[rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step:  4  loss: 13.4853  grad_norm:  7.8538  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,881  tflops: 340.57  mfu: 34.44%
[rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step:  5  loss: 16.1191  grad_norm: 53.2481  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,867  tflops: 339.77  mfu: 34.35%
```
REORDER: active: 32GiB reserved: 36GiB
```
[rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step:  1  loss: 12.2490  grad_norm:  4.1944  active_memory: 24.66GiB(25.96%)  reserved_memory: 26.81GiB(28.22%)  tps: 85  tflops: 4.90  mfu: 0.50%
[rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step:  2  loss: 13.1427  grad_norm: 39.5942  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 3,205  tflops: 185.61  mfu: 18.77%
[rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step:  3  loss: 14.6084  grad_norm: 51.0743  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,688  tflops: 329.44  mfu: 33.31%
[rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step:  4  loss: 13.6181  grad_norm:  8.1122  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,744  tflops: 332.68  mfu: 33.64%
[rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step:  5  loss: 15.8913  grad_norm: 59.8510  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,046  tflops: 292.22  mfu: 29.55%
```

REORDER + SINK_WAITS_ITERATIVE: active: 35GiB reserved: 41GiB
```
[rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step:  1  loss: 12.2646  grad_norm:  4.1282  active_memory: 27.60GiB(29.05%)  reserved_memory: 32.49GiB(34.20%)  tps: 173  tflops: 10.00  mfu: 1.01%
[rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step:  2  loss: 13.2353  grad_norm: 42.4234  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,152  tflops: 356.26  mfu: 36.02%
[rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step:  3  loss: 13.8205  grad_norm: 24.0156  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,169  tflops: 357.29  mfu: 36.13%
[rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step:  4  loss: 13.1033  grad_norm:  9.1167  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,183  tflops: 358.10  mfu: 36.21%
[rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step:  5  loss: 16.3530  grad_norm: 51.8118  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,130  tflops: 355.03  mfu: 35.90%
```

Differential Revision: [D79886535](https://our.internmc.facebook.com/intern/diff/D79886535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113
Approved by: https://github.com/wconstab, https://github.com/eellison

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-08-21 05:19:38 +00:00
67b98da1b2 [nativert] oss static kernel test utils (#161086)
Summary: As the title says; should be a no-op.

Test Plan:
ci

Rollback Plan:

Differential Revision: D80214768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161086
Approved by: https://github.com/georgiaphillips
2025-08-21 04:49:06 +00:00
b0420d2438 [vllm hash update] update the pinned vllm hash (#161121)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161121
Approved by: https://github.com/pytorchbot
2025-08-21 04:21:09 +00:00
6096d277c5 [audio hash update] update the pinned audio hash (#161021)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161021
Approved by: https://github.com/pytorchbot
2025-08-21 04:20:56 +00:00
cfdaaaaa26 Fix torchaudio build when TORCH_CUDA_ARCH_LIST is not set (#161084)
Fixes https://github.com/pytorch/pytorch/issues/160988.  The root cause can be found in the same issue.  This fix ensures that when old-wheel reuse is on and the `torchaudio` wheel is missing, the inductor test job can still rebuild the wheel it needs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161084
Approved by: https://github.com/malfet, https://github.com/zou3519
2025-08-21 03:47:15 +00:00
117f11adb4 [FlexAttention][TF32] Handle uninitialized torch.backends.cuda.matmul.fp32_precision (#161102)
For https://github.com/pytorch/pytorch/issues/161022
The warning says the old API will be deprecated in 2.9+ anyway, leaving it up to the author of #125888 to decide on initialization behavior then

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161102
Approved by: https://github.com/ngimel, https://github.com/drisspg, https://github.com/BoyuanFeng
2025-08-21 03:36:52 +00:00
a154c2093c remove redundant installation (#160634)
Fixes #160302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160634
Approved by: https://github.com/sekyondaMeta, https://github.com/malfet
2025-08-21 03:31:12 +00:00
39862acb2e [CPU][Inductor] improve performance of A16W4 GEMM template (#159127)
**Summary**
This PR improves the performance of the A16W4 GEMM template by removing the boundary check on prefetch in the kernel code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159127
Approved by: https://github.com/CaoE
2025-08-21 03:16:26 +00:00
9a41570199 [rfc] add hint_override kwarg to mark_dynamic (#161007)
The motivation for this change can be seen through the following example:

```
import torch

GPU_TYPE = "cuda"

@torch.compile
def no_override(x):
    return x.sum(dim=0)

@torch.compile
def override(x):
    return x.sum(dim=0)

x_small = torch.randn(4096, 512, device=GPU_TYPE)
no_override(x_small)
torch._dynamo.decorators.mark_dynamic(x_small, 0, hint_override=4096 * 1000)
override(x_small)
```

Previously, when reductions were split, codegen relied only on the first observed shape. With a small input, this resulted in a small split size:

```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, ks0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
    xnumel = 16384
    rnumel = r0_numel
```

With the new scheme, inductor honors hint_override during codegen, producing larger and more appropriate split sizes:

```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, ks0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
    xnumel = 1024000
    rnumel = r0_numel
```

This addresses a broader problem with dynamism: performance and numerics previously depended on whichever shape was seen first. For example:

```
f(s0) -> f(s2)
f(s1) -> f(s2)
```

could generate different kernels. With the new approach, an explicit override pins the chosen configuration:

```
f(s0, hint_override=s0) -> f(s2)
f(s1, hint_override=s0) -> f(s2)
```

ensuring consistent kernel generation regardless of input order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161007
Approved by: https://github.com/jansel
2025-08-21 02:22:52 +00:00
f9875166a9 Revert "[FSDP][Collectives] skipping reduce_scatter when world size is 1 (#160136)"
This reverts commit 3d126e17e0c2630031e7a359d6a6fd1dbe52c4f7.

Reverted https://github.com/pytorch/pytorch/pull/160136 on behalf of https://github.com/jithunnair-amd due to Sorry, but looks like this broke ROCm distributed CI ([comment](https://github.com/pytorch/pytorch/pull/160136#issuecomment-3208632921))
2025-08-21 01:34:19 +00:00
6b5be1f4a0 Revert "[FSDP][Replicate] replicate tests for param registration and input device movements (#160147)"
This reverts commit a3a82e3da85a53afc4bbf3d75bd3d3dcc2e06645.

Reverted https://github.com/pytorch/pytorch/pull/160147 on behalf of https://github.com/jithunnair-amd due to Sorry, but looks like this broke ROCm distributed CI ([comment](https://github.com/pytorch/pytorch/pull/160136#issuecomment-3208632921))
2025-08-21 01:34:19 +00:00
0924304e72 [AOTI] Add a new config cpp.use_constexpr_for_int_array (#160927)
Summary: Defaults to True, so behavior is the same as before, but the option is now configurable.

Differential Revision: D80185094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160927
Approved by: https://github.com/henryoier
2025-08-21 01:16:27 +00:00
d875d3ca1e don't try to set lazy module loading env var (#161103)
This is not needed on drivers >=525, and DriverAPI::get() initializes the context anyway, so setting the environment variable after that is beside the point.
As a result of calling DriverAPI::get() on systems that don't have GPUs available (e.g. due to CUDA_VISIBLE_DEVICES=""), people were getting confusing errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161103
Approved by: https://github.com/eqy, https://github.com/malfet
2025-08-21 01:06:51 +00:00
a825557ed5 Workaround ATen SFINAE under libc++ (#161101)
The existing logic here that works around SFINAE issues on Microsoft platforms also applies to libc++ platforms. It appears that nvcc reports ambiguity in overload resolution for `pow_`. This seems like an nvcc limitation.

```
fbcode/caffe2/aten/src/ATen/native/cuda/Pow.cuh(42): error: more than one instance of overloaded function "pow" matches the argument list:
            function template "std::__2::enable_if<<expression>, std::__2::__promote<_A1, _A2, void>>::type::type pow(_A1, _A2) noexcept" (declared at line 848 of fbcode/third-party-buck/platform010-libcxx/build/libcxx/include/c++/v1/math.h)
            function template "std::__2::enable_if<<expression>, std::__2::__promote<_Tp, _Up, void>>::type pow(_Tp, _Up) noexcept" (declared at line 11308 of fbcode/third-party-buck/platform010/build/cuda/12.4/bin/..//include/crt/math_functions.h)
            argument types are: (double, float)
    return ::pow(base, exp);
           ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161101
Approved by: https://github.com/malfet
2025-08-21 00:55:58 +00:00
3e3e83418d [BE] Move indexing tests to test_indexing (#160994)
Moving the tests enables them on the MPS device.
- xfail all `test_index_reduce` on MPS, as the op is not implemented
- xfail all `test_index_copy` on MPS due to silent correctness problems, see https://github.com/pytorch/pytorch/issues/160993
- Fixed a hard crash in `index_fill` and replaced `skipIfMPS` with `expectedFailureMPS`
- Created an issue for the lack of deterministic algorithms for the MPS backend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160994
Approved by: https://github.com/manuelcandales
ghstack dependencies: #160850, #160889, #160926
2025-08-21 00:42:55 +00:00
667245dc60 TritonKernel.inductor_meta_common() -> self.inductor_meta_common() (#160895)
Summary: use `self.inductor_meta_common()` to call the static method, since the custom subclasses may overwrite the method to be an instance method
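The difference between the two call styles can be seen in a small standalone sketch. The class and method names below are hypothetical stand-ins, not the real `TritonKernel` hierarchy.

```python
class Kernel:
    @staticmethod
    def meta_common():
        return {"source": "base"}

    def meta_by_class(self):
        # Hard-codes the base class: subclass overrides are ignored.
        return Kernel.meta_common()

    def meta_by_self(self):
        # Resolved on the instance: works for the static method and for overrides.
        return self.meta_common()


class CustomKernel(Kernel):
    def __init__(self, tag):
        self.tag = tag

    def meta_common(self):  # overridden as an instance method
        return {"source": "custom", "tag": self.tag}


k = CustomKernel("x")
by_class = k.meta_by_class()  # base result, override skipped
by_self = k.meta_by_self()    # override used, with access to self
```

Calling through `self` still works for the base class, because a static method looked up on an instance is returned as a plain function.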

Test Plan:
```
caffe2/test/inductor:select_algorithm -- test_finalized_subclass_hooks
```

Rollback Plan:

Differential Revision: D80375351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160895
Approved by: https://github.com/eellison, https://github.com/blaine-rister
2025-08-21 00:22:51 +00:00
54c2b66592 Replace _device_t with torch.types.Device in torch/cpu/__init__.py (#161031)
Fixes #152952

Replace `_device_t` with `torch.types.Device` in `torch/cpu/__init__.py`. Did basic smoke test by running tests that `import torch.cpu` including `test/distributed/test_c10d_functional_native.py` and `test/test_decomp.py`.

Based this PR off of #152935 which is referenced in the main issue.

(also, this is my first contribution but I followed the contributing guide closely)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161031
Approved by: https://github.com/janeyx99
2025-08-21 00:22:43 +00:00
be87f22dfb [inductor] Enable updated __cplusplus macro (#161064)
Some Intel oneAPI headers depend on the `__cplusplus` macro.
This PR enables the updated `__cplusplus` macro for MSVC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161064
Approved by: https://github.com/angelayi
2025-08-21 00:17:08 +00:00
2a7a7ad711 [inductor] add level zero for xpu (#161061)
Add level zero for Inductor xpu on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161061
Approved by: https://github.com/angelayi
2025-08-21 00:14:15 +00:00
7e6ce41555 [dcp_poc] add async checkpointing tests (#161034)
Summary: add tests for async checkpointer for the experimental checkpointer

Test Plan:
tests

Rollback Plan:

Differential Revision: D80590461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161034
Approved by: https://github.com/pradeepfn
2025-08-21 00:08:53 +00:00
4ed3184dee Conditionally enable ACL for bmm_out_or_baddbmm_ (#161065)
Summary: Similar to #ifdef checks added in addmm_impl_cpu_ to conditionally enable ACL, we add the same checks in bmm_out_or_baddbmm_. This essentially disables ACL for bmm_out_or_baddbmm_ and enables ArmPL, which seems to be performing better.

Test Plan: AR SL

Rollback Plan:

Reviewed By: Nicoshev

Differential Revision: D80494623

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161065
Approved by: https://github.com/q10
2025-08-20 23:32:25 +00:00
44549c7146 [dynamic shapes] unbacked-safe slicing (#157944)
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157944
Approved by: https://github.com/laithsakka
2025-08-20 22:52:56 +00:00
febfc3ec03 flip the list-as-tuple behavior for short lists (#160794)
Per title. Previously we started throwing noisy warnings; given how popular this pattern was in our test suite, we decided to keep it as a warning for one release rather than make a silent behavior change.
Now `treatSequenceAsTuple` would return `true` only when the sequence is indeed a tuple, so there is no need for a special function anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160794
Approved by: https://github.com/albanD
2025-08-20 22:40:42 +00:00
5afa4187df Close some sources of fake tensor leakages (#159923)
Differential Revision: D79694055

Couple of fixes:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and raise an error using the FQN of the lifted constant.
2. The previous attribute mutation detection logic in non-strict mode didn't account for nested module structure. This fixes a silent incorrectness issue when exporting esm and qwen in non-strict mode.
3. We modify yolov3 to fix the previous silently incorrect behaviour.

When upgrading the torchbench pin, opacus_cifar10 no longer seems to run in eager mode. I verified this by pushing a temporary PR on master with the new pin, so I added it to the expect_fail list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159923
Approved by: https://github.com/avikchaudhuri
2025-08-20 22:24:23 +00:00
30384abcb1 Decrease number of bytes used by uninitialized tokens_ in KernelFunction (#160764)
Use std::unique_ptr to decrease the size from 24 bytes to 8.

Since std::unique_ptr is not copyable, this required defining the copy constructor and copy assignment operator, which made me realize we shouldn't be copying `tokens_` in those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160764
Approved by: https://github.com/albanD
2025-08-20 21:33:27 +00:00
16e811e0b5 [CI] remove tb-nightly (#160996)
Remove tb-nightly because importing tensorboard misbehaves when both tb-nightly and tensorboard are installed: pip reports 2.18.0 (the pinned tensorboard), but importing in a Python shell reports 2.13.XXX. This mismatch breaks tests in a NumPy 2.x environment, e.g.

```
/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler
/opt/venv/lib/python3.12/site-packages/redis/connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
  warnings.warn(msg)
/opt/venv/lib/python3.12/site-packages/google/protobuf/internal/well_known_types.py:91: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
  _EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0)
E
======================================================================
ERROR: test_event_handler (__main__.TestMonitorTensorboard.test_event_handler)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_monitor.py", line 116, in setUp
    from tensorboard.backend.event_processing import (
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 25, in <module>
    from tensorboard.backend.event_processing import (
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/plugin_event_accumulator.py", line 25, in <module>
    from tensorboard.backend.event_processing import event_file_loader
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/event_file_loader.py", line 21, in <module>
    from tensorboard import dataclass_compat
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/dataclass_compat.py", line 33, in <module>
    from tensorboard.plugins.hparams import metadata as hparams_metadata
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/plugins/hparams/metadata.py", line 32, in <module>
    NULL_TENSOR = tensor_util.make_tensor_proto(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/util/tensor_util.py", line 405, in make_tensor_proto
    numpy_dtype = dtypes.as_dtype(nparray.dtype)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py", line 677, in as_dtype
    if type_value.type == np.string_ or type_value.type == np.unicode_:
                          ^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/numpy/__init__.py", line 400, in __getattr__
    raise AttributeError(
AttributeError: `np.string_` was removed in the NumPy 2.0 release. Use `np.bytes_` instead.

----------------------------------------------------------------------
Ran 1 test in 0.355s

FAILED (errors=1)

```
After removing tb-nightly and ensuring that tensorboard 2.18.0 is the only tensorboard in the env:

```
root@rocm-framework-47:/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler
.
----------------------------------------------------------------------
Ran 1 test in 0.409s

OK

```

```
>>> import tensorboard
>>> print(tensorboard.__version__)
2.13.0a20230426
```
```
:/# pip show tensorboard
Name: tensorboard
Version: 2.18.0
Summary: TensorBoard lets you watch Tensors Flow
Home-page: https://github.com/tensorflow/tensorboard
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /opt/venv/lib/python3.12/site-packages
Requires: absl-py, grpcio, markdown, numpy, packaging, protobuf, setuptools, six, tensorboard-data-server, werkzeug
Required-by:

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160996
Approved by: https://github.com/huydhn
2025-08-20 21:25:58 +00:00
19c70c2f3d [pytorch] Faster and safer lambda expression capture in has_integral_tensor() (#161042)
Summary: `includeBool` is already a small value type (`bool`, 1 byte) that is passed by value to the function. Capturing it by reference (4 or 8 bytes depending on the system) is unnecessary and could lead to dangling-reference issues if the lambda outlives the original variable. Capturing by value is more efficient for small types and safer.

Test Plan:
OSS CI & tests

Rollback Plan:

Differential Revision: D80595698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161042
Approved by: https://github.com/Skylion007
2025-08-20 20:59:41 +00:00
8047cde0f3 Try to fix Inductor CI periodic tests (#160932)
- hf_Reformer: started failing because of increased graph breaks caused by the transformers pin bump (#159291). We can likely just bump the expected graph break count.
- dla102: started timing out on Wed 8/13, between commits 6e8865f and ee1b041. Based on the PT2 dashboard, though, this model has no compile-time or runtime regression. Will try bumping the timeout to see if that works.
- hf_BigBird: its accuracy status has improved as of today. Will update the hf_BigBird accuracy status.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160932
Approved by: https://github.com/zou3519, https://github.com/huydhn, https://github.com/malfet
2025-08-20 20:36:46 +00:00
24e7f3c21c [ROCm] fix large tensor sort on MI350 (#161054)
Currently the `std::min` -> `::min` substitution does not work as expected on ROCm when input values are >= 2147483648.

Replace `std::min` with a ternary expression.
Alternatively, `std::min` can be kept with explicit typing: `std::min<int64_t>`.

fixes on ROCm:
test_sort_and_select.py::TestSortAndSelectCUDA::test_sort_large_cuda_float16
error:
RuntimeError: Cannot sort dimension of length 8192

Similar PR to fix large tensors on ROCm https://github.com/pytorch/pytorch/pull/130994
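
One plausible reading of this failure, assuming the `::min` that stands in for `std::min` ends up comparing 32-bit `int` values (the commit does not spell this out), is that inputs at or above 2147483648 wrap to negative numbers before the comparison. The sketch below imitates that in Python with `ctypes`; `min_int32` and `min_int64` are hypothetical helper names, not PyTorch functions.

```python
import ctypes


def min_int32(a: int, b: int) -> int:
    """Imitate a min() whose arguments are narrowed to 32-bit signed ints."""
    a32 = ctypes.c_int32(a).value  # wraps modulo 2**32, like a narrowing cast
    b32 = ctypes.c_int32(b).value
    return a32 if a32 < b32 else b32


def min_int64(a: int, b: int) -> int:
    """Imitate std::min<int64_t>: both arguments keep their 64-bit value."""
    a64 = ctypes.c_int64(a).value
    b64 = ctypes.c_int64(b).value
    return a64 if a64 < b64 else b64


numel = 2147483648  # 2**31, the first value that no longer fits in int32
limit = 8192

print(min_int32(numel, limit))  # -2147483648: the wrapped value wins
print(min_int64(numel, limit))  # 8192
```

With 64-bit operands, whether through an explicit template argument or a ternary on 64-bit values, the comparison sees the real magnitudes.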

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161054
Approved by: https://github.com/jeffdaily
2025-08-20 19:58:01 +00:00
e1a64b75ff [CD] Delete full builds (#161075)
As they are no longer needed for Colab, see https://github.com/googlecolab/colabtools/issues/5508#issuecomment-3200871941 and
[<img width="896" height="128" alt="image" src="https://github.com/user-attachments/assets/a287393c-bde7-4e10-99bf-2e0d66346efe" />
](https://colab.research.google.com/drive/1YJ5Y0xsApXSewM1cQwWQ_AS3A77vytgq)

Fixes https://github.com/pytorch/pytorch/issues/160972
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161075
Approved by: https://github.com/atalman
2025-08-20 19:40:15 +00:00
b708966201 Fix bucketing introducing cycles (#160967)
We were just looking at direct arguments, but not transitive dependencies.
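
The difference between checking direct arguments and checking transitive dependencies can be sketched on a toy graph. This is not the Inductor bucketing code: `deps`, `depends_on`, and `can_bucket` are hypothetical names, and the graph is a plain dict mapping each node to its direct inputs.

```python
def depends_on(deps: dict, node: str, target: str) -> bool:
    """Return True if `node` reaches `target` through any chain of inputs."""
    stack, seen = [node], set()
    while stack:
        cur = stack.pop()
        for parent in deps.get(cur, set()):
            if parent == target:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


def can_bucket(deps: dict, a: str, b: str) -> bool:
    """Merging a and b is safe only if neither transitively depends on the other."""
    return not depends_on(deps, a, b) and not depends_on(deps, b, a)


# c1 -> x -> c2: c2 does not take c1 as a direct argument, but it depends
# on c1 through x, so putting c1 and c2 in one bucket would create a cycle.
deps = {"c1": set(), "x": {"c1"}, "c2": {"x"}, "c3": set()}

print("c1" in deps["c2"])            # False: a direct-argument check misses it
print(can_bucket(deps, "c1", "c2"))  # False: the transitive check catches it
print(can_bucket(deps, "c1", "c3"))  # True
```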

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160967
Approved by: https://github.com/IvanKobzarev
2025-08-20 19:38:46 +00:00
dbef606631 Add support for tracing vmap in pre-dispatch export (#154650)
Summary: The ONNX team and the recent transformer upgrade ran into this error, and we also ran into it during our export benchmarking. This diff makes it possible to trace through the vmap implementation in pre-dispatch IR. Note that we don't support serializing functorch ops in pre-dispatch IR; in the future, we should desugar them to post-grad ops.

The implementation strategy is:
1. We add Python wrappers around the vmap APIs so that we can attach a custom torch function handler that is only on during non-strict export. The reason is that we don't want to add this to the default torch_function handler because it would break BC.
2. Some dynamo changes to make sure it picks up the new Python wrapper APIs. The reason is that when we do strict export, we need to re-materialize these APIs in pre-dispatch IR from torch IR. We could avoid this by special-casing in dynamo for export to proxy different API calls, but I feel that is too much chaos because you need to be able to proxy 2 different variants of the same vmap API.

Test Plan: CI

Differential Revision: D75623875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154650
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-08-20 19:31:07 +00:00
c5cb255625 [inductor][mm] fix tma issue (#161025)
# why

- head is broken

# what

- the template for experimental API is broken
- the test assumes not experimental API

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_regular_mm_persistent_tma_strided_a_transposed_True_b_transposed_False_dynamic_True -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161025
Approved by: https://github.com/PaulZhang12
2025-08-20 18:52:38 +00:00
957b170d8e Fix SVD forward-mode AD multiplication priority (#161027)
Multiplication order priority for the SVD JVP appears to have been the opposite of the optimal one.
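
Why the order matters can be illustrated with the standard multiply-add count for a chain of three matrices. This is a rough sketch only: the shapes are illustrative and are not taken from the SVD JVP formula, and `matmul_cost` and `chain_costs` are hypothetical helper names.

```python
def matmul_cost(m: int, k: int, n: int) -> int:
    """Multiply-add count for an (m x k) @ (k x n) product."""
    return m * k * n


def chain_costs(m: int, k: int, n: int, p: int) -> tuple:
    """Cost of (A @ B) @ C versus A @ (B @ C) for A: m x k, B: k x n, C: n x p."""
    left_first = matmul_cost(m, k, n) + matmul_cost(m, n, p)
    right_first = matmul_cost(k, n, p) + matmul_cost(m, k, p)
    return left_first, right_first


# Tall A (1000 x 100), square B (100 x 100), narrow C (100 x 10).
left, right = chain_costs(1000, 100, 100, 10)
print(left, right)  # 11000000 1100000
```

Both groupings give the same product, but here the second one does a tenth of the work, which is the kind of gap a fixed multiplication priority can leave on the table.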

Results from a crude CPU benchmark on my laptop for random matrices of various ratios:

```
  Performance Results Table

  | Test Case                        | Matrix Size | Aspect Ratio | Before JVP (ms) | After JVP (ms) | Change (ms) | % Change | Status              |
  |----------------------------------|-------------|--------------|-----------------|----------------|-------------|----------|---------------------|
  | Tall matrix (10:1 ratio)         | 1000×100    | 10:1 tall    | 3.13            | 3.24           | +0.11       | -3.5%    |  Regression        |
  | Tall matrix (10:1 ratio, larger) | 2000×200    | 10:1 tall    | 15.72           | 14.66          | -1.06       | +6.7%    |  Improvement       |
  | Tall matrix (10:1 ratio, large)  | 5000×500    | 10:1 tall    | 105.97          | 101.84         | -4.13       | +3.9%    |  Improvement       |
  | Wide matrix (1:10 ratio)         | 100×1000    | 1:10 wide    | 5.90            | 4.64           | -1.26       | +21.4%   |  Major Improvement |
  | Wide matrix (1:10 ratio, larger) | 200×2000    | 1:10 wide    | 18.29           | 17.78          | -0.51       | +2.8%    |  Improvement       |
  | Wide matrix (1:10 ratio, large)  | 500×5000    | 1:10 wide    | 137.40          | 128.70         | -8.70       | +6.3%    |  Improvement       |
  | Square matrix (baseline)         | 1000×1000   | 1:1 square   | 116.16          | 106.09         | -10.07      | +8.7%    |  Improvement       |
  | Square matrix (larger baseline)  | 2000×2000   | 1:1 square   | 714.30          | 673.23         | -41.07      | +5.7%    |  Improvement       |

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161027
Approved by: https://github.com/soulitzer
2025-08-20 18:47:11 +00:00
c02e26bf31 Fix filename showing up as ints in dynamo_compile stack_trace column. (#160916)
Test plan:
$ python -m test_utils

Note:
Another way is adding the actual file_name to from_traceback, but since it's referenced in multiple places and may have associated tests this seems safer. Lmk if changes are needed @c00w

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160916
Approved by: https://github.com/c00w, https://github.com/masnesral
2025-08-20 18:38:38 +00:00
eqy
c74e5f6061 [CUDA] Bump tolerances for test_baddmm (#159915)
Only one mismatch out of the entire result tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159915
Approved by: https://github.com/nWEIdia, https://github.com/drisspg
2025-08-20 18:05:51 +00:00
1471b20cb3 add static dispatch kernel registration to open source (#160439)
Summary: The static dispatch registry should be moved to open source. The rest can be maintained internally for now, since delegates will all go through ET hop.

Test Plan: spot checked existing tests and didn't see any missing registrations

Differential Revision: D80099377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160439
Approved by: https://github.com/SherlockNoMad, https://github.com/zhxchen17
2025-08-20 17:58:00 +00:00
b2632e7982 Fix error message for fsdp_pre_all_gather (#160817)
See: 20e40492b0/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L97-L104)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160817
Approved by: https://github.com/weifengpy, https://github.com/H-Huang
2025-08-20 17:43:57 +00:00
5255e65c01 [dynamo] Refactor convert_frame to remove usage of nonlocal tracer output return. [4/n] (#160899)
Today convert_frame is implemented like the following:
```
def _compile():
    tracer_output = None
    def transform():
        nonlocal tracer_output
        ...
    def _compile_inner():
        transform(...)

    compile_inner(...)
```

The code uses an unconventional nonlocal variable as the return value. This is not ideal for two reasons:
1. Reasoning about the code, especially together with the error handling code, becomes harder.
2. More importantly, this makes it harder to extract common code pieces into a shared library because everything must depend on a central global state.

In this diff we remove the usage of nonlocal return and just use the conventional function return to output the compilation data.
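
The shape of this refactor can be sketched in isolation. The names below (`TracerOutput`, `compile_with_nonlocal`, `compile_with_return`) are hypothetical stand-ins for illustration and do not match the actual dynamo signatures.

```python
from dataclasses import dataclass


@dataclass
class TracerOutput:
    """Hypothetical stand-in for the data produced by tracing."""
    graph: str


def compile_with_nonlocal(code: str) -> TracerOutput:
    """Before: the inner function hands its result out through `nonlocal`."""
    tracer_output = None

    def transform() -> None:
        nonlocal tracer_output
        tracer_output = TracerOutput(graph=f"graph({code})")

    transform()
    assert tracer_output is not None
    return tracer_output


def transform(code: str) -> TracerOutput:
    """After: a free function that returns its result and needs no outer state."""
    return TracerOutput(graph=f"graph({code})")


def compile_with_return(code: str) -> TracerOutput:
    return transform(code)


print(compile_with_nonlocal("f") == compile_with_return("f"))  # True
```

Once `transform` returns its output, it no longer needs to live inside the enclosing function, which is what makes it reusable from other callers.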

Differential Revision: [D80461258](https://our.internmc.facebook.com/intern/diff/D80461258/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160899
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #160814, #160815, #160855
2025-08-20 17:37:26 +00:00
9e050b6339 [dynamo] Refactor convert_frame._compile_inner to return compiled bytecode + output graph. [3/n] (#160855)
We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export).

This PR adds a new helper function compile_frame(), which takes a bytecode and a transform function and returns the compiled bytecode + output graph as a DynamoOutput type.

Differential Revision: [D80430802](https://our.internmc.facebook.com/intern/diff/D80430802/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160855
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #160814, #160815
2025-08-20 17:37:26 +00:00
b3e215b864 Trigger h100 on test_max_autotune, mm, grouped_mm changes (#160678)
Following  @henrylhtsang 's pr here: https://github.com/pytorch/pytorch/pull/160656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160678
Approved by: https://github.com/henrylhtsang, https://github.com/ngimel
2025-08-20 16:56:30 +00:00
e483947047 [BE] Remove intel-openmp dependency in setup.py (#160976)
Fixes #160962

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160976
Approved by: https://github.com/xuhancn, https://github.com/atalman
2025-08-20 16:33:16 +00:00
8e17709055 FlexDecode not guarding on GQA groups correctly (#160904)
Addressing #151359

Updates flex_decode dispatch to use flex attention rather than flex decode if number of groups is not a power of 2
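
A minimal sketch of such a guard, assuming the group count is the number of query heads divided by the number of key/value heads (the usual GQA definition). `is_power_of_two` and `pick_kernel` are hypothetical names, and the real dispatch considers other conditions as well.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; a power of two has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def pick_kernel(num_q_heads: int, num_kv_heads: int) -> str:
    """Hypothetical dispatch: fall back when the GQA group count is not 2**k."""
    groups = num_q_heads // num_kv_heads
    return "flex_decode" if is_power_of_two(groups) else "flex_attention"


print(pick_kernel(32, 8))  # flex_decode (4 groups)
print(pick_kernel(24, 4))  # flex_attention (6 groups)
```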

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160904
Approved by: https://github.com/drisspg
2025-08-20 16:32:16 +00:00
e631557518 Fix meta function for aten.complex (#160894)
Closes https://github.com/pytorch/pytorch/issues/160882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160894
Approved by: https://github.com/mlazos
2025-08-20 16:30:04 +00:00
7f201baf41 Allow exposing more functions during initial template expansion (#159554)
Also adds a `_register_hook` utility, and documents & type annotates `PartialRender`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159554
Approved by: https://github.com/laithsakka, https://github.com/kundaMwiza
2025-08-20 16:08:55 +00:00
ce048de608 [ATen][CPU][Sparse] Use Third-Party Eigen for sparse add and addmm (#155357)
This pull request adds the following ops for sparse matrices using Eigen library:
```python
    add(a_csr, b_csr)
    add(a_csc, b_csc)

    addmm(c_csr, a_csr, b_csr)
    addmm(c_csr, a_csr, b_csc)
    addmm(c_csr, a_csc, b_csc)
    addmm(c_csr, a_csc, b_csr)

    addmm(c_csc, a_csr, b_csr)
    addmm(c_csc, a_csr, b_csc)
    addmm(c_csc, a_csc, b_csc)
    addmm(c_csc, a_csc, b_csr)
```

Currently, the operations for sparse matrices on CPU are available through MKL only. The non-existence of MKL on `aarch64` causes the unavailability of these ops on any machines with ARM based CPUs, including Apple Silicon, AWS Graviton and NVIDIA Grace. This PR addresses this issue by using Eigen as a backend for the above ops.

This is a refactored version of my previous PR #101814. The main difference from the old one is that this does not enable Eigen by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155357
Approved by: https://github.com/pearu, https://github.com/eqy
2025-08-20 15:44:54 +00:00
90ea9ccefe Revert "[rfc] add hint_override kwarg to mark_dynamic (#161007)"
This reverts commit 0533ff2ccba7e77622ac3c6758f1032bdc10feff.

Reverted https://github.com/pytorch/pytorch/pull/161007 on behalf of https://github.com/jeffdaily due to failing on both cuda and rocm ([comment](https://github.com/pytorch/pytorch/pull/161007#issuecomment-3206893756))
2025-08-20 15:31:33 +00:00
6ea4be1e2e Revert "[dynamic shapes] unbacked-safe slicing (#157944)"
This reverts commit 2f0cba934de7094a66c6ce68f5e937254f23142a.

Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/seemethere due to This is blocking internal sync due to merge conflicts ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3206833193))
2025-08-20 15:16:45 +00:00
a818fa77e3 Back out "Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)" (#160999)
Summary: Reverting this diff since it caused S551328. Please see D80217492 for details.

Test Plan:
NA

Rollback Plan:

Differential Revision: D80553314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160999
Approved by: https://github.com/izaitsevfb, https://github.com/jingsh
2025-08-20 15:04:36 +00:00
5ee464db5c [inductor] Fix descriptor broadcasting for singleton dimensions (#160310)
This fixes the case when an input / output contains both zero strides and singleton dimensions. In this case the broadcasting dimensions generated for the descriptor need to ignore dimensions that have zero strides with size 1, otherwise the determination of which dimensions to broadcast will fail.

As an example, consider the following store instruction:

```
name=buf1
index=x2 + 192*y0 + 64*y1
value=TritonCSEVariable('tmp7')
params = BlockParameters(
    shape=[3, 4, 1, 1, 64],
    block_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), 1, 1, XBLOCK],
    strides=[64, 192, 0, 0, 1],
    offsets=[(yoffset//4), ModularIndexing(yoffset, 1, 4), 0, 0, xoffset]
)
broadcasting_dims=[False, False, True, True, False]
broadcast_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), XBLOCK]
```
Because `len(self.broadcasting_dims) != len(self.broadcast_shape)`, dim3 is incorrectly
marked as a broadcast dimension when the pre-broadcast shape is computed in `codegen_broadcast_and_reshape`.

```
pre_broadcast_shape = [
    sympy.S.One if is_broadcasting else dim
    for dim, is_broadcasting in zip(
        self.broadcast_shape, self.broadcasting_dims
    )
]
```

The pre_broadcast_shape is now wrong: `[((YBLOCK + 3)//4), Min(4, YBLOCK), 1]`

Triton throws the following error: `reshape() cannot change total number of elements in tensor`
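
The mismatch can be reproduced outside Inductor with plain lists. The sketch below reuses the values from the store instruction above and keeps block sizes as strings; `pre_broadcast`, `buggy_dims`, and `fixed_dims` are hypothetical names, and the filtering shown only follows the description in this commit message, so the real change may differ in detail.

```python
# Values from the store instruction above, with block sizes kept as strings.
shape = [3, 4, 1, 1, 64]
strides = [64, 192, 0, 0, 1]
broadcast_shape = ["(YBLOCK + 3)//4", "Min(4, YBLOCK)", "XBLOCK"]


def pre_broadcast(broadcast_shape, broadcasting_dims):
    """Replace broadcast dims with 1, mirroring the zip in the listing above."""
    return [
        1 if is_broadcasting else dim
        for dim, is_broadcasting in zip(broadcast_shape, broadcasting_dims)
    ]


# Buggy: every zero-stride dim counts as broadcasting, so there are
# 5 flags for 3 dims and the third flag (True) lands on XBLOCK.
buggy_dims = [s == 0 for s in strides]

# Sketch of the fix: skip dims that have zero stride *and* size 1, so the
# flags line up one-to-one with broadcast_shape.
fixed_dims = [
    s == 0 for s, n in zip(strides, shape) if not (s == 0 and n == 1)
]

print(pre_broadcast(broadcast_shape, buggy_dims))
print(pre_broadcast(broadcast_shape, fixed_dims))
```

The first result drops `XBLOCK` and so has fewer elements than the data, which matches the reshape error; the second keeps all three block dimensions.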

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160310
Approved by: https://github.com/blaine-rister
2025-08-20 09:48:58 +00:00
0533ff2ccb [rfc] add hint_override kwarg to mark_dynamic (#161007)
The motivation for this change can be seen through the following example:

```
import torch

GPU_TYPE = "cuda"

@torch.compile
def no_override(x):
    return x.sum(dim=0)

@torch.compile
def override(x):
    return x.sum(dim=0)

x_small = torch.randn(4096, 512, device=GPU_TYPE)
no_override(x_small)
torch._dynamo.decorators.mark_dynamic(x_small, 0, hint_override=4096 * 1000)
override(x_small)
```

Previously, when reductions were split, codegen relied only on the first observed shape. With a small input, this resulted in a small split size:

```
def triton_per_fused_sum_1(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr):
    xnumel = 512
    r0_numel = 32
```

With the new scheme, inductor honors hint_override during codegen, producing larger and more appropriate split sizes:

```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
    xnumel = 16384
    r0_numel = 128
```

This addresses a broader problem with dynamism: performance and numerics previously depended on whichever shape was seen first. For example:

```
f(s0) -> f(s2)
f(s1) -> f(s2)
```

could generate different kernels. With the new approach, an explicit override pins the chosen configuration:

```
f(s0, hint_override=s0) -> f(s2)
f(s1, hint_override=s0) -> f(s2)
```

ensuring consistent kernel generation regardless of input order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161007
Approved by: https://github.com/jansel
2025-08-20 07:51:09 +00:00
a9fabeb012 [BE] Fix old TMA API in persistent matmul template (#161030)
Summary: Fixes a bug introduced by https://github.com/pytorch/pytorch/pull/159407

Test Plan:
NA

Rollback Plan:

Differential Revision: D80588320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161030
Approved by: https://github.com/adamomainz, https://github.com/NikhilAPatel, https://github.com/nmacchioni, https://github.com/aakhundov
2025-08-20 05:53:57 +00:00
0f801a510f Using std::vector or c10::SmallVector instead of CArray (#160959)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160959
Approved by: https://github.com/Skylion007
2025-08-20 05:32:29 +00:00
576a0e64ed [nativert] ensure that moveable outputs are set in other executionframe ctor (#161005)
Summary:
We use this constructor in HigherOrderKernel. Problems arise in the loop condition, where an output from the previous iteration can be an input to the next, so the Output(N) of a kernel may be the Input(M) to a kernel in the next iteration. Thus, if the output value is reset (via fastresizetozero) or overwritten by a previous kernel before it is used, we have major issues.

we need to enforce that outputs are moved, not copied, to ensure this doesn't happen.

Test Plan:
buck2 test //caffe2/test:test_export --local-only -- test_while_loop_tensor_constant_idx_cpp_runtime_nonstrict

Rollback Plan:

Differential Revision: D80565374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161005
Approved by: https://github.com/SherlockNoMad
2025-08-20 05:05:32 +00:00
a3fe1ced40 [Optimus][decompose_mm] Fix BooleanAtom corner case (#160987)
Summary:
We observed a case where BooleanAtom does not support the regular sum op for a bool expression, so we fix it by using bool().

Rollback Plan:

Differential Revision: D80550876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160987
Approved by: https://github.com/Yuzhen11, https://github.com/mlazos
2025-08-20 04:36:12 +00:00
7e4bfa74ea [vllm hash update] update the pinned vllm hash (#161020)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161020
Approved by: https://github.com/pytorchbot
2025-08-20 04:15:50 +00:00
d8fcb2a4ac [dcp_poc] Fix parameter order in distributed checkpoint API to use path-first for consistency (#160986)
Summary: This commit standardizes the parameter order across PyTorch's experimental distributed checkpoint (DCP) API, changing all checkpoint operations from (state_dict, path) to (path, state_dict) for consistency with standard file I/O patterns.

Test Plan:
sandcastle tests

Rollback Plan:

Differential Revision: D80549014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160986
Approved by: https://github.com/pradeepfn
2025-08-20 04:09:18 +00:00
2b62ef7420 Add kernel information JSON generation for AOTI packages (#160540)
Summary:
Build on D80031559. Generate kernel_information.json in AOTI compiled artifacts by combining stack traces and node mappings from provenance tracking.

This implementation delivers exactly what Zoomer team requested:

**1. Core Function**: `create_kernel_information_json()` in debug.py combines 3 data sources:
- `_inductor_kernel_stack_trace` → `stack_traces` field
- `_inductor_triton_kernel_to_post_grad_node_info` → `post_grad_nodes` field
- `_inductor_post_to_pre_grad_nodes["postToPre"]` → `pre_grad_nodes` field

**2. AOTI Integration**: codecache.py writes `kernel_information.json` to pt2 packages when both AOTI packaging and provenance tracking are enabled.

**3. Test Coverage**: TestKernelInformationAOTI class validates:
- JSON file creation in AOTI packages using zipfile
- Exact format compliance
- Proper disabling without provenance tracking

**Output Format** (exact specification):
```json
{
  "triton_kernel_name_1": {
    "stack_traces": [str, str, ...],
    "post_grad_nodes": [str, str, ...],
    "pre_grad_nodes": [str, str, ...]
  }
}
```
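
A rough sketch of how three such mappings could be merged into this layout. The input shapes are assumptions (kernel name to list of stack traces, kernel name to list of post-grad nodes, post-grad node to list of pre-grad nodes), and `combine_kernel_information` is a hypothetical helper, not the `create_kernel_information_json()` function in debug.py.

```python
import json


def combine_kernel_information(stack_traces, kernel_to_post_grad, post_to_pre):
    """Merge three per-kernel mappings into the documented JSON layout."""
    info = {}
    for kernel, post_nodes in kernel_to_post_grad.items():
        pre_nodes = []
        for node in post_nodes:
            # Keep order and drop duplicates when post-grad nodes share a source.
            for pre in post_to_pre.get(node, []):
                if pre not in pre_nodes:
                    pre_nodes.append(pre)
        info[kernel] = {
            "stack_traces": list(stack_traces.get(kernel, [])),
            "post_grad_nodes": list(post_nodes),
            "pre_grad_nodes": pre_nodes,
        }
    return info


example = combine_kernel_information(
    stack_traces={"triton_kernel_name_1": ["model.py:10 in forward"]},
    kernel_to_post_grad={"triton_kernel_name_1": ["addmm"]},
    post_to_pre={"addmm": ["linear"]},
)
print(json.dumps(example, indent=2))
```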

Test Plan:
```
buck test fbcode//caffe2/test/inductor:provenance_tracing -- TestKernelInformationAOTI
```

Manual validation:
```python
import torch
model = torch.nn.Linear(10, 1)
with torch._inductor.config.patch("aot_inductor.package", True):
    with torch._inductor.config.patch("trace.basic_provenance_tracking", True):
        # AOTI compilation should generate kernel_information.json
        compiled = torch.export.export(model, (torch.randn(1, 10),))
```
---

Rollback Plan:

Differential Revision: D80139160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160540
Approved by: https://github.com/yushangdi
2025-08-20 02:33:45 +00:00
54cc63b467 [BE][Dynamo] Type coverage for symbolic_convert (#160922)
As part of better engineering, we add type coverage to `dynamo/symbolic_convert.py`, which is the main work engine of dynamo for emulating python bytecode.

Running
```
mypy torch/_dynamo/symbolic_convert.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  764 | 4286 | 17.83% | 43 | 241 | 17.84% |
| This PR | 4322 | 4322 | 100.00% | 241 | 241 | 100.00% |
| Delta    | +3558 | +36 | +82.17% | +198 | 0 | +82.16% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160922
Approved by: https://github.com/StrongerXi
2025-08-20 01:24:31 +00:00
599f639ddb [dynamo] Refactor transform() so that instruction translator can be used as a tracing function. [2/n] (#160815)
We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export).

This PR follows the last one, which separates out the part that runs the instruction translator on a given frame and returns a DynamoTracerOutput.

The end result is a free function that runs the instruction translator independently. A follow-up diff will wrap the low-level function.

Differential Revision: [D80388694](https://our.internmc.facebook.com/intern/diff/D80388694/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160815
Approved by: https://github.com/anijain2305
ghstack dependencies: #160814
2025-08-20 01:16:35 +00:00
72e4786d16 [dynamo][dist] trace DeviceMesh's get_local_rank and get_rank as constants (#160805)
Used in https://github.com/pytorch/torchtitan/pull/1555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160805
Approved by: https://github.com/StrongerXi, https://github.com/mlazos
2025-08-20 01:12:24 +00:00
371909cfd1 [Inductor][CPP] Add float16 support for CppMicroGemmAMX (#147368)
Add float16 support for CppMicroGemmAMX for float16 gemm template. Float16 CppMicroGemmAMX needs a higher version of compiler, e.g., GCC 13.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147368
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-08-20 01:04:05 +00:00
78a8e6a671 Add new_empty (with dtype argument only) to torch::stable (#159508)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159508
Approved by: https://github.com/janeyx99
ghstack dependencies: #160557
2025-08-20 00:50:42 +00:00
543896fcf3 test_matmul_cuda: Refine MX test skipping (#161009)
Replace return unittest.skip with raise unittest.SkipTest to ensure that the test suite correctly reports skipped tests.
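
The difference is easy to see with the standard library alone: `unittest.skip(...)` only builds a decorator, so returning it from a test body records nothing, while raising `unittest.SkipTest` is what the runner counts as a skip. The test classes and `count_skips` helper below are illustrative names, not code from test_matmul_cuda.

```python
import unittest
import warnings


class ReturnSkip(unittest.TestCase):
    def test_feature(self):
        # Builds a decorator object and returns it; nothing is raised.
        return unittest.skip("unsupported hardware")


class RaiseSkip(unittest.TestCase):
    def test_feature(self):
        # Raises the exception that the runner records as a skip.
        raise unittest.SkipTest("unsupported hardware")


def count_skips(case: type) -> int:
    """Run one TestCase class and return how many tests were reported skipped."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    with warnings.catch_warnings():
        # Newer Python versions warn about tests that return a value.
        warnings.simplefilter("ignore")
        suite.run(result)
    return len(result.skipped)


print(count_skips(ReturnSkip))  # 0
print(count_skips(RaiseSkip))   # 1
```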

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161009
Approved by: https://github.com/jeffdaily
2025-08-20 00:47:45 +00:00
a3a82e3da8 [FSDP][Replicate] replicate tests for param registration and input device movements (#160147)
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. To this end, I have added three test cases, one to test input device movement and the other two to test parameter registration during the forward and backward pass of a model.

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_root_move_forward_input_to_device
2. pytest test/distributed/_composable/test_replicate_training.py -k TestReplicateRegisteredParams

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160147
Approved by: https://github.com/weifengpy
ghstack dependencies: #160135, #160136
2025-08-20 00:47:00 +00:00
9d7cecdd6c [SymmMem] Support rendezvous on view of a tensor (#160925)
`tensor.view` shares the same `data_ptr()` as the original tensor, and thus cannot serve as a key to rendezvous' map (we want a 1:1 match between handle and tensor, so we need a unique key).

@ezyang suggests using the raw `TensorImpl*` of a tensor, for which `tensor.view` would have a different value than the original tensor.

But the raw `TensorImpl*` can be stumbled on again when a previous tensor gets deallocated and a new one is allocated. For that reason, we'd also need to use a `weak_intrusive_ptr` to distinguish the two tensors, i.e. for the deallocated tensor, `weak_intrusive_ptr::expired()` would return true.
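
The same idea has a loose Python analogue, shown here only to illustrate the reasoning: `id()` plays the role of the raw pointer (it can be reused after an object is freed), and `weakref.ref` plays the role of the weak pointer that tells a live object from a recycled address. `Buffer` and `HandleRegistry` are hypothetical names, not part of the SymmMem API.

```python
import gc
import weakref


class Buffer:
    """Hypothetical stand-in for a tensor object."""


class HandleRegistry:
    """Map objects to handles by address, guarded by a weak reference."""

    def __init__(self) -> None:
        self._entries = {}

    def register(self, obj: Buffer, handle: str) -> None:
        self._entries[id(obj)] = (weakref.ref(obj), handle)

    def lookup(self, obj: Buffer):
        entry = self._entries.get(id(obj))
        if entry is None:
            return None
        ref, handle = entry
        # A dead or mismatched reference means the address was recycled.
        return handle if ref() is obj else None


registry = HandleRegistry()
old = Buffer()
registry.register(old, "handle-0")
print(registry.lookup(old))  # handle-0

stale_ref = registry._entries[id(old)][0]
del old
gc.collect()
print(stale_ref() is None)  # True: the entry has expired
```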

Added `test_rendezvous_view` and `test_rendezvous_same`.

Note: the view support has been added to NVSHMEM backend and NCCL backend. For CUDA backend, I have yet to investigate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160925
Approved by: https://github.com/ngimel
ghstack dependencies: #160825
2025-08-19 23:49:25 +00:00
0d19541284 fabric detection - fix build on an old toolkit (#160984)
Fixes #160960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160984
Approved by: https://github.com/eqy
2025-08-19 23:43:36 +00:00
eqy
e836323a23 [FP8][cuBLAS][SM100] cuBLAS doesn't support rowwise-scaling on sm100 (#160693)
See also: https://docs.nvidia.com/cuda/cublas/#id93

Only tensor-wide scales and 1D scales with tiled layout are supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160693
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2025-08-19 23:22:51 +00:00
512fc768e9 Add tlparse artifact for joint graph passes (for inference & non-freezing only) (#160589)
Summary:
Joint graph passes run several FX passes which can modify the graph before it hits Inductor.

There are three uses of joint graph passes:
- **for inference & not freezing** (we add structured logging only for this)
- for inference & freezing
- for fw/bw split

Rollback Plan:

Reviewed By: yushangdi

Differential Revision: D80130321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160589
Approved by: https://github.com/yushangdi
2025-08-19 23:18:40 +00:00
a7b5955ea8 [ContextParallel] add Document Masking test (#160700)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #160700

**Summary**
add test case to CP + FlexAttention for Document Masking

**Test**
`pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention_document_mask`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160700
Approved by: https://github.com/fegin
2025-08-19 23:03:18 +00:00
e83825f91c Revert "handling special case for pow(3) for GPU (#157537)"
This reverts commit 05e8fac4f374c4dbf0cd0e85e925e9112cf234a2.

Reverted https://github.com/pytorch/pytorch/pull/157537 on behalf of https://github.com/malfet due to This is really really bad from performance point of view, wonder if any benchmarks will detect that ([comment](https://github.com/pytorch/pytorch/pull/157537#issuecomment-3202661810))
2025-08-19 22:57:45 +00:00
33c3794533 [dynamic shapes] use prims_common contiguity in create_example_tensors (#160933)
Summary: forward fix T234739699

Test Plan:
T234739699

Rollback Plan:

Differential Revision: D80503451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160933
Approved by: https://github.com/henrylhtsang
2025-08-19 22:43:13 +00:00
8f766d6839 Add ScalarType -> shim conversion, add stable::Tensor.scalar_type (#160557)
TL;DR: Moving to ScalarType in user extensions and removing deprecated dtypes.

This change _modifies_ the from/to behavior between ScalarType and StableValue! Whereas before, user extensions could only in abstract pass around obfuscated dtypes appearing as int32_ts, now, users can confidently use torch::headeronly::ScalarType in their extensions for major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the ScalarType enum values change in the future, user extensions need not fear.

Then we add a Tensor scalar_type API which reuses the from/to logic to return to the user a nice ScalarType (vs an abstracted int32_t).

I then changed the test to test the scalar_type API.

This code change required some refactoring because of circular dependencies.

## BC Breaking note
This commit is (narrowly) BC-breaking for unpopular dtypes: `quint*`s, `qint*`s, `Bits*`, `dummy_uint*`s, `dummy_int*`s, `Float8_e8m0fnu`, and `Float4_e2m1fn_x2` in the narrow use case where an extension retrieves a Tensor dtype of the above and passes it into `aoti_torch_call_dispatcher`. As of now, I believe there are 0 users of this use case, so the benefits of this change significantly justify BC-breaking this API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160557
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
2025-08-19 22:13:47 +00:00
05e8fac4f3 handling special case for pow(3) for GPU (#157537)
follows #152373

Special case for pow(3):
Similar to the [CPU kernel](d27d36136c/aten/src/ATen/native/cpu/PowKernel.cpp (L64)), added corresponding GPU code for numerical stability.

issue #150951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157537
Approved by: https://github.com/soulitzer
2025-08-19 21:57:08 +00:00
f90ccad165 [export] Relax FC requirement of serde.deserialize by allowing unknown fields. (#160918)
Summary:
Previously we passed all serialized data to the dataclass ctors.
Now we just loop over the existing fields in the dataclass and fetch only the fields we need to run the ctor.

This should help in the case where we are deserializing a buffer that contains a new field.
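
A minimal sketch of the idea using only the standard `dataclasses` module. `TensorMetaSketch` and `from_dict_lenient` are hypothetical names for illustration, not the actual serde code: the constructor receives only the keys that match declared fields, so unknown keys written by a newer serializer are ignored.

```python
import dataclasses


@dataclasses.dataclass
class TensorMetaSketch:
    """Hypothetical schema type used only for this illustration."""
    name: str
    ndim: int = 0


def from_dict_lenient(cls, data):
    """Build `cls` from `data`, ignoring keys that are not dataclass fields."""
    kwargs = {
        f.name: data[f.name]
        for f in dataclasses.fields(cls)
        if f.name in data
    }
    return cls(**kwargs)
```

Passing the whole dictionary with `cls(**data)` would instead raise a `TypeError` as soon as the payload carried a field the reader does not know about.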

Test Plan:
CI

Rollback Plan:

Differential Revision: D80487716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160918
Approved by: https://github.com/angelayi
2025-08-19 21:54:46 +00:00
35e4d97e04 [dynamo] Support builtin complex with constant args (#160799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160799
Approved by: https://github.com/guilhermeleobas, https://github.com/mlazos
2025-08-19 20:38:54 +00:00
66166cf1e7 preserve node meta to fix inductor generated kernel name for pattern matched graphs (#160542)
Summary:
When using the inductor pattern matcher to replace graphs, the graph generated by the replacement function can be missing `original_aten` metadata for the replaced nodes. This in turn results in inductor failing to generate a sensible kernel name, e.g. `tri_poi_fused_0`, which is missing the aten op name.

This diff attempts to fix that by allowing tracing the graph in replacement function with `preserve_node_meta`. Included this as an option to turn on in `pattern_matcher.fwd_only` function.

Can confirm that with the fix, MTIA's pattern matcher replaced original graph with a node that has original_aten meta, and inductor generated kernel name has op name.

Test Plan:
added kernel_name check to afg_inductor_test silu test

Rollback Plan:

Differential Revision: D80183670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160542
Approved by: https://github.com/eellison, https://github.com/bdhirsh
2025-08-19 20:32:17 +00:00
eba20d2d74 Revert "[WIP] Merge Test (#160998)"
This reverts commit ef761c43538abae5bccc0c4b6ebaf42ff676db7a.

Reverted https://github.com/pytorch/pytorch/pull/160998 on behalf of https://github.com/ZainRizvi due to Undoing test merge ([comment](https://github.com/pytorch/pytorch/pull/160998#issuecomment-3202125839))
2025-08-19 20:30:39 +00:00
ef761c4353 [WIP] Merge Test (#160998)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160998
Approved by: https://github.com/ZainRizvi
2025-08-19 20:26:07 +00:00
1ea918caf9 [C10D] Make MultiProcContinuousTest less spammy (#160821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160821
Approved by: https://github.com/fduwjj
ghstack dependencies: #160892
2025-08-19 20:17:19 +00:00
779fc29c04 [C10D] Fix spelling of MultiProcContinuousTest (#160892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160892
Approved by: https://github.com/fduwjj
2025-08-19 20:17:19 +00:00
ed8bcccf31 [BE][Ez]: Update ruff to 0.12.9 (#160896)
Updates ruff. Fixes false positives and other miscellaneous ruff linting and formatting fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160896
Approved by: https://github.com/zou3519
2025-08-19 19:56:24 +00:00
9d9cc9897a [SymmMem] Support rendezvous on slice of a tensor (#160825)
When we search for a NVSHMEM allocation backing a tensor, don't limit it to an exact match between `tensor.data_ptr()` and `allocation.base_ptr`. Instead, test whether the former is within an allocation range, i.e. [base_ptr, base_ptr + size).
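
The range test can be sketched in a few lines of Python. `Allocation` and `find_allocation` are hypothetical names, and pointers are modeled as plain integers; this is an illustration of the half-open interval check, not the NVSHMEM backend code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Allocation:
    """Hypothetical record of one symmetric-memory allocation."""
    base_ptr: int
    size: int


def find_allocation(allocs: Iterable[Allocation], data_ptr: int) -> Optional[Allocation]:
    """Return the allocation whose range [base_ptr, base_ptr + size) holds data_ptr."""
    for alloc in allocs:
        if alloc.base_ptr <= data_ptr < alloc.base_ptr + alloc.size:
            return alloc
    return None
```

With this check a slice that starts in the middle of an allocation still resolves to it, while the address one past the end does not.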

This PR also squashed in original base PR #160795:
Since (i) `handle = rendezvous(tensor)`, and (ii) we pass `handle->buffer_ptrs` to kernels, `handle` should carry the `data_ptr()` of tensor instead of the base address of a memory allocation (previous case).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160825
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-08-19 19:08:45 +00:00
65d21dae18 [inductor] dont reuse buffers if it affects peak (#145883) (#159530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159530
Approved by: https://github.com/eellison
2025-08-19 19:02:56 +00:00
62db8ec391 windows python 3.14 nightly builds (#159869)
Related to https://github.com/pytorch/pytorch/issues/156856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159869
Approved by: https://github.com/malfet, https://github.com/williamwen42
2025-08-19 18:36:16 +00:00
5dad5b4f57 [AIDIR] Revise the insight content (#160649)
Summary:
Make it more descriptive and understandable to users.

Rollback Plan:

Differential Revision: D80218659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160649
Approved by: https://github.com/jingsh
2025-08-19 18:04:49 +00:00
fab5dac734 Tweak dependabot to run inductor jobs (#160935)
After https://github.com/pytorch/pytorch/pull/160635, I can see dependabot creating the PR to bump the `transformers` version at https://github.com/pytorch/pytorch/pull/160807. This is a good start, but there are several tweaks we need:

1. Run inductor tests on the PR including one round of perf benchmark, which is always needed.  So, we need `ciflow/inductor` label and a `pull_request` trigger for the benchmark
2. Per @anijain2305 feedback, we don't need to update patch version.  So, I add a rule to ignore it.  Again, we would need to test this out after this lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160935
Approved by: https://github.com/anijain2305
2025-08-19 17:56:07 +00:00
a44a0d3671 [MPS] Fix index_add for complex + int64 (#160926)
By re-using deterministic algorithm from
bbc7c03e93/aten/src/ATen/native/cuda/Indexing.cu (L1106-L1113)

Fixes https://github.com/pytorch/pytorch/issues/160845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160926
Approved by: https://github.com/manuelcandales
ghstack dependencies: #160850, #160889
2025-08-19 17:43:06 +00:00
2f0cba934d [dynamic shapes] unbacked-safe slicing (#157944)
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157944
Approved by: https://github.com/laithsakka
2025-08-19 17:32:47 +00:00
0a5ab612dd Port amax to stable ABI (#160214)
To enable porting torchaudio to the stable ABI, we need the `amax` operation to be accessible. This PR ports the op and provides tests that it behaves correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160214
Approved by: https://github.com/mikaylagawarecki
2025-08-19 17:24:53 +00:00
1fbe230b0d forward fix #160747 (#160981)
broke rocm inductor tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160981
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-19 17:16:41 +00:00
eddaaa6c2a Revert "Recheck Autotune cache on Precompile serialization to prune compilation results (#158656)"
This reverts commit 664005662ad8c9aa1942015397048aa9ca14fd6d.

Reverted https://github.com/pytorch/pytorch/pull/158656 on behalf of https://github.com/seemethere due to failing internal tests, see D80486843 ([comment](https://github.com/pytorch/pytorch/pull/158656#issuecomment-3201491561))
2025-08-19 16:53:20 +00:00
fecc5f6001 [codemod] Fix unused-local-typedef issue in caffe2/aten/src/ATen/native/cuda/CUDALoops.cuh +2 (#160944)
Summary:
LLVM has a warning `-Wunused-local-typedef` which we are enabling to remove unused code. This has the side-effect of making it easier to do refactors such as removing unnecessary includes.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan:
Sandcastle

Rollback Plan:

Differential Revision: D80511128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160944
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-08-19 16:49:29 +00:00
f305019377 [inductor] propagate shapes in CSEVariable (#152198)
Fixes #149905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152198
Approved by: https://github.com/eellison
2025-08-19 16:46:38 +00:00
50cfe76231 Update checkpoint warning to target PyTorch 2.9 (#160725)
Follow-up to #160534. Fixes the docstrings and the warning in checkpoint_sequential, which presumably should have the same deprecation notice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160725
Approved by: https://github.com/soulitzer
2025-08-19 15:08:50 +00:00
9225c61994 Move save guard error throwing to separate phase (#160662)
This diff moves the portion of guard saving that can throw out of GuardBuilder and into `serialize_guards`. This lets me add a try/catch around it for caching precompile later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160662
Approved by: https://github.com/zhxchen17
2025-08-19 14:46:43 +00:00
e3ebf364e6 Revert "Use numpy 1.26.2 for Python 3.9 and 3.10 (#160836)"
This reverts commit 5d9653d90ee003173dd03f93e09fed236500ef06.

Reverted https://github.com/pytorch/pytorch/pull/160836 on behalf of https://github.com/malfet due to It broke inductor tests by improving them ([comment](https://github.com/pytorch/pytorch/pull/160836#issuecomment-3200834103))
2025-08-19 13:46:53 +00:00
284b719005 Remove the uncessary empty file (#160728)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160728
Approved by: https://github.com/Skylion007
2025-08-19 10:54:08 +00:00
daeb3a6094 Using std::make_unique<T>() instead of unique<T>(new T()) (#160723)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160723
Approved by: https://github.com/Skylion007
2025-08-19 10:25:47 +00:00
cyy
5d9653d90e Use numpy 1.26.2 for Python 3.9 and 3.10 (#160836)
Because numpy 1.22.4 had reached EOL 3 years ago.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160836
Approved by: https://github.com/malfet
2025-08-19 09:15:06 +00:00
df60736410 [BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747)
Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs.

Test Plan:
Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda`

Rollback Plan:

Differential Revision: D80348643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747
Approved by: https://github.com/NikhilAPatel
2025-08-19 07:32:55 +00:00
8f31aa97a3 [dynamo] [guard] Add caching for inside torch.compile.disable function to avoid unnecessary recompilation. (#160934)
Fixes #157399
cherry pick of d6a5c03

@mlazos

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160934
Approved by: https://github.com/mlazos
2025-08-19 06:01:26 +00:00
29afde2020 [CD] Build libtorch without nvshmem (#160910)
It was done once for cuSparseLT in f01d7105b1 , now it's nvShmem's time

Fixes https://github.com/pytorch/pytorch/issues/160762
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160910
Approved by: https://github.com/Skylion007
2025-08-19 05:58:25 +00:00
8dbe7f99bd [BE][inductor] tl.dot(..., allow_tf32=...) -> tl.dot(..., input_precision=...) (#160711)
allow_tf32 is deprecated. Also, this will make it easier to support tf32x3 (i.e. #160359).

dashboard results on h100 show no change: [inference](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2011%20Aug%202025%2017%3A01%3A22%20GMT&stopTime=Mon%2C%2018%20Aug%202025%2017%3A01%3A22%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/399/orig&lCommit=ce12d0fd751a733f22b5bdda00bd58d323e0a526&rBranch=main&rCommit=e444cd24d48b3a46f067974f2cc157f5ed27709f), [training](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2011%20Aug%202025%2017%3A01%3A22%20GMT&stopTime=Mon%2C%2018%20Aug%202025%2017%3A01%3A22%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/399/orig&lCommit=ce12d0fd751a733f22b5bdda00bd58d323e0a526&rBranch=main&rCommit=e444cd24d48b3a46f067974f2cc157f5ed27709f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160711
Approved by: https://github.com/PaulZhang12, https://github.com/njriasan
2025-08-19 05:27:10 +00:00
1d46aa736f [audio hash update] update the pinned audio hash (#160930)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160930
Approved by: https://github.com/pytorchbot
2025-08-19 04:22:55 +00:00
2cf69fe0e1 [vllm hash update] update the pinned vllm hash (#160929)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160929
Approved by: https://github.com/pytorchbot
2025-08-19 04:22:45 +00:00
923bc46122 fix mul.Scalar with strided tensor (#160560)
Summary: The out variant has to be strided like self. Since memory format isn't provided, this should be equivalent.

Test Plan:
Previously, when we enabled static dispatch, this test would have numeric issues:
```
buck2 test //caffe2/test:test_export -- test__scaled_dot_product_flash_attention_cpp_runtime_nonstrict --print-passing-details
```

Rollback Plan:

Reviewed By: SherlockNoMad

Differential Revision: D80191085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160560
Approved by: https://github.com/SherlockNoMad
2025-08-19 04:15:12 +00:00
58f9a3dd63 [ez] Only use default numa bindings if nproc == cuda device count (#160848)
# Context
Another fix to enable broad rollout of #149334.

The implementation assumes that the trainer process with local rank `n` only uses device `cuda:n`. However, there are sometimes jobs with more than one GPU per process, in which case our assumption could be incorrect and actually lead to worse memory locality.

# This PR
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160848
Approved by: https://github.com/kiukchung
2025-08-19 02:50:01 +00:00
a391fa1c42 Make Inductor benchmarker more compatible with Triton do_bench (#160921)
Common benchmark suites like TritonBench use `triton.testing.do_bench` for kernel timing measurement, which is not always fair for all backends. E.g. it includes torch.compile Dynamo invocation overhead and hence doesn't reflect the real-world model use case where Dynamo overhead is usually hidden.

I also opened a PR to use this timing measurement function on the TritonBench side: https://github.com/meta-pytorch/tritonbench/pull/333. But regardless of whether that PR can land, I think we should enhance Inductor benchmark_gpu to match do_bench features, to make it easier for people to migrate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160921
Approved by: https://github.com/BoyuanFeng
2025-08-19 02:40:21 +00:00
209143ddeb [while_loop][inductor] fix aliased inputs by cloning (#160668)
[fx_graph_cse](https://github.com/pytorch/pytorch/blob/main/torch/_functorch/compile_utils.py#L46) is executed in min_cut partitioner which accidentally creates the aliasing for empty buffers and we could see the following graph node for joint graph with cmd: "pytest test/functorch/test_control_flow.py -k test_scan_multiple_layers_gradient_layers_2_device_cpu"
```python
while_loop = torch.ops.higher_order.while_loop(while_loop_cond_graph_0_0, while_loop_body_graph_0_0, (full_default_4, empty_strided_default, full_default_2, full_default_3, full_default_2, full_default_3, full_default, full_default, rev, rev_1, rev_2, rev_3), (primals_4, primals_5, primals_6, primals_7));
```

Notice the operand sequence **"full_default_2, full_default_3, full_default_2, full_default_3, full_default, full_default"**, which indicates that the gradients of different layers now share the same buffer, which creates silent incorrectness.

Fixes https://github.com/pytorch/pytorch/pull/158168.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160668
Approved by: https://github.com/zou3519
ghstack dependencies: #160548, #160374
2025-08-19 02:33:59 +00:00
b1380f434d [CD] Disable USE_MPI in XPU CI/CD wheel build (#159135)
The XPU wheel build needs to source MPI for the distributed XCCL backend build, but doing so also enables USE_MPI by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159135
Approved by: https://github.com/malfet
2025-08-19 02:32:03 +00:00
e6e45e6ae8 [FSDP] Use post_reduce_stream.record_event() on hsdp+cpuoffload (#160481)
Fixes https://github.com/pytorch/pytorch/issues/160291
`post_reduce_stream` is `all_reduce_stream` during HSDP, but the CPU-GPU sync is hard-coded to `reduce_scatter_stream`.
The hard-coding could fail a unit test on HSDP + CPU offload; this PR adds that unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160481
Approved by: https://github.com/weifengpy
2025-08-19 02:20:14 +00:00
3d126e17e0 [FSDP][Collectives] skipping reduce_scatter when world size is 1 (#160136)
**Summary:** In its current state, FSDP collectives use CUDA synchronizations and communication ops regardless of the world size. However, now that replicate will use FSDP, there will be instances where group size = 1 and these synchronizations and ops will be used needlessly. I have updated fsdp_collectives to skip reduce_scatter in the foreach_reduce API when world_size = 1. I have edited a test that uses CommDebugMode to verify that the reduce_scatter has been removed. I also edited an affected test which used 1-way FSDP by verifying and changing its assert statements for CommDebugMode. I have also added a test command.

**Test Cases**
1. pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_single_worldsize1
2. pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_tp_with_fsdp_offloading

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160136
Approved by: https://github.com/weifengpy
ghstack dependencies: #160135
2025-08-19 02:13:30 +00:00
8d15af2320 [PT2]: Allow None for wrapped_fbgemm_linear_fp16_weight (#160802)
Summary: Currently the implementation of [fbgemm_linear_fp16_weight](https://www.internalfb.com/code/fbsource/[ffe8ba561cb6af33fde5b32c27411d6d3f4f2c70]/fbcode/caffe2/aten/src/ATen/native/QuantizedLinear.cpp?lines=477) does not allow None for `bias`, but it's actually a valid case, and internally `fbgemm_linear_fp16_weight_fp32_activation` accepts a None bias as well. For BC reasons, we can't directly change the function signature, so we wrap an empty tensor when bias is None to work around it in Sigmoid.

Test Plan:
P1906210273
```
MODEL_TYPE=dpa_product_first_ctr_model
MODEL_ENTITY_ID=778442870
SNAPSHOT_ID=6
MODULE=user
SUFFIX=.predictor.precompute.remote_request_only

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice="" --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true --benchmarkNumIterations=10000 &>  ~/logs/${MODEL_TYPE}/load_net_predictor_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}_${MODULE}
```

Rollback Plan:

Reviewed By: henryoier, hl475

Differential Revision: D80382652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160802
Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier
2025-08-19 01:46:53 +00:00
e9209e0854 [dynamo] Refactor tracer logic in convert_frame so that it doesn't leak to outer layer. [1/n] (#160814)
We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export).

One incremental step we can take is to refactor out InstructionTranslator as a functional piece providing bytecode tracing.

To separate out this part, we notice that the tracer object is currently passed around throughout the convert frame compile function. This is not ideal because we want to build a boundary between tracing and the downstream compiler stack. Ideally, we should extract all the relevant information out of the tracer object and return a new data structure that is free of the internal state of InstructionTranslator.

Luckily, not much data from the tracer is used after tracing is finished. The major piece is OutputGraph; other than that, we only need to record two boolean flags for error handling purposes.

The new type we're adding is called DynamoTracerOutput, which contains all the information needed by torch.compile internals after symbolic convert is finished. To simplify the current PR, we leave out the part which reduces OutputGraph to a minimal set, since this can be done in a separate PR.

Differential Revision: [D80388693](https://our.internmc.facebook.com/intern/diff/D80388693/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160814
Approved by: https://github.com/tugsbayasgalan
2025-08-19 01:46:24 +00:00
4cb31015f2 [dynamic shapes] prims_common non_overlapping_and_dense (#160462)
Differential Revision: D80120333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160462
Approved by: https://github.com/laithsakka
2025-08-19 01:35:28 +00:00
5e98d9f9ba Revert "[dynamic shapes] unbacked-safe slicing (#157944)"
This reverts commit 56218d85e2da09d9ede3809718ec989c2151632c.

Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think this is failing test_draft_export in trunk 56218d85e2 ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3198874677))
2025-08-19 01:16:17 +00:00
5cf6567c1f [Inductor] add cuda compile cmd to autotuning logging (#160906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160906
Approved by: https://github.com/henrylhtsang
2025-08-19 01:14:46 +00:00
41b3e80a55 Fix duplicated kernel name in kernel stack trace tracking (#160905)
Summary: as title. When we have two kernels with the same name, the stack traces should be appended, not overwritten.
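
The append-versus-overwrite behavior can be illustrated with a small standalone sketch. `collect_stack_traces` and the record format are hypothetical and not the provenance-tracking code itself: each kernel name maps to a list, so a second kernel with the same name adds to the list instead of replacing the first entry.

```python
from collections import defaultdict


def collect_stack_traces(records):
    """Group (kernel_name, stack_trace) pairs, keeping every trace per name."""
    traces = defaultdict(list)
    for kernel_name, stack_trace in records:
        # Append rather than assign, so duplicate kernel names keep all traces.
        traces[kernel_name].append(stack_trace)
    return dict(traces)
```

A plain `traces[kernel_name] = stack_trace` assignment would silently keep only the last trace for a repeated name.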

Test Plan:
```
 buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing
```

Rollback Plan:

Differential Revision: D80472731

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160905
Approved by: https://github.com/angelayi
2025-08-19 01:14:34 +00:00
b6852778ff Add Magma build for CUDA 13.0 (#160770)
Add magma build for CUDA 13.0 after almalinux docker is available

https://github.com/pytorch/pytorch/issues/159779
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160770
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-08-19 01:10:00 +00:00
1853f71b4f [Fix XPU CI][Inductor UT] Fix test cases broken by community. (#160403)
Fixes #160243, Fixes #160244, Fixes #160245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160403
Approved by: https://github.com/janeyx99
2025-08-19 00:54:51 +00:00
bbc7c03e93 Fix UndefinedGrad::apply (#160572)
The function incorrectly reserved space in the input parameter instead of the output parameter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160572
Approved by: https://github.com/soulitzer
2025-08-19 00:15:51 +00:00
dc200066cf [ONNX] Use onnxruntime 1.22 in CI (#160924)
Use onnxruntime 1.22 in CI to enable testing of newer opsets and IR versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160924
Approved by: https://github.com/titaiwangms
2025-08-19 00:05:26 +00:00
56218d85e2 [dynamic shapes] unbacked-safe slicing (#157944)
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157944
Approved by: https://github.com/laithsakka
2025-08-18 22:38:16 +00:00
0254646654 harden fabric checks for symmetric memory (#160790)
Currently we only check that fabric allocation succeeded, but sometimes we fail during export or import afterwards, with no recourse. This PR checks the full cycle before attempting to allocate memory with the fabric.
TODO: move it to c10/cuda so that it can be used from CUDACachingAllocator too

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160790
Approved by: https://github.com/Skylion007
2025-08-18 22:35:50 +00:00
b439675ae2 [nativert] oss pass graph pass registration (#160859)
Summary: att

Test Plan:
CI

Rollback Plan:

Differential Revision: D80368343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160859
Approved by: https://github.com/georgiaphillips
2025-08-18 22:23:38 +00:00
82c7a1eb4b Revert "[ONNX] Default to dynamo export (#159646)"
This reverts commit 11b6ceb7b4f81ba02f88652136a93d685c399191.

Reverted https://github.com/pytorch/pytorch/pull/159646 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/159646#issuecomment-3198507767))
2025-08-18 21:41:32 +00:00
16ada80c61 [BE][CUDA][Distributed] Add require_exact_world_size() and a few distributed unit test fixes (#160803)
1. Add require_exact_world_size()
2. Decorate the test `test_new_subgroups_with_group_param` with this require_exact_world_size(4) as the test would fail with world_size of 8 when testing with 8xB200 runner.
3. Modify `test_new_subgroups_world_size_not_divisible_by_group_size` so that it will not fail due to 4 vs. 8 mismatch. Doing so makes the test pass with both 4-GPU runner and 8-GPU runner.
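
As an illustration of what such a decorator might look like, here is a standalone sketch. It is not the helper added by this PR: the name `require_exact_world_size_sketch` is hypothetical, and reading the world size from the `WORLD_SIZE` environment variable is an assumption made so the example runs without a process group.

```python
import functools
import os
import unittest


def require_exact_world_size_sketch(expected):
    """Skip the wrapped test unless the world size equals `expected`.

    Assumption: the world size is taken from the WORLD_SIZE environment
    variable (default 1); the real helper may obtain it differently.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            actual = int(os.environ.get("WORLD_SIZE", "1"))
            if actual != expected:
                raise unittest.SkipTest(
                    f"requires world size {expected}, got {actual}"
                )
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Skipping, rather than failing, keeps a test written for a 4-GPU layout from breaking a run on an 8-GPU runner.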

Separating these changes out from B200 distributed runner PR #159323

Fixes https://github.com/pytorch/pytorch/issues/159987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160803
Approved by: https://github.com/fduwjj
2025-08-18 21:15:33 +00:00
c27d6df1ea For sdists, replace symlink with copy for docs requirements (#157811)
Before this change, there was the requirements file `.ci/docker/requirements-docs.txt` which was symlinked as `../.ci/docker/requirements-docs.txt` from `docs/requirements.txt` since #151796.
In this situation, [because `.ci` is excluded from the source tarball](3173616532/.github/workflows/create_release.yml (L67)), we end up with a broken symlink, that additionally is [invalid in a Python source distribution](https://packaging.python.org/en/latest/specifications/source-distribution-format/#unpacking-without-the-data-filter).

The broken symlink can be confirmed in [the rc sources](https://github.com/pytorch/pytorch/actions/runs/15892205745).

~After this change, there is still a single source of truth, which now is `docs/requirements.txt`, symlinked as `../docs/requirements.txt` from `.ci/docker/requirements-docs.txt`, which would also be invalid in a Python source distribution, but is not included in the tarball (see above). Additionally, the docs requirements that were missing from the previous tarball, are now actually included, allowing users to build the documentation again.~

@malfet clarified offline that there is a problem with the docs workflows because they use a cache with a key that includes the hash of the requirements document in the `.ci` folder, which now no longer changes when the requirements change. Hence, a different solution is needed~, though for now the problem remains~.

The solution in this PR is simply to copy the actual document to replace the symlink just prior to creating the source distribution. This way, a single document needs to be maintained, git checkouts remain as they are, and the source distributions contain the before-missing document.
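
The copy step can be sketched as follows. This is a minimal illustration under assumptions: the helper name `replace_symlink_with_copy` is hypothetical, and the actual release workflow may perform the copy with shell commands instead.

```python
import shutil
from pathlib import Path


def replace_symlink_with_copy(link_path):
    """Replace a symlink with a regular-file copy of its target.

    Returns True if a replacement happened, False if the path was not a symlink.
    """
    link = Path(link_path)
    if not link.is_symlink():
        return False
    target = link.resolve(strict=True)  # raises if the symlink is broken
    link.unlink()                       # remove the link itself, not the target
    shutil.copy2(target, link)          # write a real file at the link's path
    return True
```

Running this just before building the source distribution leaves the git checkout untouched while the tarball receives a real file.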

A better solution may be implemented at a later stage with a better build system.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157811
Approved by: https://github.com/atalman
2025-08-18 21:10:44 +00:00
d910cb3b2d [cpp][inductor] Fix crash on bmm when input is used twice. (#160087)
Fixes #156412

For torch.bmm using CPP generated template code, when the input is used as both the first and second weights, the generated code will simplify so it only passes one input instead of 2. However, if the weights are being repacked and saved for more efficient data-loading patterns, then we need to save both inputs instead of just one. This PR fixes this issue.

## Test code:
```python
import torch

@torch.compile(mode="max-autotune")
def my_function(x, y):
    return torch.bmm(x, x)

# Test
x = torch.randn(2, 3, 3)
y = torch.randn(2, 3, 3)
result = my_function(x, y)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160087
Approved by: https://github.com/guangyey, https://github.com/jansel
2025-08-18 20:34:14 +00:00
a1a555ed7b [dynamo] Fix graph break on calling functions decorated with special context manager (#160703)
As title. This is a follow-up of the previous patch, with the goal of
supporting a new pattern that showed up in ComfyUI:
644b23ac0b/comfy/ops.py (L44)

Effectively, the semantics of calling a function decorated with a
context manager is:

```python
@ctx_manager(args)
def f(x):
    ...

f(x)
# ----->
with ctx_manager(args):
    f.__wrapped__(x)
```

Yes, a fresh context manager instance per invocation; see the CPython source code:
https://github.com/python/cpython/blob/3.12/Lib/contextlib.py#L119-L122
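
The per-call behavior can be observed with plain `contextlib`. The sketch below uses a hypothetical `tracked` context manager and an `events` list (both made up for illustration) to show that every call to the decorated function enters and exits a fresh context manager:

```python
import contextlib

events = []  # records the order in which things happen


@contextlib.contextmanager
def tracked(tag):
    events.append(f"enter {tag}")
    try:
        yield
    finally:
        events.append(f"exit {tag}")


@tracked("a")
def f(x):
    events.append(f"body {x}")
    return x + 1


f(1)
f(2)
# Each call is wrapped in its own enter/exit pair; the undecorated function
# remains reachable as f.__wrapped__.
```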

So Dynamo already
1. knows how to handle the `with ctx_manager(args)` syntax, and has
   special handling for a few torch native context managers, like
   `sdpa_kernel` in this patch.
2. can trace through a good chunk (at least the ones that matter in this
   case) of contextlib.

This patch just lets Dynamo trace a bit more into contextlib, and then
keep the torch-native special cases by moving their handling a bit down
the stack, so that no additional logic is introduced -- it's only
refactored.

This also allows us to get rid of some `_sdpa_kernel_variadic` special
handling, since now we will trace through its code, and it boils down to
`sdpa_kernel` anyways.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160703
Approved by: https://github.com/guilhermeleobas, https://github.com/mlazos
ghstack dependencies: #160684
2025-08-18 20:33:45 +00:00
72b559b2c8 [dynamo] Fix crash and silent incorrectness issues in attention.sdpa_kernel calls with kwargs (#160684)
This patch fixes 2 issues, illustrated by the test cases added:
1. using `sdpa_kernel(backends=..., set_priority=...)` crashed due to an
   internal assert that was not updated after #147768.
2. forgetting to convert the `set_priority` VariableTracker back to a
   python constant so that its value is properly used by `sdpa_kernel`,
   also from #147768.

I ran into (1) because ComfyUI had a recent update that actually uses
this pattern
644b23ac0b/comfy/ops.py (L44),
and then noticed (2), and fixed it conveniently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160684
Approved by: https://github.com/mlazos
2025-08-18 20:33:45 +00:00
cyy
1f19003694 Use py3.10 for ONNX CI jobs (#160852)
Use Python 3.10 for ONNX jobs because Python 3.9 is near EOL and further ONNX versions drop 3.9 support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160852
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-08-18 19:37:47 +00:00
4e90441133 Add signpost to provenance tracking error (#160755)
Summary: As title, add signpost to better track error when computing provenance tracking related debugging information

Test Plan:
CI

Rollback Plan:

Differential Revision: D80292285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160755
Approved by: https://github.com/angelayi
2025-08-18 19:17:47 +00:00
bfcae7e1c1 [ROCm] Fix Sliding Window Attention in AOTriton integration code (#159773)
AOTriton implements Sliding Window Attention (SWA) as a more generalized version of causal masks and also needs an atomic counter for dynamic workload allocation.
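
To illustrate why a causal mask is a special case of a sliding window, here is a minimal pure-Python sketch. The function names and the `window_left` / `window_right` convention are assumptions for illustration and are not AOTriton's API.

```python
def sliding_window_mask(seq_len, window_left, window_right):
    """Return mask[i][j] == True where query i may attend to key j."""
    return [
        [i - window_left <= j <= i + window_right for j in range(seq_len)]
        for i in range(seq_len)
    ]


def causal_mask(seq_len):
    # A causal mask is the special case: unlimited look-back, no look-ahead.
    return sliding_window_mask(seq_len, window_left=seq_len, window_right=0)
```

With `window_left=1, window_right=0`, each query sees only itself and the immediately preceding key.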

Fixes #158308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159773
Approved by: https://github.com/jeffdaily
2025-08-18 18:45:58 +00:00
01bba62e21 Remove unused test code (#160823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160823
Approved by: https://github.com/Skylion007
2025-08-18 18:37:52 +00:00
6ac9035a84 [aoti-fx] Dynamic shapes support (#160766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160766
Approved by: https://github.com/jansel
ghstack dependencies: #160765
2025-08-18 18:14:08 +00:00
bab79824cb [aoti-fx] Initial AOTInductor FX (#160765)
Using the existing WrapperFxCodegen backend, this PR prototypes an AOT version of it which will directly return a graph module.

How to use:
```python
exported_gm = torch.export.export(model, inp, dynamic_shapes=dynamic_shapes).module()
compiled_gm = torch._inductor.aot_compile(
    exported_gm, inp, options={"fx_wrapper": True, "compile_threads": 1}
)
assert torch.allclose(model(*inp), compiled_gm(*inp))
```

The motivation behind this is that backends like ExecuTorch/MTIA would like to use Inductor's optimization technologies, but might have their own graph lowering pipelines, so they might not want to use AOTI (which generates a `.so`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160765
Approved by: https://github.com/jansel
2025-08-18 18:14:08 +00:00
162bf78df6 [dynamo] Support itertools.filterfalse (#160596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160596
Approved by: https://github.com/guilhermeleobas
2025-08-18 18:07:57 +00:00
450517f346 [Dynamo][Hierarchical Compile] Flatten tuple inputs for regions (#158812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158812
Approved by: https://github.com/anijain2305
ghstack dependencies: #158810, #158811
2025-08-18 18:03:11 +00:00
664005662a Recheck Autotune cache on Precompile serialization to prune compilation results (#158656)
This PR rechecks the autotune cache on Precompile.serialize(), allowing us to ahead of time save autotune results for statically compiled triton kernels, so that warm start does not need to check the autotune cache.

It has a few extra changes to make this work:

### Storing source code in TritonBundler
- We now store the source_code for statically compiled triton kernels instead of the hash of the source code in TritonBundler, so that we can easily access their source code when rechecking the autotune cache on PrecompileContext.serialize. To make sure that this is not a huge space concern, I ran the entire hugging face benchmark on training. The total space of `/tmp/torchinductor_jjwu/fxgraph` before my change was 1185004 KB (1.18 GB). After my change, this increased to 1207312 KB (1.2 GB), for an increased storage cost of ~1.8%, which seems safe.

- We now return early from recheck_autotune_cache if the number of triton kernels being compiled is 1, since there's no reason to check the cache at all in those cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158656
Approved by: https://github.com/zhxchen17
2025-08-18 17:55:10 +00:00
c0a1ae4404 Add is_cpu method to stable tensor type (#160212)
Porting torchaudio to use the stable api requires the `is_cuda` and `dtype` functions. It would be more convenient if these were methods of the stable tensor class rather than utilities one needed to call from the C api. This PR adds them as methods, mirroring how `is_cuda` and `get_device` are already defined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160212
Approved by: https://github.com/janeyx99
2025-08-18 17:42:43 +00:00
b0071c65e2 [MPS] Fix error check for torch.var on scalar (#160889)
Fixes https://github.com/pytorch/pytorch/issues/160738
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160889
Approved by: https://github.com/Skylion007
ghstack dependencies: #160850
2025-08-18 17:36:42 +00:00
c6333f7dae Fixes for collections.NamedTuple (#159367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159367
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366, #159368, #159483, #159902, #159864, #159865
2025-08-18 17:32:59 +00:00
87d6831b2e Add CUDA installation script for CUDA 13 (#160201)
Add the almalinux docker for building magma-cuda 13.0
https://github.com/pytorch/pytorch/issues/159779

Also fixed the NVSHMEM download link

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160201
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-08-18 17:26:25 +00:00
4014672b30 Replace guard_serialization_mode with save_guards, remove load cases (#160531)
This PR replaces "guard_serialization_mode" with `save_guards`. All cases where we care about whether or not we're *loading* guards can be inferred automatically from the existing inputs.

The only case that's special here is whether or not to check guards. We don't want to check guards on guard load in CheckFnManager, because these guards have already been checked on save. Therefore, we put the setting in OutputGraphGuardsState, so that when we save, we bypass the guards check.

Because of this change, it is *technically* possible to do a load and a save in the *same* CheckFunctionManager.__init__() by passing all the necessary parts, and also passing `save_guards=True`. This should just work out of the box, but so far no callsites need it, so not super important.

Next up, we'll work on removing save_guards from GuardBuilder, and putting it into its own phase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160531
Approved by: https://github.com/zhxchen17
2025-08-18 17:04:17 +00:00
e389a08dcd AMD/ROCm OCP Micro-scaling Format (mx-fp8/mx-fp4) Support (#151360)
- This pull request introduces support for the [OCP Micro-scaling (MX) format](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf), with a focus on compatibility with AMD **ROCm 7.0** and the **gfx950** architecture.

  This PR also establishes the foundation for enabling MX-FPX features in [TorchAO](https://github.com/pytorch/ao/issues/2229) on the AMD platform.

- Validation (**ROCm 7.0** + **gfx950** required):

  `111 relevant tests passing.`

  > PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v

  Co-author: @jagadish-amd —  Thank you for the efforts leading validation on gfx950 with ROCm 7.0.

-----------------------------------

This pull request introduces support for new scalar types and scaling methods, particularly for ROCm 7.0 and gfx950, and refines testing for these features. Key changes include adding constraints for matrix dimensions, enabling block-wise scaling, and updating tests to accommodate new data types.

### Support for new scalar types and scaling methods:
* [`aten/src/ATen/cuda/CUDABlas.cpp`](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeR1876-R1885): Added constraints for matrix dimensions when using `Float8_e8m0fnu` with block-wise scaling, ensuring dimensions are multiples of 32. Updated compatibility checks to support ROCm 7.0 for `Float8_e8m0fnu` and `Float8_e4m3fn`. [[1]](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeR1876-R1885) [[2]](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeL1913-R1934)

* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1276-R1290): Introduced block-wise scaling for `Float8_e8m0fnu`, with checks for ROCm 7.0 and GPU architecture `gfx950`. Added validation for supported scalar types and matrix dimensions. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1276-R1290) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1349-R1364)
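
The dimension constraint can be pictured with a small validation sketch. It is illustrative only: the helper name is hypothetical, and which of the GEMM dimensions the C++ code actually checks is an assumption here (the sketch checks all three).

```python
MX_BLOCK_SIZE = 32  # taken from the "multiples of 32" constraint described above


def check_blockwise_dims(m, n, k, block=MX_BLOCK_SIZE):
    """Raise ValueError if any GEMM dimension is not a multiple of `block`."""
    bad = {name: v for name, v in (("m", m), ("n", n), ("k", k)) if v % block != 0}
    if bad:
        raise ValueError(f"dimensions must be multiples of {block}: {bad}")
    return True
```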

### Updates to scalar type mappings:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L93-R93): Extended scalar type mappings to support `Float4_e2m1fn_x2` for ROCm 7.0.

* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fR88-R96): Added a constexpr mapping for `Float4_e2m1fn_x2` based on ROCm version.

### Enhancements to testing (@jagadish-amd):
* [`test/test_matmul_cuda.py`](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23R765-R766): Updated tests to include new scalar types (`Float4_e2m1fn_x2`) and recipes (`mxfp4`). Added logic to handle different scaling recipes and validate compatibility with ROCm and CUDA versions. [[1]](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23R765-R766) [[2]](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23L1331-R1356)

These changes improve compatibility with newer hardware and software versions, enhance functionality for matrix operations, and ensure robust testing for the added features.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151360
Approved by: https://github.com/drisspg, https://github.com/malfet
2025-08-18 16:43:09 +00:00
f2be3dc8da [dynamo][guards] Optimize module getattr access for inline flag (#160864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160864
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #160863
2025-08-18 16:38:46 +00:00
b8ff0fd21b [dynamo][guards] Remove long lines from TORCH_LOGS=guards (#160863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160863
Approved by: https://github.com/Lucaskabela
2025-08-18 16:38:46 +00:00
6b994c47ca [MPS][BE] Fix unused vars in GridSampler (#160850)
This fixes following warnings during the compilation of GridSampler.metal
```
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/GridSampler.metal:22:23: warning: unused parameter 'input_sizes' [-Wunused-parameter]
    constant int32_t* input_sizes,
                      ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/GridSampler.metal:24:23: warning: unused parameter 'grid_sizes' [-Wunused-parameter]
    constant int32_t* grid_sizes,
                      ^
2 warnings generated.
```

Introduced by https://github.com/pytorch/pytorch/pull/160541
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160850
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-08-18 16:24:45 +00:00
3c8c509a9c [export] Fix custom ops in subgraphs (#160004)
Fixes https://github.com/pytorch/pytorch/issues/159995

Currently there are two problems with extern kernels in subgraphs:
1. They don't get serialized to the extern kernel json file because we only look at the toplevel graph.
2. Since the scope of each extern_kernel list is within its own subgraph, the indices referencing the operators are messed up because each subgraph will start counting from 0.

So, this PR moves the extern_kernels list to a global view (under virtualized) so that we can count the extern kernels across subgraphs and the toplevel graph.
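
The indexing problem and its fix can be sketched with a shared registry. `ExternKernelRegistry` and `lower_graph` are hypothetical names for illustration, not Inductor's actual data structures.

```python
class ExternKernelRegistry:
    """Hypothetical shared registry; one instance spans all graphs."""

    def __init__(self):
        self.kernels = []

    def register(self, op_name):
        # The returned index is unique across every graph using this registry.
        self.kernels.append(op_name)
        return len(self.kernels) - 1


def lower_graph(op_names, registry):
    """Pretend to lower a graph; return the index assigned to each extern op."""
    return [registry.register(name) for name in op_names]


shared = ExternKernelRegistry()
top_ids = lower_graph(["custom::foo"], shared)
sub_ids = lower_graph(["custom::bar", "custom::baz"], shared)
```

Because both graphs append to the same list, the subgraph's indices continue from where the top-level graph stopped instead of restarting at 0, and every kernel ends up in the one list that gets serialized.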

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160004
Approved by: https://github.com/ydwu4
2025-08-18 15:42:19 +00:00
1091165826 [export] Update move_to_device_pass for to.device (#160528)
Differential Revision: D80135455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160528
Approved by: https://github.com/yushangdi
2025-08-18 15:41:48 +00:00
d91a03f96a [ROCm] Add HIPConfig.h to .gitignore like CUDAConfig.h. (#159805)
This file is generated into the source directory by CMake just like `cuda/CUDAConfig.h`, so it seems appropriate to add it to `.gitignore` in the same place: 83ba3f1101/aten/src/ATen/CMakeLists.txt (L39-L47)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159805
Approved by: https://github.com/jeffdaily
2025-08-18 15:34:01 +00:00
0298ebc97a [ROCm][inductor][dashboard] Add GPT2ForSequenceClassification to use_larger_multiplier_for_smaller_tensor list (#160001)
GPT2ForSequenceClassification Hugging Face (HF) model fails on ROCm for bfloat16. The failure is numerically small. This PR adds this model to an exception list for small tensors. The exception list already includes two models. For this model, this increases the multiplier factor used in `torch/_dynamo/utils.py` to 10.0 instead of 3 (default).
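
As a rough picture of what the multiplier does, here is a simplified sketch. It assumes the check compares the compiled result's error against the eager result's error (both measured against a higher-precision reference); the real check in `torch/_dynamo/utils.py` has more conditions, and the function name below is hypothetical.

```python
def passes_accuracy_check(res_error, ref_error, tol, use_larger_multiplier=False):
    """Simplified tolerance check (an assumption, not the exact upstream logic).

    res_error: error of the compiled result vs. a high-precision reference
    ref_error: error of the eager result vs. the same reference
    """
    multiplier = 10.0 if use_larger_multiplier else 3.0
    return res_error <= multiplier * ref_error + tol
```

With `res_error=0.05` and `ref_error=0.01`, the default multiplier rejects the result while the larger multiplier accepts it.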

In the PR comment below, I include a short analysis of the numerics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160001
Approved by: https://github.com/anijain2305, https://github.com/jataylo, https://github.com/jeffdaily
2025-08-18 15:33:30 +00:00
179511694c Update slow tests (#160870)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160870
Approved by: https://github.com/pytorchbot
2025-08-18 11:53:41 +00:00
e7c3b77b22 [xla hash update] update the pinned xla hash (#160871)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160871
Approved by: https://github.com/pytorchbot
2025-08-18 11:50:47 +00:00
95e456fcc5 [inductor] pack linear for FP32 dynamic mode (#157542)
Summary:
Currently, Linear in FP32 dynamic mode(batch_size has free symbols) does not support weight prepacking since MKL Linear does not support dynamic mode. This PR uses oneDNN Linear to support Linear weight prepacking in FP32 dynamic mode.
I tested the Inductor benchmark in FP32 dynamic mode on CPU using this PR, and saw ~8% improvement in timm_models geomean speedup, ~2%  improvement in torchbench geomean speedup, and no change in huggingface. There are about 18 models with different degrees of performance improvement, among which BERT_pytorch, soft_actor_critic, BlenderbotForCausalLM, ElectraForCausalLM, crossvit_9_240, mobilevit_s, twins_pcpvt_base have more than 20% performance improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157542
Approved by: https://github.com/CaoE, https://github.com/jansel
2025-08-18 10:18:46 +00:00
de744ca4b1 [Inductor] modify convert_to_reinterpret_view (#158914)
**Summary:**
Fixes https://github.com/pytorch/pytorch/issues/159121. This PR modifies the rules for freezing the layout of `x.unwrap_view()` in `convert_to_reinterpret_view`: it relaxes the condition `isinstance(x_unwrap_view, (ReinterpretView, Buffer))` to `isinstance(x_unwrap_view, (ReinterpretView, Buffer, MutableBox))`, and prefers channels last format according to how the format of `x_unwrap_view_fx_node` is set from eager.

**Example:**
```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        n, c, h, w = x.shape
        return self.relu(x).permute(0, 2, 3, 1).reshape(
            n, h * w, c
        )

model = M().eval()
x = torch.randn(2, 32, 4, 4).to(memory_format=torch.channels_last)

compiled_model = torch.compile(model)

with torch.no_grad():
    compiled_model(x)
```

**Generated code:**
- Before
```
cpp_fused_permute_relu_view_0 = async_compile.cpp_pybinding(['const float*', 'float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const float* in_ptr0,
                       float* out_ptr0,
                       float* out_ptr1)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(16L))
            {
                for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(16L); x2+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(32L) && x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(16L)))
                        {
                            alignas(std::max(std::size_t(16), alignof(float))) float tmp0[16*16];
                            transpose_mxn<float,static_cast<int64_t>(16),static_cast<int64_t>(16),false>(in_ptr0 + static_cast<int64_t>(x1 + 32L*x2 + 512L*x0), static_cast<int64_t>(32L), tmp0, static_cast<int64_t>(16));
                            for (long x1_inner = 0; x1_inner < static_cast<int64_t>(16); x1_inner++)
                            {
                                auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<int64_t>(16L*x1_inner), static_cast<int64_t>(16));
                                auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0));
                                tmp2.store(out_ptr0 + static_cast<int64_t>(x2 + 16L*x1 + 16L*x1_inner + 512L*x0));
                            }
                        }
                    }
                }
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(16L))
            {
                for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L) && x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L)))
                        {
                            alignas(std::max(std::size_t(16), alignof(float))) float tmp0[16*16];
                            transpose_mxn<float,static_cast<int64_t>(16),static_cast<int64_t>(16),false>(out_ptr0 + static_cast<int64_t>(x1 + 16L*x2 + 512L*x0), static_cast<int64_t>(16L), tmp0, static_cast<int64_t>(16));
                            for (long x1_inner = 0; x1_inner < static_cast<int64_t>(16); x1_inner++)
                            {
                                auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<int64_t>(16L*x1_inner), static_cast<int64_t>(16));
                                tmp1.store(out_ptr1 + static_cast<int64_t>(x2 + 32L*x1 + 32L*x1_inner + 512L*x0));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (2, 32, 4, 4), (512, 1, 128, 32))
    buf0 = empty_strided_cpu((2, 32, 4, 4), (512, 16, 4, 1), torch.float32)
    buf1 = empty_strided_cpu((2, 16, 32), (512, 32, 1), torch.float32)
    cpp_fused_permute_relu_view_0(arg0_1, buf0, buf1)
    del arg0_1
    return (buf1, )
```

- After
```
cpp_fused_relu_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(1024L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1024L)))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = at::vec::clamp_min(tmp0, decltype(tmp0)(0));
                    tmp1.store(out_ptr0 + static_cast<int64_t>(x0));
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (2, 32, 4, 4), (512, 1, 128, 32))
    buf0 = empty_strided_cpu((2, 32, 4, 4), (512, 1, 128, 32), torch.float32)
    cpp_fused_relu_0(arg0_1, buf0)
    del arg0_1
    return (reinterpret_tensor(buf0, (2, 16, 32), (512, 32, 1), 0), )
```
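
The reason the "After" code can return a `reinterpret_tensor` instead of copying can be checked with stride arithmetic alone. The helpers below are a pure-Python sketch written for this illustration (they are not Inductor functions):

```python
def channels_last_strides(n, c, h, w):
    # NHWC memory order described with NCHW-indexed strides.
    return (h * w * c, 1, w * c, c)


def contiguous_strides(shape):
    strides, running = [], 1
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))


def permute(values, order):
    return tuple(values[i] for i in order)


shape = (2, 32, 4, 4)
strides = channels_last_strides(*shape)         # strides asserted in call()
perm_shape = permute(shape, (0, 2, 3, 1))       # shape after permute(0, 2, 3, 1)
perm_strides = permute(strides, (0, 2, 3, 1))
# If the permuted tensor is already dense, reshape(n, h * w, c) is a pure view.
is_dense = perm_strides == contiguous_strides(perm_shape)
view_strides = contiguous_strides((2, 16, 32))
```

Keeping the ReLU output in channels last layout makes the permuted tensor dense, so the final reshape needs no second copy loop.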

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158914
Approved by: https://github.com/CaoE, https://github.com/jansel
2025-08-18 07:41:20 +00:00
b82aa3df20 Revert "Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. (#159197)"
This reverts commit e444cd24d48b3a46f067974f2cc157f5ed27709f.

Reverted https://github.com/pytorch/pytorch/pull/159197 on behalf of https://github.com/laithsakka due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/159197#issuecomment-3195436668))
2025-08-18 07:22:13 +00:00
d8d589bd3a Add build support for RISCV (#160172)
In requirements.txt, do not install lintrunner on riscv64

Fixes #160170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160172
Approved by: https://github.com/malfet
2025-08-18 05:29:34 +00:00
3c6efd1380 Add cutedsl template support to compile (#160108)
## Summary
Still figuring out what actually writing a template should look like, but this lands a lot of the base infra.

<img width="1267" height="262" alt="Screenshot 2025-08-16 at 10 22 12 PM" src="https://github.com/user-attachments/assets/229f8bfa-0cb4-4fb1-8530-f535e569d350" />

Test code:

```Python
#!/usr/bin/env python3
"""
Fixed CuteDSL template test with proper def_kernel usage.
"""

import torch
import torch._inductor.config as config
from torch._inductor.lowering import lowerings
from torch._inductor.ir import TensorBox
from torch._inductor.select_algorithm import autotune_select_algorithm
from torch._inductor.codegen.cutedsl import CuteDSLTemplate

def create_fixed_cutedsl_template():
    """Create a properly structured CuteDSL template."""

    def cutedsl_grid(M, N, meta):
        return (1,)

    # Part 1: Imports and kernel definition
    template_part1 = r"""
import torch
import cutlass
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack

@cute.kernel
def {{kernel_name}}_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
    # Get thread and block indices
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()

    thread_idx = bidx * bdim + tidx
    m, n = gA.shape

    if thread_idx < m * n:
        mi = thread_idx // n
        ni = thread_idx % n

        if mi < m and ni < n:
            a_val = gA[mi, ni]
            b_val = gB[mi, ni]
            result = a_val + b_val
            gC[mi, ni] = result
"""

    # Part 2: JIT wrapper function
    template_part2 = r"""
@cute.jit
def {{kernel_name}}_jit(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
    m, n = mA.shape
    total_threads = m * n
    threads_per_block = 256
    num_blocks = (total_threads + threads_per_block - 1) // threads_per_block

    kernel = {{kernel_name}}_kernel(mA, mB, mC)
    kernel.launch(
        grid=[num_blocks, 1, 1],
        block=[threads_per_block, 1, 1]
    )
"""

    # Part 3: Main kernel function
    template_part3 = r"""
{{def_kernel("input_a", "input_b", "output_c")}}
    cute_a = from_dlpack(input_a, assumed_align=16)
    cute_b = from_dlpack(input_b, assumed_align=16)
    cute_c = from_dlpack(output_c, assumed_align=16)

    # Launch kernel
    {{kernel_name}}_jit(cute_a, cute_b, cute_c)

    return output_c
"""

    # Combine all parts
    template = CuteDSLTemplate(
        name="fixed_add",
        grid=cutedsl_grid,
        source=template_part1 + template_part2 + template_part3
    )

    return template

def fixed_cutedsl_lowering(a: TensorBox, b: TensorBox) -> TensorBox:
    """Fixed CuteDSL lowering."""
    print(f"[FIXED] CuteDSL lowering: {a.get_size()} + {b.get_size()}")

    template = create_fixed_cutedsl_template()
    choices = []

    error = template.maybe_append_choice(
        choices,
        input_nodes=[a.data, b.data],
        layout=a.get_layout()
    )

    if error or not choices:
        print(f"[FIXED] Falling back: {error}")
        default_lowering = lowerings[torch.ops.aten.add.Tensor]
        return default_lowering(a, b)

    print(f"[FIXED] Using CuteDSL with {len(choices)} choices")

    result = autotune_select_algorithm(
        "fixed_cutedsl_add",
        choices,
        [a, b],
        a.get_layout(),
    )

    return result

def test_fixed_cutedsl():
    """Test the fixed CuteDSL template."""
    print("=" * 50)
    print("Fixed CuteDSL Template Test")
    print("=" * 50)

    original = lowerings.get(torch.ops.aten.add.Tensor, None)

    try:
        lowerings[torch.ops.aten.add.Tensor] = fixed_cutedsl_lowering

        def test_add(x, y):
            return x + y

        device = "cuda" if torch.cuda.is_available() else "cpu"
        x = torch.randn(128, 4, device=device, dtype=torch.float32)
        y = torch.randn(128, 4, device=device, dtype=torch.float32)

        print(f"[FIXED] Testing with {x.shape} tensors on {device}")

        compiled_fn = torch.compile(test_add, backend="inductor")
        result = compiled_fn(x, y)

        # Verify correctness
        expected = x + y
        if torch.allclose(result, expected, atol=1e-5):
            print(" [FIXED] Results match!")
            return True
        else:
            print(" [FIXED] Results don't match!")
            return False

    except Exception as e:
        print(f" [FIXED] Failed: {e}")
        import traceback
        traceback.print_exc()
        return False

    finally:
        if original:
            lowerings[torch.ops.aten.add.Tensor] = original
        else:
            lowerings.pop(torch.ops.aten.add.Tensor, None)

if __name__ == "__main__":
    success = test_fixed_cutedsl()
    print("🎉 Fixed test completed!" if success else "💥 Fixed test failed!")

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160108
Approved by: https://github.com/mlazos
2025-08-18 04:37:15 +00:00
d18007a1d0 [vllm hash update] update the pinned vllm hash (#160847)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160847
Approved by: https://github.com/pytorchbot
2025-08-18 04:36:28 +00:00
138413907a [nativert] oss subgraph rewriter (#160780)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D80367765

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160780
Approved by: https://github.com/SherlockNoMad, https://github.com/georgiaphillips
2025-08-18 04:25:05 +00:00
3ced4f1e6c Revert "Use numpy 1.26.2 for Python 3.9 and 3.10 (#160836)"
This reverts commit 7a68d02292fd7a430b55c5bce3268a33c7ec5055.

Reverted https://github.com/pytorch/pytorch/pull/160836 on behalf of https://github.com/clee2000 due to broke some inductor jobs? Maybe just update the expected values? Not sure what the policy is for something like this [GH job link](https://github.com/pytorch/pytorch/actions/runs/17024529273/job/48262123844) [HUD commit link](7a68d02292) ([comment](https://github.com/pytorch/pytorch/pull/160836#issuecomment-3194953213))
2025-08-18 03:09:31 +00:00
075a2e6967 [PGO] add extra read/write keys (#160715)
Differential Revision: D80321215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160715
Approved by: https://github.com/bobrenjc93
2025-08-18 01:41:08 +00:00
cyy
7a68d02292 Use numpy 1.26.2 for Python 3.9 and 3.10 (#160836)
Because numpy 1.22.4 reached EOL 3 years ago.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160836
Approved by: https://github.com/malfet
2025-08-17 18:39:06 +00:00
63e1b58a13 [easy] [Precompile] Refactor guards, improve typing (#160530)
Purely a refactor: improve typing and get rid of some type errors. Mark certain fields as non-null, since in general they are not empty.

The goal of this stack of PRs is to move the save/load logic of guard serialization into separate, flat phases, instead of being embedded in guard creation. This way, we can put a try/catch around it and fail safely if certain guards are not serializable.
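
A minimal sketch of the "separate, flat phase with a try/catch" idea described above. It uses `pickle` and plain dictionaries as stand-ins; the function names `build_guards` and `serialize_guards` are hypothetical and are not the names used in the PR stack.

```python
import pickle


def build_guards():
    # Phase 1 (hypothetical): create guards only; nothing is serialized here.
    return {"shape_check": ("x", (2, 3)), "closure_check": lambda: True}


def serialize_guards(guards):
    # Phase 2 (hypothetical): a flat serialization pass that can fail safely.
    try:
        return pickle.dumps(guards)
    except (pickle.PicklingError, AttributeError, TypeError):
        # Some guard is not serializable: report failure instead of crashing.
        return None


serializable = serialize_guards({"shape_check": ("x", (2, 3))})
unserializable = serialize_guards(build_guards())  # contains a lambda
```

Because serialization is no longer interleaved with guard creation, a failure in phase 2 leaves the guards from phase 1 intact and usable.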

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160530
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
2025-08-17 17:54:55 +00:00
cyy
960c03daf6 Remove unused CONDA_CMAKE option (#160832)
Remove CONDA_CMAKE from `.ci/docker/build.sh`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160832
Approved by: https://github.com/malfet
2025-08-17 17:08:42 +00:00
04c7be903d Revert "[BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747)"
This reverts commit 8f434545c2e48c858d8b0d06db8f9642d6a87ad0.

Reverted https://github.com/pytorch/pytorch/pull/160747 on behalf of https://github.com/malfet due to Looks like this breaks rocm, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm%20%2F%20linux-jammy-rocm-py3.10 ([comment](https://github.com/pytorch/pytorch/pull/160747#issuecomment-3194417733))
2025-08-17 14:22:48 +00:00
691d17a5c6 Update TensorPipe submodule (#160808)
To a commit containing  https://github.com/pytorch/tensorpipe/pull/464 that fixes compilation with CUDA-13

Fixes https://github.com/pytorch/pytorch/issues/160104
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160808
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007, https://github.com/malfet
2025-08-17 14:11:41 +00:00
c699668009 [inductor] TLParse tensor metadata logging + test (#160132)
Summary:
- Add TLParse artifact logging per op with output tensor shape, stride, and dtype for cross-rank aggregation.

Testing:
- Add test to verify structure and contents of tlparse artifact

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160132
Approved by: https://github.com/xmfan
2025-08-17 04:27:49 +00:00
0b56f3aed8 [vllm hash update] update the pinned vllm hash (#160831)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160831
Approved by: https://github.com/pytorchbot
2025-08-17 04:25:26 +00:00
8f434545c2 [BE] [Inductor] Re-Land Support TMA before strict 3.4 cutoff (#160747)
Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but someone working with an alternative version of Triton may not match it. This moves the check from "Triton 3.4" to "any variant that supports the TMA APIs".
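
The general shift from a version check to a capability check can be sketched as follows. This is only an illustration: the attribute name `tma_api_placeholder` and the helper `has_tma_support` are hypothetical, and `SimpleNamespace` objects stand in for Triton builds.

```python
from types import SimpleNamespace

# Hypothetical attribute name, used only for illustration.
REQUIRED_TMA_ATTRS = ("tma_api_placeholder",)


def has_tma_support(triton_like_module, required=REQUIRED_TMA_ATTRS):
    # Probe for the capability itself rather than comparing version strings.
    return all(hasattr(triton_like_module, name) for name in required)


# A custom build with a non-standard version string that still has the API.
custom_build = SimpleNamespace(__version__="3.3.1+custom", tma_api_placeholder=object())
# A build that lacks the API entirely.
older_build = SimpleNamespace(__version__="3.2.0")

custom_ok = has_tma_support(custom_build)
older_ok = has_tma_support(older_build)
```

A capability probe like this accepts forks and backports that a strict version comparison would reject.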

Test Plan:
Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda`

Rollback Plan:

Differential Revision: D80348643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747
Approved by: https://github.com/NikhilAPatel
2025-08-17 00:35:12 +00:00
26297c27e2 Revert "[inductor] TLParse tensor metadata logging + test (#160132)"
This reverts commit 2603e40be5fa4a66301e6654e34a82a67f2e4913.

Reverted https://github.com/pytorch/pytorch/pull/160132 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/17010600949/job/48226137423) [HUD commit link](2603e40be5).  landrace with another PR that changed some had_cuda related things ([comment](https://github.com/pytorch/pytorch/pull/160132#issuecomment-3193969792))
2025-08-16 23:47:03 +00:00
74871d4d46 [collections.abc] Ensure that binop calls works with UserDefinedObjects (#159865)
Changes:
(1) Replace UserDefinedSetVariable by UserDefinedObjectVariable in all binop calls

Test plan:
(1) The three tests from CPython `test_collections.py` ensure that Dynamo can trace through a dunder method (e.g. `__add__`, `__ixor__`, etc.) defined in a user-defined class

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159865
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366, #159368, #159483, #159902, #159864
2025-08-16 20:44:40 +00:00
f019da2979 Implement list(UserDefinedObject) via force_unpack_var_sequence (#159864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159864
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366, #159368, #159483, #159902
2025-08-16 20:44:40 +00:00
f1bc843a5d Wrap class definitions in set_fullgraph(False) in test_collections (#159902)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159902
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366, #159368, #159483
2025-08-16 20:42:15 +00:00
2603e40be5 [inductor] TLParse tensor metadata logging + test (#160132)
Summary:
- Add TLParse artifact logging per op with output tensor shape, stride, and dtype for cross-rank aggregation.

Testing:
- Add test to verify structure and contents of tlparse artifact

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160132
Approved by: https://github.com/xmfan
ghstack dependencies: #160260
2025-08-16 16:37:18 +00:00
8fe4b3f848 [BE][CI] move MYPYSTRICT linter from lintrunner-noclang to lintrunner-mypy (#160806)
Like `MYPY`, linter `MYPYSTRICT` will need `--all-files` too.

See also:

- https://github.com/pytorch/pytorch/pull/160652#issuecomment-3193390813

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160806
Approved by: https://github.com/seemethere
2025-08-16 16:15:22 +00:00
cff6def7f4 [MTIA] add correct name for CFF in tlparse (#160599)
Differential Revision: D80201622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160599
Approved by: https://github.com/bdhirsh
2025-08-16 14:58:03 +00:00
e444cd24d4 Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. (#159197)
This might cause some new DDEs at call sites that do not use is_contiguous_or_false() or sym_is_contiguous(), but we want to find those call sites and handle them properly by explicitly calling is_contiguous_or_false() instead of is_contiguous() where appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context.

In https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous as the function that computes contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE.

when people call is_contiguous we do sym_is_contiguous().guard_bool()
when people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false()
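
The difference between the two guards can be pictured with a toy model. This is not PyTorch's implementation: `SymBoolSketch` and `DataDependentError` are hypothetical names, and `None` stands in for a symbolic expression that cannot be decided.

```python
class DataDependentError(RuntimeError):
    """Toy stand-in for a DDE."""


class SymBoolSketch:
    def __init__(self, value):
        # True/False when the answer is known, None when it is undecidable.
        self.value = value

    def guard_bool(self):
        # Strict: an undecidable expression is an error.
        if self.value is None:
            raise DataDependentError("cannot decide contiguity")
        return self.value

    def guard_or_false(self):
        # Lenient: an undecidable expression is treated as False.
        return False if self.value is None else self.value


known = SymBoolSketch(True)
unknown = SymBoolSketch(None)

strict_known = known.guard_bool()
lenient_unknown = unknown.guard_or_false()
try:
    unknown.guard_bool()
    strict_unknown_raised = False
except DataDependentError:
    strict_unknown_raised = True
```

In this model, callers that can tolerate a conservative "not contiguous" answer use the lenient guard and never see the error.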

one issue not handled well was this path
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }

  return sym_is_contiguous_default(memory_format);
}
```
Namely, if we call sym_is_contiguous_custom while matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, we used to call is_contiguous(this, memory_format).

That call went through load_pyobj_interpreter and ended up in the Python is_contiguous, which used implicit size-oblivious reasoning.
Once that implicit size-oblivious reasoning is removed, the right thing to call is
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get a DDE even if the caller is using sym_is_contiguous.

So I had to define it for the pyinterpreter, and then override it for nested tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159197
Approved by: https://github.com/ezyang
2025-08-16 09:15:58 +00:00
a84541c73f Update transformers version automatically with Dependabot (#160635)
My proposal here is to use GitHub Dependabot to make sure that the `transformers` version used in CI is always up-to-date.  To achieve this, this PR does 2 things:

1. Pin `transformers` version across all CI jobs to only one place at `.ci/docker/ci_commit_pins/huggingface.txt`.  This file is now a regular pip requirements instead of a pinned commit text.  There isn't any need to pin `transformers` to a specific commit and the file already refers to a stable version `v4.54.0`
2. Create `.github/dependabot.yml` to config the bot to update `transformers` automatically when there is a new version.  Those labels will ensure that the right reviewers from torch.compile and Dev Infra are notified.  I'm not sure how to test this out in PR, but it feels ok to land and test this in main.  If this works, we should see a PR to update `v4.54.0` to the current latest `v4.55.0`

### Reference
https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160635
Approved by: https://github.com/ZainRizvi
2025-08-16 05:53:39 +00:00
114813ca77 Fix mypy errors: PyTreeSpec inheritance (#160652)
Fixes #160650.

I added a `type: ignore` comment to the `LeafSpec` class inheritance in `torch/utils/_cxx_pytree.py` to handle `PyTreeSpec` being marked as final in optree's type stubs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160652
Approved by: https://github.com/Skylion007
2025-08-16 05:14:11 +00:00
11b6ceb7b4 [ONNX] Default to dynamo export (#159646)
Set dynamo=True and enable fallback.

1. Implemented compatible behavior so that a BytesIO object is accepted as `f`
2. Update tests to explicitly set dynamo=False

#151693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159646
Approved by: https://github.com/titaiwangms
2025-08-16 04:48:58 +00:00
fb7e60ba7a [Dynamo][Hierarchical Compile] Flatten tuple outputs in graph dedupe pass (#158811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158811
Approved by: https://github.com/anijain2305
ghstack dependencies: #158810
2025-08-16 04:45:31 +00:00
f89186e910 [audio hash update] update the pinned audio hash (#160797)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160797
Approved by: https://github.com/pytorchbot
2025-08-16 04:26:59 +00:00
10eb83734f [vllm hash update] update the pinned vllm hash (#160699)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160699
Approved by: https://github.com/pytorchbot
2025-08-16 04:26:55 +00:00
75ea93484c [vllm test] add vllm.yml and additional package (#160698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160698
Approved by: https://github.com/huydhn
ghstack dependencies: #160116
2025-08-16 04:24:20 +00:00
45c2c7a5fc Fix the wrong dataclasses_json mointoring dep MacOS test (#160796)
Typo mistake.  This should be `dataclasses_json` https://github.com/pytorch/pytorch/actions/runs/17000197828/job/48200676725#step:10:23
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160796
Approved by: https://github.com/yangw-dev
2025-08-16 04:00:31 +00:00
b74c7cd335 Add kernel stack traces tlparse dump (#160608) (#160779)
Summary:

as title

This is requested by the zoomer team so they can add stack trace information to profiler result.

Test Plan:
```
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r  stack_traces
```

Rollback Plan:

Differential Revision: D80050233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160779
Approved by: https://github.com/angelayi
2025-08-16 03:12:38 +00:00
b7ca502f29 [ROCm][Windows] Add hipcc compatibility flags to cpp_extension.py. (#159790)
This is a similar change to https://github.com/pytorch/pytorch/pull/153986, this time adding flags to the hipcc command under `cpp_extension.py`.

The `-Wno-ignored-attributes` flag in particular avoids about 200MB of warning spam when building torchvision, like these:
```
In file included from D:\b\vision_main\torchvision\csrc\ops\hip\deform_conv2d_kernel.hip:72:
In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ATen.h:13:
In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/Functions.h:386:
In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ops/_sparse_softmax.h:21:
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ops/_sparse_softmax_ops.h:18:8: warning: __declspec attribute 'dllimport' is not supported [-Wignored-attributes]
   18 | struct TORCH_API _sparse_softmax_int {
      |        ^~~~~~~~~
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\torch/headeronly/macros/Export.h:100:19: note: expanded from macro 'TORCH_API'
  100 | #define TORCH_API C10_IMPORT
      |                   ^~~~~~~~~~
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\torch/headeronly/macros/Export.h:53:31: note: expanded from macro 'C10_IMPORT'
   53 | #define C10_IMPORT __declspec(dllimport)
      |                               ^~~~~~~~~
```

The `-fms-extensions` flag just seems beneficial to include: https://clang.llvm.org/docs/MSVCCompatibility.html.

See also this downstream issue where these changes were tested: https://github.com/ROCm/TheRock/issues/910.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159790
Approved by: https://github.com/jeffdaily
2025-08-16 02:20:49 +00:00
7bd4cfaef4 [BE] Update nvshem dependency to 3.3.20 (#160458)
Which is manylinux2_28 compatible, even on aarch64 platform

archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works.
Package `libnvshmem_host.so.3` into gigantic aarch64+CUDA wheel
Should fix https://github.com/pytorch/pytorch/issues/160425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160458
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/tinglvv
2025-08-16 02:00:57 +00:00
c015e53d37 Revert "[BE] Update nvshem dependency to 3.3.20 (#160458)"
This reverts commit e0488d9f00865fb56c931580c80e099771c6285e.

Reverted https://github.com/pytorch/pytorch/pull/160458 on behalf of https://github.com/wdvr due to need to rerun workflow generation (failing workflow-checks) ([comment](https://github.com/pytorch/pytorch/pull/160458#issuecomment-3193133706))
2025-08-16 01:47:42 +00:00
65dc4df74d unify broadcast_shapes functions and avoid duplicates (#160251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160251
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
ghstack dependencies: #160250
2025-08-16 00:54:32 +00:00
c03809e8a5 guard_or_false cat ops (#160250)
keep existing unbacked semantics unchanged, just use guard_or_false instead of guard_size_obl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160250
Approved by: https://github.com/ColinPeppler, https://github.com/jingsh
2025-08-16 00:54:31 +00:00
e0488d9f00 [BE] Update nvshem dependency to 3.3.20 (#160458)
Which is manylinux2_28 compatible, even on aarch64 platform

archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works.
Package `libnvshmem_host.so.3` into gigantic aarch64+CUDA wheel
Should fix https://github.com/pytorch/pytorch/issues/160425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160458
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/tinglvv
2025-08-16 00:50:13 +00:00
f782c790df migrate more simple gso checks (#160253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160253
Approved by: https://github.com/bobrenjc93
2025-08-16 00:15:24 +00:00
16ce2c15fa Add python 3.14 support to linux aarch64 builds (#160788)
Related to https://github.com/pytorch/pytorch/issues/156856
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160788
Approved by: https://github.com/malfet
2025-08-16 00:03:21 +00:00
0d28d12b11 Fix typo packing libnvshmem into libtorch (#160778)
Fix typo after https://github.com/pytorch/pytorch/pull/160465
Fixes: https://github.com/pytorch/pytorch/issues/160762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160778
Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/Skylion007
2025-08-15 23:43:02 +00:00
838f22c57d Do not incorrectly chain each of the strings as iterables (#160709)
Signed-off-by: Edward Yang <ezyang@meta.com>
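
The commit body does not show the code, but the general pitfall the title names can be illustrated with `itertools.chain`. The variable names below are made up for the example.

```python
from itertools import chain

tags = ["cuda", "rocm"]
extra = "xpu"

# Pitfall: a str is itself an iterable of characters, so chaining it
# directly splits it into single letters.
chained_chars = list(chain(tags, extra))

# Fix: wrap the lone string in a list so it is treated as one item.
chained_items = list(chain(tags, [extra]))
```

Here `chained_chars` ends with the separate characters of `"xpu"`, while `chained_items` keeps `"xpu"` whole.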

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160709
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2025-08-15 23:22:24 +00:00
eqy
387fe847ab [cuDNN][SDPA] Introduce TORCH_CUDNN_SDPA_AVOID_RECOMPILE=1 (#155958)
Opt-in for now, but basically uses the variable-sequence length/ragged path for the common case of BSHD layout to avoid recompiling for different sequence lengths.

Built on top of #149282

Tested using a primitive fuzzer, seems at least as stable as default path (with recompilation) on B200 (50000+ cases tested without any failures)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155958
Approved by: https://github.com/drisspg
2025-08-15 21:59:18 +00:00
40311e2ec1 [AOTInductor] ABI-Compatibility for RecordFunction. (#159842)
Summary:
Previously, our implementation of RecordFunction injected ATen into
codegen, which breaks the ABI contract for AOTInductor.

C10::IValue is added to call the full record function. Extensions for
more profiling info will come in later PRs.

Test Plan:
Included in commit.

Differential Revision: [D79622071](https://our.internmc.facebook.com/intern/diff/D79622071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159842
Approved by: https://github.com/desertfire
2025-08-15 21:45:47 +00:00
8ca8b6053c [inductor][while_loop][be] improve the readability of output handling (#160374)
The logic doesn't change; this just makes it easier to read and modify.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160374
Approved by: https://github.com/zou3519
ghstack dependencies: #160548
2025-08-15 20:13:12 +00:00
ff86509a06 [map] filter none gradients and add autograd inductor tests (#160548)
Filtering the None outputs in autograd backward for other HOPs will come as follow-ups.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160548
Approved by: https://github.com/zou3519
2025-08-15 20:13:12 +00:00
fa75ba9303 Change IR node's stack traces to return a set of stack traces only (#160701)
Summary: `TORCH_LOGS="+inductor"` can produce excessive stack trace output when a single line of code corresponds to many post-grad nodes, e.g. `self.multihead_attn(x, x, x)`. In that case the same stack trace appears many times in the IR node, spamming the output log. IR nodes now return a set of stack traces instead.
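
The deduplication idea can be sketched with a plain set. The dictionaries below are a hypothetical stand-in for origin nodes, and `collect_stack_traces` is not the name of the real Inductor method.

```python
def collect_stack_traces(origin_nodes):
    # A set keeps one copy of each distinct trace and drops missing ones.
    return {node["stack_trace"] for node in origin_nodes if node.get("stack_trace")}


trace = 'File "model.py", line 12, in forward\n    self.multihead_attn(x, x, x)'

# Fifty post-grad nodes that all originate from the same source line.
nodes = [{"name": f"node_{i}", "stack_trace": trace} for i in range(50)]
nodes.append({"name": "no_trace", "stack_trace": None})

unique = collect_stack_traces(nodes)
```

With a list, the log would print the same trace fifty times; with a set it is printed once.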

Test Plan:
CI

Rollback Plan:

Differential Revision: D80310549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160701
Approved by: https://github.com/angelayi
2025-08-15 19:31:59 +00:00
b78968b4d1 Support next(iterator, default) (#159483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159483
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366, #159368
2025-08-15 19:08:21 +00:00
e5621b4d8b Fixes for collections.Counter (#159368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159368
Approved by: https://github.com/mlazos
ghstack dependencies: #159365, #159366
2025-08-15 19:08:21 +00:00
2542e71f3f Change mutation type of MutableMappingVariable to AttributeMutationNew (#159366)
Also add MutableMappingVariable to `call_or_` / `call_ior`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159366
Approved by: https://github.com/zou3519
ghstack dependencies: #159365
2025-08-15 19:08:21 +00:00
0242d40fa5 Enable trace through the collections module (#159365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159365
Approved by: https://github.com/zou3519
2025-08-15 19:08:21 +00:00
17de899709 Add py3.14 to macos arm64 (#160593)
Related to https://github.com/pytorch/pytorch/issues/156856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160593
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-08-15 18:52:10 +00:00
25d0d8b0a3 [inductor] Fix propagating torch.utils._sympy.functions.Identity in IndexPropagation (#155504)
Fixes https://github.com/pytorch/pytorch/issues/160535

An index may contain `torch.utils._sympy.functions.Identity`. When we call `SymPyOps.index_expr`, if the value is a sympy.Expr containing Identity, `TypedExpr(value, dtype)` will fail. So when we unwrap arguments, we expand the sympy expression to unwrap Identity.

Test Plan:
buck run @mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_sym_expr_indexing

Rollback Plan:

Differential Revision: D76308640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155504
Approved by: https://github.com/eellison
2025-08-15 18:38:23 +00:00
c6d697ff52 port 2 distributed pipeline test files for Intel GPU (#159140)
This is another PR to port distributed pipeline tests to Intel GPU; the other PR is https://github.com/pytorch/pytorch/pull/159033.
This PR ports two test files to Intel GPU.
We enable Intel GPU with the following methods while keeping the original code style as much as possible:

1. Use instantiate_device_type_tests()
2. Skip the case on XPU due to an accuracy gap introduced by oneDNN non-determinism

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159140
Approved by: https://github.com/guangyey, https://github.com/d4l3k, https://github.com/H-Huang
2025-08-15 18:29:50 +00:00
30d2f98daa Revert "[cutlass backend] re-add pip cutlass path (#160180)"
This reverts commit d556586448f3caab85673c7da0978fe31c7748f7.

Reverted https://github.com/pytorch/pytorch/pull/160180 on behalf of https://github.com/atalman due to broke macos nightly ([comment](https://github.com/pytorch/pytorch/pull/160180#issuecomment-3192311552))
2025-08-15 18:00:41 +00:00
8780d28c65 raise exception in case of errors in memory reordering (#160455)
This PR introduces two checks in the memory reordering pass to catch graph issues before performing the reordering. In situations not covered by these checks, the reordering pass might still fail, and an exception is raised in that case.

This addresses issue -- https://github.com/pytorch/pytorch/issues/159568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160455
Approved by: https://github.com/eellison
2025-08-15 17:31:55 +00:00
da8f48d88f [associative_scan] support gen_schema for associative_scan (#158883)
In-place mutation may create inter-loop dependency that breaks the parallelism we have for associative_scan so we ban input mutations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158883
Approved by: https://github.com/zou3519
ghstack dependencies: #154193, #158965, #158863, #158864
2025-08-15 17:28:44 +00:00
cb9e2092a8 [scan] support gen_schema for scan (#158864)
We don't want to allow scan's combine_fn to mutate its inputs. The semantics of the mutation can be confusing. For example:
```python
def combine_fn(init, x):
    ...  # body elided in the original message
```
If combine_fn mutates init, only the first iteration mutates init; the remaining iterations mutate the previous carry, which is an intermediate result. This is an odd semantics because the only observable mutation is to init, which can be done outside of combine_fn.

If combine_fn mutates x, where x is a slice of scanned inputs (i.e. xs), this pattern is more meaningful but we've not seen any use case yet.
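
The first case can be reproduced with a pure-Python model of scan. This is a sketch of the semantics only, not the `torch` higher-order op; `scan` and `mutating_combine_fn` are defined here for illustration, and lists stand in for tensors.

```python
def scan(combine_fn, init, xs):
    # Reference loop: thread a carry through xs and collect per-step outputs.
    carry = init
    ys = []
    for x in xs:
        carry, y = combine_fn(carry, x)
        ys.append(y)
    return carry, ys


def mutating_combine_fn(carry, x):
    carry[0] += x           # in-place mutation of the incoming carry
    new_carry = [carry[0]]  # a fresh intermediate becomes the next carry
    return new_carry, carry[0]


init = [0]
final, ys = scan(mutating_combine_fn, init, [1, 2, 3])
```

After the call, `init` reflects only the first step's update, while later mutations land on intermediates that the caller never sees.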
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158864
Approved by: https://github.com/zou3519
ghstack dependencies: #154193, #158965, #158863
2025-08-15 17:28:44 +00:00
f6bf1573fc [while_loop] support gen_schema for while_loop (#158863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158863
Approved by: https://github.com/zou3519
ghstack dependencies: #154193, #158965
2025-08-15 17:28:34 +00:00
82a18423be [BE] create an empty shape_env for check_input_alias_and_mutation_return_outputs (#158965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158965
Approved by: https://github.com/zou3519
ghstack dependencies: #154193
2025-08-15 17:28:20 +00:00
3fe3c23d4e [cond] support gen_schema for cond (#154193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154193
Approved by: https://github.com/zou3519
2025-08-15 17:28:13 +00:00
052c441cf4 Add logging for when inbuilt_inline_nn_modules will help with ID_MATCH guard triggered recompiles (#160592)
We add logging for when an ID_MATCH guard is added at a place where inbuilt_inline_nn_modules would inline it. The aim is to tag recompiles that could be avoided by setting the inbuilt_inline_nn_modules flag.
It will help us log and track the flag's adoption and potentially quantify the savings in the number of recompiles.

Differential Revision: D80075975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160592
Approved by: https://github.com/anijain2305
2025-08-15 17:09:39 +00:00
b26d2a9464 [ez] Make NUMA signpost parameters JSON serializable (#160710)
# Context
Broader context in #160163.

In order for the _utils_internal version of signpost_event to do proper logging, its `parameters` argument needs to be JSON serializable.

# This PR
Convert `NumaOptions` to serializable form before inputting to `signpost_event`.
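
A minimal sketch of this kind of conversion, assuming an options dataclass with an enum field. `NumaOptionsSketch`, `AffinityModeSketch`, their fields, and `to_json_safe` are hypothetical and do not mirror the real `NumaOptions` definition.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class AffinityModeSketch(Enum):
    # Hypothetical enum; json.dumps cannot encode Enum members directly.
    NODE = "node"


@dataclass(frozen=True)
class NumaOptionsSketch:
    affinity_mode: AffinityModeSketch
    should_fall_back: bool = False


def to_json_safe(options):
    # Flatten the dataclass and replace enum members with their plain values.
    raw = asdict(options)
    return {k: (v.value if isinstance(v, Enum) else v) for k, v in raw.items()}


params = to_json_safe(NumaOptionsSketch(AffinityModeSketch.NODE))
encoded = json.dumps(params)
```

Converting before the logging call keeps the logging function itself free of any knowledge about the options type.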

# Test Plan
## Automated
Added tests `$ pytest test/test_numa_binding.py`.

## Manual
See [D80317206](https://www.internalfb.com/diff/D80317206).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160710
Approved by: https://github.com/kiukchung
2025-08-15 16:52:43 +00:00
6382302990 [MPS] Add grid_sampler_3d for MPS (#160541)
This PR adds support for `grid_sampler_3d` for MPS with "bilinear" interpolation.

NOTE: "nearest" interpolation is not yet supported

Fixes #159882
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160541
Approved by: https://github.com/malfet
2025-08-15 16:19:25 +00:00
80dd05e31e Disable flaky cpp test RecordDebugHandles.Basic (#160577)
Test is flaky and sometimes hangs in CI

Here's an example of the failure:
https://github.com/pytorch/pytorch/actions/runs/16946153494/job/48027937663
```

2025-08-13T20:54:00.1223688Z ==================================== RERUNS ====================================
2025-08-13T20:54:00.1224156Z ___________________________ RecordDebugHandles.Basic ___________________________
2025-08-13T20:54:00.1224682Z [gw2] linux -- Python 3.13.5 /opt/conda/envs/py_3.13/bin/python3.13
2025-08-13T20:54:00.1225568Z Internal Error: calling /opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit for test RecordDebugHandles.Basic failed (returncode=-6):
2025-08-13T20:54:00.1226430Z CUDA not available. Disabling CUDA and MultiCUDA tests
2025-08-13T20:54:00.1226988Z Note: Google Test filter = RecordDebugHandles.Basic-*_CUDA:*_MultiCUDA
2025-08-13T20:54:00.1227450Z [==========] Running 1 test from 1 test suite.
2025-08-13T20:54:00.1227792Z [----------] Global test environment set-up.
2025-08-13T20:54:00.1228145Z [----------] 1 test from RecordDebugHandles
2025-08-13T20:54:00.1228492Z [ RUN      ] RecordDebugHandles.Basic
2025-08-13T20:54:00.1228822Z [       OK ] RecordDebugHandles.Basic (1 ms)
2025-08-13T20:54:00.1229204Z [----------] 1 test from RecordDebugHandles (1 ms total)
2025-08-13T20:54:00.1229501Z
2025-08-13T20:54:00.1229666Z [----------] Global test environment tear-down
2025-08-13T20:54:00.1230033Z [==========] 1 test from 1 test suite ran. (1 ms total)
2025-08-13T20:54:00.1230355Z [  PASSED  ] 1 test.
2025-08-13T20:54:00.1230727Z terminate called after throwing an instance of 'std::system_error'
2025-08-13T20:54:00.1231154Z   what():  Invalid argument
2025-08-13T20:54:00.1231416Z unknown file:0: C++ failure
2025-08-13T20:54:00.1231788Z ------------------------------ Captured c++ call -------------------------------
2025-08-13T20:54:00.1232262Z CUDA not available. Disabling CUDA and MultiCUDA tests
2025-08-13T20:54:00.1232745Z Note: Google Test filter = RecordDebugHandles.Basic-*_CUDA:*_MultiCUDA
2025-08-13T20:54:00.1233199Z [==========] Running 1 test from 1 test suite.
2025-08-13T20:54:00.1233557Z [----------] Global test environment set-up.
2025-08-13T20:54:00.1233915Z [----------] 1 test from RecordDebugHandles
2025-08-13T20:54:00.1234247Z [ RUN      ] RecordDebugHandles.Basic
2025-08-13T20:54:00.1234590Z [       OK ] RecordDebugHandles.Basic (1 ms)
2025-08-13T20:54:00.1235020Z [----------] 1 test from RecordDebugHandles (1 ms total)
2025-08-13T20:54:00.1235304Z
2025-08-13T20:54:00.1235431Z [----------] Global test environment tear-down
2025-08-13T20:54:00.1235793Z [==========] 1 test from 1 test suite ran. (1 ms total)
2025-08-13T20:54:00.1236126Z [  PASSED  ] 1 test.
2025-08-13T20:54:00.1236481Z terminate called after throwing an instance of 'std::system_error'
2025-08-13T20:54:00.1236906Z   what():  Invalid argument
2025-08-13T20:54:00.1237287Z ___________________________ RecordDebugHandles.Basic ___________________________
2025-08-13T20:54:00.1237800Z [gw2] linux -- Python 3.13.5 /opt/conda/envs/py_3.13/bin/python3.13
2025-08-13T20:54:00.1238686Z Internal Error: calling /opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit for test RecordDebugHandles.Basic failed (returncode=-6):
2025-08-13T20:54:00.1239551Z CUDA not available. Disabling CUDA and MultiCUDA tests
2025-08-13T20:54:00.1240048Z Note: Google Test filter = RecordDebugHandles.Basic-*_CUDA:*_MultiCUDA
2025-08-13T20:54:00.1240495Z [==========] Running 1 test from 1 test suite.
2025-08-13T20:54:00.1240848Z [----------] Global test environment set-up.
2025-08-13T20:54:00.1241199Z [----------] 1 test from RecordDebugHandles
2025-08-13T20:54:00.1241542Z [ RUN      ] RecordDebugHandles.Basic
2025-08-13T20:54:00.1241871Z [       OK ] RecordDebugHandles.Basic (1 ms)
2025-08-13T20:54:00.1242249Z [----------] 1 test from RecordDebugHandles (1 ms total)
2025-08-13T20:54:00.1242503Z
2025-08-13T20:54:00.1242641Z [----------] Global test environment tear-down
2025-08-13T20:54:00.1242993Z [==========] 1 test from 1 test suite ran. (19 ms total)
2025-08-13T20:54:00.1243329Z [  PASSED  ] 1 test.
2025-08-13T20:54:00.1243697Z terminate called after throwing an instance of 'std::system_error'
2025-08-13T20:54:00.1244113Z   what():  Invalid argument
2025-08-13T20:54:00.1244392Z unknown file:0: C++ failure
2025-08-13T20:54:00.1244759Z ------------------------------ Captured c++ call -------------------------------
2025-08-13T20:54:00.1245235Z CUDA not available. Disabling CUDA and MultiCUDA tests
2025-08-13T20:54:00.1283768Z ============== 1 failed, 568 passed, 2 rerun in 115.57s (0:01:55) ==============
```

Here's an example of the hang:
https://github.com/pytorch/pytorch/actions/runs/16942186826/job/48015238944
The logs aren't very helpful beyond stating that the run took a long time. This file usually takes under 2 minutes to run.
```
2025-08-13T18:43:24.6586481Z [gw0] [ 97%] PASSED [1.4119s] ../../../../../opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit::PyTorch/LiteInterpreterDynamicTypeTestFixture::Conformance/8
2025-08-13T18:43:24.6587278Z [gw1] [ 97%] PASSED [1.4866s] ../../../../../opt/conda/envs/py_3.13/lib/python3.13/site-packages/torch/bin/test_jit::PyTorch/LiteInterpreterDynamicTypeTestFixture::Conformance/9 Command took >30min, returning 124
2025-08-13T18:43:24.6587288Z
2025-08-13T18:43:24.6587632Z FINISHED PRINTING LOG FILE of cpp/test_jit 1/1 (test/test-reports/cpp.test_jit_1.1_c259e5a152845991_.log)
2025-08-13T18:43:24.6587639Z
```
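
For context, the `returning 124` in the log matches the exit code that the coreutils `timeout` command reports when a command runs out of time. A minimal sketch of that convention, assuming a hypothetical `run_with_timeout` helper (this is not the actual CI runner code):

```python
import subprocess
import sys

TIMEOUT_EXIT_CODE = 124  # same code the coreutils `timeout` command reports


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or 124 if it exceeds timeout_s seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return TIMEOUT_EXIT_CODE


# Hypothetical stand-in for a hanging test binary.
slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
hang_code = run_with_timeout(slow_cmd, timeout_s=0.5)
```

A caller can then tell a hang (124) apart from a crash such as the `returncode=-6` abort seen in the earlier log.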

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160577
Approved by: https://github.com/huydhn
2025-08-15 15:59:21 +00:00
9df07ecfbe Revert "[inductor] dont reuse buffers if it affects peak (#145883) (#159530)"
This reverts commit 3be70dc30e893b552fc0f23ca06cd8f7949b6d08.

Reverted https://github.com/pytorch/pytorch/pull/159530 on behalf of https://github.com/clee2000 due to a newly added test failing internally (D80316528). This is probably just a targets change, but imo the tests should also go into a testcase class from common or inductor utils. While I'm pretty sure CI can run the globally defined ones, there's some CI-related functionality on the testcase class that CI benefits from ([comment](https://github.com/pytorch/pytorch/pull/159530#issuecomment-3191947506))
2025-08-15 15:49:04 +00:00
846963fa9b Revert "[Inductor] addmm + activation function fusion (#158137)"
This reverts commit b9d7de3a094598c3dc0dd52e57bce30eb684c9d8.

Reverted https://github.com/pytorch/pytorch/pull/158137 on behalf of https://github.com/malfet due to Broke inductor torchbench, see 663da17b62/1 ([comment](https://github.com/pytorch/pytorch/pull/158137#issuecomment-3191841298))
2025-08-15 15:34:09 +00:00
663da17b62 Update torch-xpu-ops commit pin (#160062)
Update the torch-xpu-ops commit to [77cc792cd265179745d335579d233e6d4f9a2667](77cc792cd2), which includes:

- Ensure that the XPU cache is cleared before creating tensors during the test
- Add unused variable warning
- Fix test_linalg and test_torch issue with bf32_on_and_off updates
- Fix deterministic indexing with broadcast
- Fix dist.gather with noncontiguous tensor
- Improve accuracy of index put deterministic kernel
- Add generate file rely avoid build before generate
- Optimize embedding bag

Fixes #160661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160062
Approved by: https://github.com/EikanWang
2025-08-15 15:27:24 +00:00
e299926f72 [ONNX] Fix doc typo for symbolic_multi_out (#160702)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160702
Approved by: https://github.com/justinchuby
2025-08-15 14:34:42 +00:00
bbd11c4f23 Uninstall torchao on MPS benchmark (#160724)
Fixes https://github.com/pytorch/pytorch/issues/160689

The current torchao 0.12.0 doesn't work with transformers 4.54.0 and ends up with this error:

```
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/transformers/models/albert/modeling_albert.py", line 37, in <module>
    from ...modeling_utils import PreTrainedModel
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/transformers/modeling_utils.py", line 51, in <module>
    from torchao.quantization import Int4WeightOnlyConfig
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/__init__.py", line 41, in <module>
    from torchao.quantization import (
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/quantization/__init__.py", line 6, in <module>
    from .autoquant import (
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/quantization/autoquant.py", line 11, in <module>
    from torchao.dtypes import (
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/__init__.py", line 1, in <module>
    from . import affine_quantized_tensor_ops
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/affine_quantized_tensor_ops.py", line 38, in <module>
    from torchao.dtypes.uintx.dyn_int8_act_int4_wei_cpu_layout import (
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/uintx/__init__.py", line 7, in <module>
    from .dyn_int8_act_int4_wei_cpu_layout import (
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/dtypes/uintx/dyn_int8_act_int4_wei_cpu_layout.py", line 320, in <module>
    from ...prototype.inductor.fx_passes import register_da8w4_concat_linear_cpu_pass
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/prototype/inductor/fx_passes/__init__.py", line 2, in <module>
    from .int8_sdpa_fusion import _int8_sdpa_init
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/prototype/inductor/fx_passes/int8_sdpa_fusion.py", line 22, in <module>
    from ..int8_sdpa_lowering import register_int8_sdpa  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ec2-user/runner/_work/_temp/venv-3.12-1755212960/lib/python3.12/site-packages/torchao/prototype/inductor/int8_sdpa_lowering.py", line 6, in <module>
    from torch._inductor.kernel.flex_attention import construct_strides, maybe_realize
ModuleNotFoundError: No module named 'torch._inductor.kernel.flex_attention'
```
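
The traceback fails at import time because a submodule that torchao expects is missing from the installed torch. As an illustration only (the PR itself simply uninstalls torchao), a caller could probe for a module before importing it. `module_available` is a hypothetical helper name:

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located (parent packages are imported)."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError).
        return False


# Returns False on a build where this submodule no longer exists,
# and also when torch is not installed at all.
has_flex_attention = module_available("torch._inductor.kernel.flex_attention")
```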

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160724
Approved by: https://github.com/malfet
2025-08-15 13:55:39 +00:00
eaa5d9d3d3 Introduce OpInfo test for testing export on fake device (#160694)
Summary: Prepare for the upcoming diffs for exporting on fake cuda device.

Test Plan:
test

Rollback Plan:

Differential Revision: D80304225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160694
Approved by: https://github.com/dolpm
2025-08-15 07:26:28 +00:00
a7c75ae976 [dde] use sym_or when checking normalized shape in layer_norm (#160683)
Use `sym_eq` to check equality on tuple of ints/symints

### DDE
```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq(u0, u1) (unhinted: Eq(u0, u1)).  (Size-like symbols: u1, u0)

Caused by: return torch.nn.functional.layer_norm(  # test/inductor/test_unbacked_symints.py:527 in fn (_refs/__init__.py:3292 in native_layer_norm)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160683
Approved by: https://github.com/bobrenjc93
2025-08-15 06:56:00 +00:00
f7ad69f59c [dynamic shapes] handle Max(*,1) for inductor layout contiguity (#160578)
Differential Revision: D80214882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160578
Approved by: https://github.com/ZixinYang, https://github.com/bobrenjc93
2025-08-15 06:10:18 +00:00
4cae9cf2df Update triton xpu commit to support python 3.14 (#160183)
Follow PR #159725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160183
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-08-15 05:41:17 +00:00
7710800865 [3/3][ghstack][vllm ci build setup]vllm build workflow (#160116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160116
Approved by: https://github.com/huydhn
2025-08-15 05:35:46 +00:00
aa99e0958f Separate provenance tracking to different levels (#160383)
Summary: As titled. We've received requests from various parties interested in turning on provenance tracking by default. In this PR, we prepare to turn on by default the part of provenance tracking that doesn't add much overhead.

- Change the `provenance_tracking` config to `provenance_tracking_level`
- Turn on the following provenance tracking by default when `basic_provenance_tracking`=True
    - `set_kernel_post_grad_provenance_tracing` for kernels; this adds a mapping between triton kernels and post_grad nodes
    - `dump_inductor_provenance_info` if we're dumping the tlparse log
    - `get_graph_provenance_json` and dump `create_mapping_pre_post_grad_nodes`. This creates a mapping between pre_grad and post_grad nodes. Since we're not turning on provenance tracking in GraphTransformObserver by default, the mapping here may be incomplete/limited.
    - add stack trace from post grad nodes to inductor IR nodes
    - add exception swallowing for all functions above
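
The last bullet (exception swallowing) could look roughly like the sketch below. This is an assumption about the general pattern, not the code in this PR; `swallow_exceptions` and `build_node_mapping` are hypothetical names:

```python
import functools
import logging

log = logging.getLogger(__name__)


def swallow_exceptions(default=None):
    """Decorator: log any exception from the wrapped function, return `default`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Best-effort tracking: never let it break compilation.
                log.exception("provenance tracking failed in %s", fn.__name__)
                return default
        return wrapper
    return decorator


@swallow_exceptions(default={})
def build_node_mapping(nodes):  # hypothetical stand-in for a tracking helper
    return {n["name"]: n["origin"] for n in nodes}
```

The trade-off is that a tracking bug yields an incomplete mapping instead of a failed compile, which suits a feature that is on by default.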

Test Plan:
CI

Rollback Plan:

Differential Revision: D80031559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160383
Approved by: https://github.com/angelayi
2025-08-15 04:59:35 +00:00
3fc7a95176 [audio hash update] update the pinned audio hash (#160485)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160485
Approved by: https://github.com/pytorchbot
2025-08-15 04:27:49 +00:00
858fb80b9b [PT2]: Add Static Dispatch Kernel for wrapped_fbgemm_linear_fp16_weight (#160451)
Summary: Add static dispatch kernel for wrapped_fbgemm_linear_fp16_weight. This optimization should improve perf for all Ads DSNN models using Sigmoid.

Test Plan:
```
MODEL_TYPE=dpa_product_first_ctr_model
MODEL_ENTITY_ID=892669089
SNAPSHOT_ID=37
OTHER_MODEL_ENTITY_ID=892669089
OTHER_SNAPSHOT_ID=36

MODULES=(mix prepare_float_features object user)
SUFFIXES=(.predictor.local .predictor.precompute.prepare_float_features .predictor.precompute.remote_object_only .predictor.precompute.remote_request_only)

for i in "${!MODULES[@]}"; do
MODULE=${MODULES[i]}
SUFFIX=${SUFFIXES[i]}
buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkAB --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --otherNetFile=/data/users/$USER/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice "" --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true
```

Before: P1900475429
I0810 19:29:22.782902 2717337 load_net_predictor_lib.cpp:1807] Average latency A: 0.0843 ms
I0810 19:29:22.782905 2717337 load_net_predictor_lib.cpp:1807] Average latency B: 0.0989 ms

After: P1900825771
I0811 15:42:34.866408 2311279 load_net_predictor_lib.cpp:1807] Average latency A: 0.0854 ms
I0811 15:42:34.866411 2311279 load_net_predictor_lib.cpp:1807] Average latency B: 0.092 ms

Still has some regression but the gap is smaller...
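
To put a number on "the gap is smaller", the relative gap between the A and B latencies quoted above can be computed directly. `relative_gap` is a hypothetical helper, and the sketch makes no claim about which side is the baseline:

```python
def relative_gap(latency_a_ms, latency_b_ms):
    """Gap of B over A, expressed as a fraction of A."""
    return (latency_b_ms - latency_a_ms) / latency_a_ms


# Values copied from the Before/After log lines in this description.
before = relative_gap(0.0843, 0.0989)
after = relative_gap(0.0854, 0.092)
print(f"before: {before:.1%}, after: {after:.1%}")
```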

Rollback Plan:

Reviewed By: henryoier, muchulee8

Differential Revision: D80042054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160451
Approved by: https://github.com/henryoier
2025-08-15 04:06:17 +00:00
55061c9602 [PT2]: Add Static Dispatch Kernel for scale_gradient (#160454)
Summary: Add Static Dispatch Kernel for scale_gradient

Test Plan:
```
MODEL_TYPE=dpa_product_first_ctr_model
MODEL_ENTITY_ID=892669089
SNAPSHOT_ID=37
OTHER_MODEL_ENTITY_ID=892669089
OTHER_SNAPSHOT_ID=36

MODULES=(mix prepare_float_features object user)
SUFFIXES=(.predictor.local .predictor.precompute.prepare_float_features .predictor.precompute.remote_object_only .predictor.precompute.remote_request_only)

for i in "${!MODULES[@]}"; do
MODULE=${MODULES[i]}
SUFFIX=${SUFFIXES[i]}
buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkAB --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --otherNetFile=/data/users/$USER/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice "" --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true
```

Rollback Plan:

Reviewed By: henryoier

Differential Revision: D80062244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160454
Approved by: https://github.com/henryoier
2025-08-15 03:42:39 +00:00
214d04833a [PT2]: Add Static Dispatch Kernel for fmod.Scalar (#160654)
Summary: Add static dispatch for torch.ops.aten.fmod.Scalar. Found this missing in user/object nets for DSNN models.

Test Plan:
```
MODEL_TYPE=dpa_product_first_ctr_model
MODEL_ENTITY_ID=892669089
SNAPSHOT_ID=36
MODULE=user
SUFFIX=.predictor.precompute.remote_request_only

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkByOp --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice="" --benchmarkEnableProfiling=true --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true --benchmarkNumIterations=1000
```

Object tower: P1904347784
User tower: P1904348406

Rollback Plan:

Differential Revision: D80238495

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160654
Approved by: https://github.com/henryoier
2025-08-15 03:11:48 +00:00
9c5601ecc3 [NVIDIA] Refactor Family Blackwell Support codegen (#156176)
With the legacy driver (nvgpu) used for CUDA 12.9, Thor was operating with SM 10.1.
This changes to SM 11.0 when the newer driver model (OpenRM), which is intended for CUDA 13.0, is introduced.
Thor 10.1 --> 11.0
Spark 12.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156176
Approved by: https://github.com/ezyang
2025-08-15 02:51:26 +00:00
5b9ad951f8 [BE][Docker] Do not install cuda:11.8 (#160695)
CUDA 11.8 binaries are no longer produced by CD.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160695
Approved by: https://github.com/huydhn
2025-08-15 02:23:04 +00:00
4d5f92aa39 typing tvm.py (#160369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160369
Approved by: https://github.com/Skylion007
ghstack dependencies: #160362, #160363, #160364, #160365, #160366, #160367, #160368
2025-08-15 02:09:31 +00:00
39ca0ce0c8 Type backend torchxla (#160368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160368
Approved by: https://github.com/Skylion007
ghstack dependencies: #160362, #160363, #160364, #160365, #160366, #160367
2025-08-15 02:09:31 +00:00
d52bb67ac3 typing registry.py (#160367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160367
Approved by: https://github.com/Skylion007
ghstack dependencies: #160362, #160363, #160364, #160365, #160366
2025-08-15 02:09:31 +00:00
05b9b63fb6 typing inductor and placeholder backends (#160366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160366
Approved by: https://github.com/Skylion007
ghstack dependencies: #160362, #160363, #160364, #160365
2025-08-15 02:09:31 +00:00
453cfa5153 typing distributed.py (#160365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160365
Approved by: https://github.com/StrongerXi
ghstack dependencies: #160362, #160363, #160364
2025-08-15 02:09:31 +00:00
9faca5f260 typing debugging.py (#160364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160364
Approved by: https://github.com/Skylion007
ghstack dependencies: #160362, #160363
2025-08-15 02:09:31 +00:00
6fe6dd9fdc Type cudagraphs.py (#160363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160363
Approved by: https://github.com/StrongerXi
ghstack dependencies: #160362
2025-08-15 02:09:31 +00:00
f82c7eed84 Typing for common.py (#160362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160362
Approved by: https://github.com/Skylion007
2025-08-15 02:09:31 +00:00
25ccc4716e [Inductor] [Triton] Apply feedback to Enable padded stride support (#160614)
Summary:
Issue I noticed while fixing tests for TMA store. This triton.language.make_tensor_descriptor call hardcodes the shape information as the stride, which is not necessarily correct.

In particular, it's legal to have a stride bigger than the shape (e.g. padded to a size). A good example of this would be allocating a tensor whose size is always a multiple of 16 and padding the result so that TMA is legal.
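
A small sketch of why the stride cannot be derived from the shape in that case. The helper names are hypothetical and the code is plain Python, independent of Triton:

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def contiguous_strides(shape):
    """Strides (in elements) assuming an unpadded row-major 2-D layout."""
    _rows, cols = shape
    return (cols, 1)


def padded_strides(shape, pad_to=16):
    """Strides when each row is padded up to a multiple of `pad_to` elements."""
    _rows, cols = shape
    return (round_up(cols, pad_to), 1)


shape = (4, 10)
assumed = contiguous_strides(shape)  # what hardcoding the shape implies
actual = padded_strides(shape)       # what the padded allocation really uses
```

For a `(4, 10)` tensor the shape-derived row stride is 10, while the padded buffer steps 16 elements per row, so a descriptor built from the shape would read the wrong rows.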

This is a redo of https://github.com/pytorch/pytorch/pull/160493, which I accidentally broke by trying to land internally first instead of merging through GitHub directly.

Test Plan:
Tested with `buck2 run mode/opt-split-dwarf mode/inplace -c fbcode.nvcc_arch=h100 caffe2/test/inductor:max_autotune 2>&1 | tee ~/test_logs.log` and confirmed all max autotune tests passed.

Rollback Plan:

Differential Revision: D80224578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160614
Approved by: https://github.com/eellison
2025-08-15 02:06:14 +00:00
d387a48c38 [generator] Raise StopIteration(value) with value from the return stmt (#157152)
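For reference, the CPython behavior named in the title: a `return value` inside a generator surfaces as `StopIteration(value)`. The sketch below shows only the plain-Python semantics, not how Dynamo models them:

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1
    return "done"  # becomes StopIteration("done")


def drain(gen):
    """Collect yielded items and the generator's return value."""
    items = []
    while True:
        try:
            items.append(next(gen))
        except StopIteration as stop:
            return items, stop.value


items, result = drain(countdown(3))
```
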
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157152
Approved by: https://github.com/zou3519
ghstack dependencies: #157148
2025-08-15 01:42:40 +00:00
831e85104a [contextlib] Fixes for CPython contextlib tests (#157148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157148
Approved by: https://github.com/zou3519
2025-08-15 01:42:40 +00:00
211c98859a [inductor][triton] Update triton_builtin handling after triton # 7239 (#160658)
https://github.com/triton-lang/triton/pull/7239 will search for a _semantic kwarg in the signature of the function before passing in this kwarg. To fix this in Inductor:

1. explicitly take a _semantic kwarg
2. remove the functools.wraps around the wrapper function, which was causing inspect.signature to return the signature of the wrapped function (instead of the signature of the wrapper, which does contain the _semantic arg)
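
The interaction in step 2 can be reproduced with the standard library alone. `functools.wraps` sets `__wrapped__`, and `inspect.signature` follows it by default, so the wrapper's own `_semantic` parameter is hidden. The function names below are hypothetical stand-ins, not Inductor code:

```python
import functools
import inspect


def builtin_stub(x):  # stand-in for the wrapped Triton builtin
    return x


def wrap_with_wraps(fn):
    @functools.wraps(fn)
    def wrapper(*args, _semantic=None, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


def wrap_plain(fn):
    def wrapper(*args, _semantic=None, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


def has_semantic_kwarg(fn):
    """Mimic a caller that inspects the signature for `_semantic`."""
    return "_semantic" in inspect.signature(fn).parameters


hidden = has_semantic_kwarg(wrap_with_wraps(builtin_stub))  # signature of builtin_stub
visible = has_semantic_kwarg(wrap_plain(builtin_stub))      # signature of wrapper
```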

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160658
Approved by: https://github.com/PaulZhang12, https://github.com/njriasan
2025-08-15 00:39:24 +00:00
dae7710bf2 [cuda][cupy] Improve cupy device placement when device is provided with explicit index (#158529)
Resubmit of https://github.com/pytorch/pytorch/pull/158320, fixing a potential bug when the device index is not specified explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158529
Approved by: https://github.com/ezyang
2025-08-15 00:27:42 +00:00
dc194a3096 Test multiprocessing spawn timing fix (#160672)
Submitting PR to fix #160511.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160672
Approved by: https://github.com/mikaylagawarecki
2025-08-15 00:11:55 +00:00
4051b42c29 [ROCm] hipify needs specific header mappings (#160675)
Fixes #160579.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160675
Approved by: https://github.com/ScottTodd, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-15 00:09:04 +00:00
eb0eaa67e1 [BE][ci] Increase frequency of cutlass backend ci (#160656)
* increase frequency from every 24 hours to every 12 hours
* automatically enable it if cutlass backend files are touched.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160656
Approved by: https://github.com/eellison
2025-08-14 23:44:55 +00:00
98373e5ad2 [doc] AOTI debugging guide (#160430)
Folded from https://discuss.pytorch.org/t/a-beginners-guide-to-debugging-aot-inductor-cuda-illegal-memory-access/222188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160430
Approved by: https://github.com/angelayi
2025-08-14 23:42:17 +00:00
371eacb2ae [Dynamo][Hierarchical Compile] Refactor for tuple flattening (#158810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158810
Approved by: https://github.com/StrongerXi
2025-08-14 22:45:44 +00:00
3650989e6e Revert "[cutlass] fix dictionary iteration error (#160552)"
This reverts commit 29d20d49f0b7f4e362e1cefdcdc4b5659969312c.

Reverted https://github.com/pytorch/pytorch/pull/160552 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160552#issuecomment-3189940880))
2025-08-14 21:41:28 +00:00
3be70dc30e [inductor] dont reuse buffers if it affects peak (#145883) (#159530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159530
Approved by: https://github.com/eellison
2025-08-14 21:14:36 +00:00
47a1db823d [triton_heuristics] Optimize the triton launcher in pt2 (#160000)
Summary:

(Original author: Xu Zhao. Commandeered by David to land this since it is relatively urgent)

We observed ~10us PT2-Triton launch overhead regression after pin update.

Before Triton pin-update:
 {F1980557238}

After Triton pin-update:
 {F1980557240}

The root cause is that https://github.com/pytorch/pytorch/pull/145051 adds `_get_args_with_constexprs` to the cubin launcher caller function, which is on the critical path.

The motivation for `_get_args_with_constexprs` was that between triton 3.2 and triton 3.3, the convention for calling Triton kernels (at the level that non-static-cuda-launcher inductor integrates) changed. Previously, the callable did not take constexpr arguments as parameters; after 3.3, it does. With pointwise/reduction kernels, we don't know the constexpr values until after autotuning occurs; so `_get_args_with_constexprs` would inject constexprs into the arguments list before calling the Triton kernel. The fix (in this PR) is to instead inject the constexpr args into the launcher string - this avoids the cost of sorting/reordering arguments which previously occurred upon execution of each kernel.
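
The difference between the two approaches can be sketched in plain Python. This is a simplified illustration under assumed names (`call_with_constexprs`, `make_launcher`), not the generated launcher code from this PR:

```python
def call_with_constexprs(kernel, arg_names, constexprs, runtime_args):
    """Old style: merge runtime args and constexprs on every launch."""
    it = iter(runtime_args)
    merged = [constexprs[n] if n in constexprs else next(it) for n in arg_names]
    return kernel(*merged)


def make_launcher(arg_names, constexprs):
    """New style: bake constexpr values into the launcher source once."""
    params = [n for n in arg_names if n not in constexprs]
    # repr() is enough for simple constants such as ints, bools and strings.
    call_args = [repr(constexprs[n]) if n in constexprs else n for n in arg_names]
    src = (
        f"def launcher({', '.join(params + ['kernel'])}):\n"
        f"    return kernel({', '.join(call_args)})\n"
    )
    scope = {}
    exec(src, scope)
    return scope["launcher"]


names = ["x_ptr", "n_elements", "BLOCK"]
consts = {"BLOCK": 128}
echo = lambda *args: args  # stand-in for the compiled kernel
launcher = make_launcher(names, consts)
```

With the generated launcher, each call is a direct function call with no per-launch list building, which is the overhead the paragraph above describes.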

Note that because static_cuda_launcher.py does not require constants to be passed to the cubin launcher (e96c7c4bb0/torch/_inductor/runtime/static_cuda_launcher.py (L220)), there is no need to pass constexprs to the generated launcher code in that case.

The new launcher code needs to work in three cases:
- StaticallyLaunchedCudaKernel
- triton.compile.CompiledKernel
- AOTInductor

Analysis: https://docs.google.com/document/d/1PHaSmx2w59K8qpjw5_qzKWShfEgptf_Zpv_DL7YxiWU/edit?tab=t.0

Test Plan:
Before:
```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs

1.893x
```

```

$ buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00760921                       1.80298                                   0.623282                         5.25024                                  0.203722
     19                      0.00799885                       4.78223                                   1.00226                          5.8213                                   0.239084
average                      0.00780403                       3.29261                                   0.812769                         5.53577                                  0.221403
```

After:

```
buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00747067                       1.92589                                   0.726509                         4.35459                                  0.204205
     19                      0.00747823                       7.36852                                   1.26241                          6.28208                                  0.239278
average                      0.00747445                       4.6472                                    0.994459                         5.31834                                  0.221741
```

```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs

1.985x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160000
Approved by: https://github.com/jansel, https://github.com/mlazos

Co-authored-by: Xu Zhao <xzhao9@meta.com>
2025-08-14 21:04:08 +00:00
eac2d9d695 Revert "appending the pythonpath (#160219)"
This reverts commit 1d80d697a269234b47ec7ede192faf3bb9b159e3.

Reverted https://github.com/pytorch/pytorch/pull/160219 on behalf of https://github.com/clee2000 due to broke inductor? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16970222746/job/48108262003) [HUD commit link](1d80d697a2) ([comment](https://github.com/pytorch/pytorch/pull/160219#issuecomment-3189850381))
2025-08-14 20:58:14 +00:00
3fe19a7a0a [Test Fix] Delete dynamo skipfile for OpenMP test_one_thread (#160562)
Fixes #120648

During issue scrubbing I could not repro these failing tests, so I am re-enabling them to close out the issue.

### Test
Original repro command:
```
 PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_openmp.py -v -k test_one_thread
```

Now results in
```
platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0 -- /home/lucaskabela/.conda/envs/pytorch-3.12/bin/python3.12
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /home/lucaskabela/pytorch
configfile: pytest.ini
plugins: hypothesis-6.138.0
collected 2 items / 1 deselected / 1 selected
Running 1 items in this shard

test/test_openmp.py::TestOpenMP_ParallelFor::test_one_thread PASSED [3.6874s]                                                       [100%]

===================================================== 1 passed, 1 deselected in 6.07s =====================================================
```

And:
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_openmp.py TestOpenMP_ParallelFor.test_one_thread
```
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_sort_and_select.py TestSortAndSelectCPU.test_sort_overflow_cpu_int16
```

Both result in:
```
.
----------------------------------------------------------------------
Ran 1 test in 0.003s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160562
Approved by: https://github.com/zou3519
2025-08-14 20:55:59 +00:00
4a90dc0c1f Update checkpoint warning to target PyTorch 2.9 (#160643)
Fixes #160534

Updates the warning in torch.utils.checkpoint to state that starting in PyTorch 2.9, calling checkpoint without explicitly passing use_reentrant will raise an exception. Follows the guidance from the issue discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160643
Approved by: https://github.com/soulitzer
2025-08-14 20:53:17 +00:00
1fc683cf17 [Inductor] Allow indexing a flexible layout for extract_input_node_reduction_ranges (#160645)
Differential Revision: D79831747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160645
Approved by: https://github.com/eellison
2025-08-14 20:43:35 +00:00
b9d7de3a09 [Inductor] addmm + activation function fusion (#158137)
This PR implements a pass in post_grad to fuse activation(add + mm).

Something similar was previously done in #106912 but was reverted for performance reasons. It was replaced with a pass that unfuses the activation and add from addmm/addmm_activation and lets Inductor handle the fusion.

Since then, however, the cuBLAS team has made a lot of perf improvements here. I will update this post with more benchmarks, but preliminary benchmarks show good results.

Perf dashboard:
<img width="3371" height="1240" alt="Screenshot from 2025-08-07 13-41-35" src="https://github.com/user-attachments/assets/d44d6205-b33a-4a20-9f0f-d9db176b3738" />

ReLU works with both training and inference, but GELU only works in inference mode due to a fundamental limitation: GELU's derivative depends on the input, while ReLU's doesn't. I don't think this is fixable with the current addmm_activation API.
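
One way to read that limitation, sketched in plain Python under the assumption that the backward pass only has the fused op's output available: the ReLU derivative can be rebuilt from the output, while every term of the tanh-GELU derivative needs the pre-activation input. The function names are hypothetical, and the constants are the ones that appear in the GELU graph in this description:

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)  # approximately 0.79788456
C = 0.044715


def relu(x):
    return max(x, 0.0)


def relu_grad_from_output(y):
    """ReLU derivative recovered from the output y alone."""
    return 1.0 if y > 0.0 else 0.0


def gelu_tanh(x):
    return 0.5 * x * (1.0 + math.tanh(SQRT_2_OVER_PI * (x + C * x**3)))


def gelu_tanh_grad(x):
    """Derivative of tanh-GELU; it is written in terms of the input x."""
    t = math.tanh(SQRT_2_OVER_PI * (x + C * x**3))
    du_dx = SQRT_2_OVER_PI * (1.0 + 3.0 * C * x**2)
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du_dx
```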

Graph module before and after this pass

Relu(addmm)
```
graph():
    %primals_1 : [num_users=1] = placeholder[target=primals_1]
    %primals_2 : [num_users=2] = placeholder[target=primals_2]
    %primals_3 : [num_users=2] = placeholder[target=primals_3]
    %addmm : [num_users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {})
    %relu : [num_users=2] = call_function[target=torch.ops.aten.relu.default](args = (%addmm,), kwargs = {})
    %le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%relu, 0), kwargs = {})
    %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {})
    return (relu, primals_2, le, permute_1)
graph():
    %primals_1 : [num_users=1] = placeholder[target=primals_1]
    %primals_2 : [num_users=2] = placeholder[target=primals_2]
    %primals_3 : [num_users=2] = placeholder[target=primals_3]
    %_addmm_activation_default : [num_users=2] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {})
    %le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%_addmm_activation_default, 0), kwargs = {})
    %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {})
    return (_addmm_activation_default, primals_2, le, permute_1)
```
Gelu (addmm)
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %addmm : [num_users=4] = call_function[target=torch.ops.aten.addmm.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {})
    %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, %addmm), kwargs = {})
    %mul_1 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul, %addmm), kwargs = {})
    %mul_2 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_1, 0.044715), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%addmm, %mul_2), kwargs = {})
    %mul_3 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 0.7978845608028654), kwargs = {})
    %mul_4 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, 0.5), kwargs = {})
    %tanh : [num_users=1] = call_function[target=torch.ops.aten.tanh.default](args = (%mul_3,), kwargs = {})
    %add_1 : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%tanh, 1), kwargs = {})
    %mul_5 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_4, %add_1), kwargs = {})
    return (mul_5,)
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %_addmm_activation_default : [num_users=1] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {use_gelu: True})
    return (_addmm_activation_default,)
```

Benchmark setup:
NGC pytorch 25.06 container
cublas version: 12.9.1.4
torch.compile ran with dynamic = False and max_autotune

H100
```
Testing with M=1024, N=1024, K=1024, dtype=bfloat16
============================================================
Average Time per Iteration (cublas):	 0.0107 ms
Average Time per Iteration (torch compile):	 0.0296 ms

============================================================
Testing with M=2048, N=2048, K=2048, dtype=bfloat16
============================================================
Average Time per Iteration (cublas):	 0.0262 ms
Average Time per Iteration (torch compile):	 0.0327 ms

============================================================
Testing with M=4096, N=4096, K=4096, dtype=bfloat16
============================================================
Average Time per Iteration (cublas):	 0.1763 ms
Average Time per Iteration (torch compile):	 0.2457 ms

============================================================
Testing with M=8192, N=8192, K=8192, dtype=bfloat16
============================================================
Average Time per Iteration (cublas):	 1.5280 ms
Average Time per Iteration (torch compile):	 1.9437 ms
```

A100
```
############################################################
Testing with dtype: float16
############################################################

============================================================
Testing with M=1024, N=1024, K=1024, dtype=float16
============================================================
Average Time per Iteration (cublas):	 0.0313 ms
Average Time per Iteration (torch compile):	 0.0643 ms

============================================================
Testing with M=2048, N=2048, K=2048, dtype=float16
============================================================
Average Time per Iteration (cublas):	 0.1149 ms
Average Time per Iteration (torch compile):	 0.1255 ms

============================================================
Testing with M=4096, N=4096, K=4096, dtype=float16
============================================================
Average Time per Iteration (cublas):	 0.6297 ms
Average Time per Iteration (torch compile):	 0.7547 ms

============================================================
Testing with M=8192, N=8192, K=8192, dtype=float16
============================================================
Average Time per Iteration (cublas):	 4.3821 ms
Average Time per Iteration (torch compile):	 5.0740 ms
```

Script
```py
import torch
torch.manual_seed(0)

warmup, numrun = 10, 100

sizes = [1024, 2048, 4096, 8192]
dtypes = [torch.float16, torch.bfloat16, torch.float32]

device = torch.device("cuda")

for dtype in dtypes:
    dtype_name = str(dtype).split('.')[-1]
    print(f"\n{'#'*60}")
    print(f"Testing with dtype: {dtype_name}")
    print(f"{'#'*60}")

    for size in sizes:
        M, N, K = size, size, size
        print(f"\n{'='*60}")
        print(f"Testing with M={M}, N={N}, K={K}, dtype={dtype_name}")
        print(f"{'='*60}")

        A = torch.randn(M, K, device=device, dtype=dtype)
        B = torch.randn(K, N, device=device, dtype=dtype)
        C = torch.randn(M, device=device, dtype=dtype)

        def func1():
            return torch._addmm_activation(C, A, B, use_gelu=True)

        def func2():
            return torch.nn.functional.gelu(torch.add(C, torch.mm(A, B)), approximate="tanh")

        func2_compiled = torch.compile(
            func2,
            dynamic=False,
            options={
                "force_disable_caches": True,
                "max_autotune": True,
                "max_autotune_gemm": True,
                "max_autotune_gemm_backends": "TRITON",
                "autotune_fallback_to_aten": False,
            }
        )

        for _ in range(warmup): func1()
        torch.cuda.synchronize(device=device)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)

        total_time_ms = 0.0
        start_event.record()
        for _ in range(numrun): func1()
        end_event.record()
        torch.cuda.synchronize(device=device)
        total_time_ms += start_event.elapsed_time(end_event)
        avg_time_ms = total_time_ms / numrun

        print(f"Average Time per Iteration (cublas):\t {avg_time_ms:.4f} ms")

        for _ in range(warmup): func2_compiled()
        torch.cuda.synchronize(device=device)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)

        total_time_ms = 0.0
        start_event.record()
        for _ in range(numrun): func2_compiled()
        end_event.record()
        torch.cuda.synchronize(device=device)
        total_time_ms += start_event.elapsed_time(end_event)
        avg_time_ms = total_time_ms / numrun

        print(f"Average Time per Iteration (torch compile):\t {avg_time_ms:.4f} ms")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158137
Approved by: https://github.com/eellison
2025-08-14 20:41:38 +00:00
1028c5e2d5 [Dynamo] Add CPython default dict tests (#155263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155263
Approved by: https://github.com/zou3519
2025-08-14 20:22:22 +00:00
19b4283884 Typo correction in variable name uninitalized_val in resize() function (#160636)
Fixes #160633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160636
Approved by: https://github.com/mikaylagawarecki, https://github.com/Skylion007
2025-08-14 20:11:43 +00:00
8d6d324631 [Dynamo][Hierarchical-Compile] Don't allow node duplicates to be added (#160605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160605
Approved by: https://github.com/StrongerXi
2025-08-14 20:02:10 +00:00
fdfd69bb05 Set PYTHONHOME for inductor subprocesses using torch (#160008)
This is needed for subprocesses that are trying to call back into torch functionality, i.e. anything that's also setting `PYTHONPATH`.  If they're part of an application that bundles the Python runtime, then they should use the bundled runtime to keep their view of the world consistent.

There are more `sys.executable` subprocesses in torch/ but it seems like they're fine.

A previous PR, https://github.com/pytorch/pytorch/pull/159382, was reverted because it caused macOS jobs on GitHub to time out. Inductor subprocesses were scheduling C++ compilation tasks that failed to find the Python.h header: they were running in venvs and were now looking for the CPython headers inside the venv, where the headers do not exist. This PR gates the new behavior to internal builds only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160008
Approved by: https://github.com/aorenste
2025-08-14 19:57:14 +00:00
0d3461bac0 DOC: update CrossEntropyLoss with note and example of incorrect target specification (#155649)
Fixes #134771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155649
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2025-08-14 18:34:57 +00:00
65053c03a3 [FR] Don't check incomplete ranks for printing (#160195)
When just printing the ranks (`-j` option) we should skip the check for "incomplete ranks" since that doesn't affect the print

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160195
Approved by: https://github.com/fduwjj
ghstack dependencies: #160097
2025-08-14 18:19:45 +00:00
96f9fbe21a Fix flight recorder for P2P ops (#160097)
Fixes errors in debugging a trace as mentioned in https://docs.google.com/document/d/1EKVJYmW2hj_VsvDvnSggXhZzJyvMu9dA0iDJWOZAtjY/edit?tab=t.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160097
Approved by: https://github.com/fduwjj
2025-08-14 18:19:45 +00:00
1c25871191 Allow torch.hub.load with unauthorized GITHUB_TOKEN (#159896)
Allow torch.hub.load with unauthorized GITHUB_TOKEN

`torch.hub.load` fails if a `GITHUB_TOKEN` with insufficient permissions is set, as the following example shows. Make sure the model has not been cached beforehand, for example by running `rm -rf ~/.cache/torch`. If the model has already been downloaded, it is not downloaded again and the authorization error does not occur.

```python
export GITHUB_TOKEN=""
python
>>> import torch
>>> torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 567, in load
    repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 231, in _get_cache_or_reload
    _validate_not_a_forked_repo(repo_owner, repo_name, ref)
  File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 191, in _validate_not_a_forked_repo
    response = json.loads(_read_url(Request(url, headers=headers)))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/site-packages/torch/hub.py", line 174, in _read_url
    with urlopen(url) as r:
         ^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/urllib/request.py", line 521, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/urllib/request.py", line 630, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/urllib/request.py", line 559, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "~/miniconda3/lib/python3.12/urllib/request.py", line 639, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unauthorized
```

The cause of the error is that the function `_validate_not_a_forked_repo` in `hub.py` always uses `GITHUB_TOKEN` for authorization, even when downloading does not require authorization.

0ba09a6d34/torch/hub.py (L194)

This fix simply retries the download without the token in case of a failure.
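
A minimal sketch of that retry idea, using only the standard library. The helper name `read_url_with_token_fallback`, the injectable `opener` parameter, and the `token ...` header format are assumptions for illustration; this is not the code in `torch/hub.py`.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def read_url_with_token_fallback(url, token=None, opener=urlopen):
    """Fetch `url` with the token first; on an HTTP error, retry anonymously.

    Hypothetical helper: `opener` is injectable so the logic can be tested offline.
    """
    if token:
        try:
            authed = Request(url, headers={"Authorization": f"token {token}"})
            with opener(authed) as response:
                return response.read()
        except HTTPError:
            # The token was rejected (for example with 401); public data
            # may still be readable without any credentials.
            pass
    with opener(Request(url)) as response:
        return response.read()
```

If the anonymous request also fails, its `HTTPError` propagates to the caller.
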
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159896
Approved by: https://github.com/albanD
2025-08-14 18:15:49 +00:00
6c05ea6475 [DTensor] add op support: aten.squeeze_.dim (#159532)
**Summary**
This PR enables the in-place op `aten.squeeze_.dim` on DTensor with a change to
the DTensor dispatch logic: when processing an in-place operator, we should assign
`output_sharding.output_spec` back to the first argument. This is because
the in-place op_call on `arg._local_tensor` could also shift the tensor meta.

**Test**
`pytest test/distributed/tensor/test_view_ops.py -s -k  test_squeeze_`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159532
Approved by: https://github.com/zpcore
2025-08-14 18:01:19 +00:00
5665dc9ab7 [PP] Allow larger world_size schedule tests (#160559)
Update schedule tests to use `world_size=4`, changes needed:
- Move some tests that require world_size=2 to new class
- Move helper methods from class level to function level
- Update some initialization to pass assert since gradients were super small.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160559
Approved by: https://github.com/wconstab
ghstack dependencies: #159591, #160558
2025-08-14 17:41:58 +00:00
2ff7c1c774 [PP] Rename _load_actions and validate (#160558)
Rename method and add validation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160558
Approved by: https://github.com/wconstab
ghstack dependencies: #159591
2025-08-14 17:41:58 +00:00
3028fa6ce9 Wrap class definitions in set_fullgraph(False) in test_list/tuple (#160277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160277
Approved by: https://github.com/zou3519
ghstack dependencies: #160216, #160217, #160276, #160278, #160330, #160331
2025-08-14 17:29:45 +00:00
077cb38974 Add dtype checks in meta dispatch for various ordering ops (#159556)
This adds data type checks for the unsupported bool and complex types for argmax/argmin, topk, sort, minimum, and maximum, as listed here:

0a99b026d6/torch/testing/_internal/common_methods_invocations.py (L21076)

Currently these ops fail during the CPU or CUDA computation rather than at the meta dispatch stage, as `max` does, for example: 0a99b026d6/aten/src/ATen/native/TensorCompare.cpp (L285). This change catches the error early.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159556
Approved by: https://github.com/janeyx99
2025-08-14 17:06:27 +00:00
cd8d8c18f5 [pytorch][dynamo_compile] Log graph_node_shape to dynamo_compile (#160556)
This PR adds the dynamo graph node shape logging to dynamo compile. Also added unit tests to check if correct graph node shape is being logged.

Test Plan:
$ python -m test_utils
Ran 12 tests in 36.447s
OK

Note: Will merge after D80185628 lands.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160556
Approved by: https://github.com/masnesral, https://github.com/jingsh
2025-08-14 16:42:35 +00:00
63654ba4c5 [BE][Dynamo] Type improvements in _dynamo/utils to generics (#159824)
Follow up to #159580

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159824
Approved by: https://github.com/williamwen42
2025-08-14 16:06:50 +00:00
7e27347fd3 [SymmMem] Check return of nvshmem_malloc (#160603)
`nvshmem_malloc` returns a null pointer when allocation fails. We should check here.
Otherwise, the nullptr can go down the road and into the device kernel, causing CUDA illegal memory access.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160603
Approved by: https://github.com/fduwjj, https://github.com/ngimel
2025-08-14 15:57:55 +00:00
1d80d697a2 appending the pythonpath (#160219)
Fixes #160193

`PYTHONPATH=/torchbench` to `PYTHONPATH=/torchbench:$PYTHONPATH` in [pytorch/.ci/pytorch/test.sh](b5fd7223b1/.ci/pytorch/test.sh (L1715))
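
The same idea expressed in Python, as a small sketch: put the new entry first while keeping whatever was already on `PYTHONPATH`. The helper name `prepend_to_pythonpath` is hypothetical. Unlike the shell form, it skips the separator when the variable is unset or empty.

```python
import os


def prepend_to_pythonpath(env, new_entry):
    """Return a copy of `env` with `new_entry` placed ahead of any existing PYTHONPATH."""
    existing = env.get("PYTHONPATH", "")
    updated = dict(env)
    if existing:
        updated["PYTHONPATH"] = f"{new_entry}{os.pathsep}{existing}"
    else:
        updated["PYTHONPATH"] = new_entry
    return updated
```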

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160219
Approved by: https://github.com/malfet
2025-08-14 15:55:31 +00:00
b6b74aed60 [ROCm] Support large inputs for coalesceValuesKernel (#158281)
# Description

`.coalesce` cannot handle large inputs on ROCm because of the maximum grid size limit.

This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for the original `Y` on ROCm to avoid this limitation.
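
The index arithmetic behind such a split can be sketched in plain Python. This covers only splitting `X` across two grid axes; `split_grid_x`, `linear_block_index`, and the `max_dim` value are hypothetical, and the actual ROCm limit is not stated here.

```python
def split_grid_x(num_blocks, max_dim):
    """Split a 1-D block count into (grid_x, grid_y), each at most max_dim."""
    if num_blocks <= 0 or max_dim <= 0:
        raise ValueError("num_blocks and max_dim must be positive")
    grid_x = min(num_blocks, max_dim)
    grid_y = -(-num_blocks // grid_x)  # ceiling division
    if grid_y > max_dim:
        raise ValueError("block count does not fit in two grid axes")
    return grid_x, grid_y


def linear_block_index(block_x, block_y, grid_x):
    """Recover the original 1-D block index from the 2-D block coordinates."""
    return block_y * grid_x + block_x
```

Because `grid_x * grid_y` can exceed `num_blocks`, a kernel using this scheme would need to skip blocks whose recovered index is out of range.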

Confirmed the new approach can handle large inputs. Correctness needs validation.

# Testing Command

`python torch_spmv.py 22500000 272500000`

## Script `torch_spmv.py`

``` python
import torch
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch"
    )
    parser.add_argument("n", type=int, help="Size of the NxN matrix")
    parser.add_argument("nnz", type=int, help="Number of non-zero entries")
    return parser.parse_args()

def main():
    args = parse_args()
    n = args.n
    nnz = args.nnz
    dtype = torch.float32
    device = torch.device('cuda')

    # Generate random indices for the sparse matrix in COO format.
    torch.manual_seed(42)
    rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    indices = torch.stack([rows, cols], dim=0)

    # Generate random values.
    values = torch.randn(nnz, dtype=torch.float32, device=device)

    # Create the sparse COO matrix and move it to the target device.
    sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=torch.float32, device=device)
    sparse_matrix = sparse_matrix.coalesce()

    # Generate a random dense vector.
    dense_vector = torch.randn(n, dtype=torch.float32, device=device)

    # Perform sparse matrix - dense vector multiplication.
    # Using torch.sparse.mm which expects a 2D tensor for the vector.
    result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze()
    # result = torch.mv(sparse_matrix, dense_vector)

    # Print the result.
    print("Result of the multiplication:")
    print(torch.sum(result))

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281
Approved by: https://github.com/jeffdaily
2025-08-14 15:09:16 +00:00
4a773e1e86 Warn when there is side effect in strict mode (#160060)
Differential Revision: [D79784354](https://our.internmc.facebook.com/intern/diff/D79784354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160060
Approved by: https://github.com/zhxchen17, https://github.com/StrongerXi
2025-08-14 14:59:44 +00:00
198b5fd2d4 [PP] Add DualPipeV schedule (#159591)
Added the DualPipeV schedule according to http://github.com/deepseek-ai/DualPipe/blob/main/dualpipe/dualpipev.py#L11

<img width="3633" height="486" alt="image" src="https://github.com/user-attachments/assets/4e843bb9-87cd-4d11-936c-7dfe8ee12f16" />

This schedule doesn't perform the actual "overlap" during execution, but provides the scaffolding and schedule definition we need to run it E2E in torchtitan. Supporting the overlapped operation will be worked on in following PRs.

Tests:
```sh
python test/distributed/pipelining/test_schedule_multiproc.py -k test_v_shape_schedules
python test/distributed/pipelining/test_schedule.py -k test_pipeline_order_for_v_schedules
```

Also tested in TorchTitan and is running.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159591
Approved by: https://github.com/wconstab
2025-08-14 14:58:35 +00:00
20bdabbb3c [Dynamo] Fix MTIA dynamo backend by avoiding has_triton() at import time (#160604)
# Summary
MTIA's torch.compile tests were broken by D80037015. (For details, see internal task T234563969.) The root cause was that `has_triton` can change state after we call `torch.mtia.init()`, but it was used in a way that fixes Inductor's behavior at import time. (Note that `has_triton` is cached, and there's no opportunity to call `torch.mtia.init()` prior to `import torch`.)

To fix this, we use `try: import triton` as opposed to `has_triton()` at the module level.

# Test Plan

See the internal diff. As a follow-up, we will add appropriate unit tests and/or CI hints so this type of issue can be caught at PR/diff time.

Differential Revision: D80228000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160604
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
2025-08-14 14:54:49 +00:00
d556586448 [cutlass backend] re-add pip cutlass path (#160180)
Revert #156651 to allow using the cutlass pip package, which is easier for users than a Git checkout or similar method.

Also fix a bug where the pip cutlass path wasn't available to subprocesses spawned during benchmarking for algorithm selection. It looks like the "spawn" method does not inherit the (potentially) already set up `config.cuda.cutlass_dir`, so in the subprocess the include paths are still set to `"../third_party/cutlass/"`, leading to compilation failures due to missing headers.

Ensure `try_import_cutlass` is called at that point; thanks to caching it is a no-op in most cases, so it doesn't hurt.
Change the logic to return `None` when cutlass isn't available, returning more useful values for include paths, namely an empty list. This is in line with other inductor code, which disables the CUTLASS backend when `try_import_cutlass` returns False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160180
Approved by: https://github.com/henrylhtsang, https://github.com/mlazos
2025-08-14 14:48:31 +00:00
781e9a7724 Fix meta for constant_pad_nd (#159878)
Fixes https://github.com/pytorch/pytorch/issues/144187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159878
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2025-08-14 14:47:47 +00:00
e4de93f6a3 Add sm50 and sm60 back to windows builds (#160586)
Addresses the issue reported in  https://github.com/pytorch/pytorch/issues/160575
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160586
Approved by: https://github.com/malfet
2025-08-14 12:46:35 +00:00
a5652407e4 [CI] Fix triton xpu build on Windows (#160442)
Pin the ninja version to 1.11

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160442
Approved by: https://github.com/atalman
2025-08-14 12:43:49 +00:00
6f0f4e0c3e reduce threshold to suggest changes to expected results (#160463)
Since we increased the threshold to 10%, I would like suggestions to update the expected results to show up at ±2% instead of the current 3.3%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160463
Approved by: https://github.com/jamesjwu
2025-08-14 09:11:27 +00:00
db763b1717 [Intel GPU] Support SDPA backend selection and priority setting on XPU (#159464)
Currently, SDPA on XPU uses its own `priority_order` instead of the one from the global context. Hence it does not support `with sdpa_kernel(order, set_priority=True)`.

This PR enables this feature. To make the default `priority_order` from the global context work for XPU, I also moved the MATH backend to the lowest priority; otherwise `cudnn attention` and `overrideable attention` would never be selected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159464
Approved by: https://github.com/guangyey, https://github.com/drisspg

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: mayuyuace <qiming1.zhang@intel.com>
2025-08-14 08:55:31 +00:00
089c4a1ba0 Fix wrong log file name in the docs of torch.distributed.elastic.multiprocessing.start_processes() (#160396)
Fixes #160395

In https://docs.pytorch.org/docs/stable/elastic/multiprocessing.html#starting-multiple-workers and also in the code comment of the function[1], it was specified that:

```
    For each process, the ``log_dir`` will contain:

    #. ``{local_rank}/error.json``: if the process failed, a file with the error info
    #. ``{local_rank}/stdout.json``: if ``redirect & STDOUT == STDOUT``
    #. ``{local_rank}/stderr.json``: if ``redirect & STDERR == STDERR``
```

While in code[2], the files are `stdout.log` and `stderr.log`, instead of the `.json` ones listed in the doc.
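
As a small illustration of the corrected names, the per-rank files can be listed like this. `expected_log_files` is a hypothetical helper, and it assumes the `{local_rank}/...` layout quoted from the docstring above.

```python
from pathlib import Path


def expected_log_files(log_dir, local_rank):
    """Map each per-rank output to its path under `log_dir`, using the corrected names."""
    base = Path(log_dir) / str(local_rank)
    return {
        "error": base / "error.json",   # written only if the process failed
        "stdout": base / "stdout.log",  # when stdout is redirected
        "stderr": base / "stderr.log",  # when stderr is redirected
    }
```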

[1]: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/__init__.py#L144-L145
[2]: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L354-L357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160396
Approved by: https://github.com/fduwjj
2025-08-14 08:24:07 +00:00
97c8c98f8d measure dispatch overhead (#160504)
Reopen https://github.com/pytorch/pytorch/pull/159699 to merge to main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160504
Approved by: https://github.com/wconstab
2025-08-14 06:13:53 +00:00
39aa3d1471 Remove the dead code in setup.py (#160515)
The following line has no effect.

34ec5ed275/setup.py (L1205)

This code was originally introduced in this PR: dd7cec680c,
and clang11 and later now support `-fstack-clash-protection`. Can we remove this line?

@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160515
Approved by: https://github.com/isuruf, https://github.com/albanD
2025-08-14 06:02:11 +00:00
639778b3ee [2/3 step][ vllm ci build setup] Add vllm build logic and dockerfile (#160089)
# set up vllm build logic
- dockerfile: note that the dockerfile introduced here is only temporary; once we migrate this file to vllm, we will fetch it directly from there
- VllmBuildRunner:
   - implement logic to prepare and run vllm build with dockerfile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160089
Approved by: https://github.com/huydhn
ghstack dependencies: #160043
2025-08-14 05:51:45 +00:00
00d7d6f123 [1/3][ghstack] [vllm ci build setup ]setup lumen_cli (#160043)
# Description
Set up torch_cli using argparse

## Details:
- add vllm placeholder in the cli
- add unittest for cli command

See Readme.md for how to run the cli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160043
Approved by: https://github.com/huydhn
2025-08-14 05:51:45 +00:00
c6d78d4dbd [ROCm] enable miopen channels last 3d for conv and batchnorm (#160529)
miopen batchnorm for channels last is guarded by env var PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM similar to existing PYTORCH_MIOPEN_SUGGEST_NHWC for conv.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160529
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-14 05:30:19 +00:00
2898d3f965 [Lowering] Add assertion msg to sym_size and sym_stride (#160591)
Summary: Add assertion msg to sym_size and sym_stride lowering function.

Test Plan:
Will test in mast job.

Rollback Plan:

Differential Revision: D80187693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160591
Approved by: https://github.com/angelayi
2025-08-14 04:55:32 +00:00
34358f335d [vllm hash update] update the pinned vllm hash (#160594)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160594
Approved by: https://github.com/pytorchbot
2025-08-14 04:21:28 +00:00
fe3f5fe4ea Optimize min, max gradient behavior description (#160312)
Fixes #160273

## Test Result
<img width="897" height="593" alt="image" src="https://github.com/user-attachments/assets/6ebcdb2c-8a2c-4f0d-8195-656089e88325" />
<img width="985" height="653" alt="image" src="https://github.com/user-attachments/assets/606a7264-e223-4d2b-8c3f-f153ce43b208" />
<img width="903" height="607" alt="image" src="https://github.com/user-attachments/assets/0ae2f56f-820f-4194-b15c-a02a078c0487" />
<img width="903" height="607" alt="image" src="https://github.com/user-attachments/assets/79c38a17-45ac-4808-829f-d538178de36b" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160312
Approved by: https://github.com/ngimel
2025-08-14 04:18:49 +00:00
45ba7ecda8 Flex Attention heuristics: a Blackwell config (#160192)
Fixes #160074 and more.

This is the working config for B200 and RTX 5080.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160192
Approved by: https://github.com/drisspg
2025-08-14 03:47:02 +00:00
194fcfcfbd Add support for param mutation under inference mode (#159661)
Summary:
In HF model rwkv, we have parameter mutation under inference mode which should be safe. This PR does multiple things to make sure it works:
1. We execute global autograd mutation while tracing so that we can actually trace through parameter inplace mutation
2. Add support for parameter mutation under inference mode in AOTAutograd
3. Add support for parameter mutation under inference mode in export.

Test Plan:
test

Rollback Plan:

Differential Revision: D79460136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159661
Approved by: https://github.com/ydwu4
2025-08-14 03:34:04 +00:00
29d20d49f0 [cutlass] fix dictionary iteration error (#160552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160552
Approved by: https://github.com/henrylhtsang, https://github.com/jingsh
2025-08-14 03:23:46 +00:00
3faee0a631 Update nullcontext to return input args (#158776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158776
Approved by: https://github.com/zou3519
2025-08-14 03:02:44 +00:00
8cfaf51d4e Generalize support of background thread in pinned allocator (#160505)
# Motivation
https://github.com/pytorch/pytorch/pull/135524 only introduced background thread support for CUDA; this PR extends it to other backends such as XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160505
Approved by: https://github.com/albanD
2025-08-14 02:22:39 +00:00
af3cabc55d Wrap class definitions in set_fullgraph(False) in test_sort (#160331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160331
Approved by: https://github.com/zou3519
ghstack dependencies: #160216, #160217, #160276, #160278, #160330
2025-08-14 02:12:20 +00:00
74bbe7b4a3 Wrap class definitions in set_fullgraph(False) in test_math/cmath (#160330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160330
Approved by: https://github.com/zou3519
ghstack dependencies: #160216, #160217, #160276, #160278
2025-08-14 02:12:20 +00:00
7bfc424a61 Wrap class definitions in set_fullgraph(False) in test_iter (#160278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160278
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #160216, #160217, #160276
2025-08-14 02:12:20 +00:00
5ace061254 finfo eps doc fix (#160502)
Existing documentation for torch.finfo().eps is as below:
| eps             | float  | The smallest representable number such that ``1.0 + eps != 1.0``.          |

Proposed documentation for torch.finfo().eps is as below:
| eps             | float  | The difference between 1.0 and the next smallest representable float larger than 1.0.	|
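
The difference between the two wordings can be checked with Python's built-in float (float64). This is a sketch using the standard library rather than `torch.finfo`, and `math.nextafter` requires Python 3.9 or later.

```python
import math
import sys

eps = sys.float_info.epsilon                     # float64 machine epsilon
next_after_one = math.nextafter(1.0, math.inf)   # next representable float above 1.0
gap = next_after_one - 1.0                       # matches the proposed definition

# A value smaller than eps that still changes 1.0 when added,
# which is why the existing wording is inaccurate.
smaller = 0.75 * eps
still_changes_one = (1.0 + smaller) != 1.0
```

Here `gap` equals `eps`, while `smaller` is below `eps` and `1.0 + smaller` still rounds up to the next float.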

Fixes #160397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160502
Approved by: https://github.com/ngimel
2025-08-14 01:49:35 +00:00
15e49f6164 Factor out the strings to templates for better editor integration (#160357)
# Summary

More code motion. TL;DR: install 'Better Jinja' in VS Code and you get syntax highlighting.

Before
<img width="776" height="926" alt="Screenshot 2025-08-11 at 2 41 08 PM" src="https://github.com/user-attachments/assets/10868b31-f8ac-4cf5-99fe-19b8789ce06b" />

After:
<img width="1184" height="1299" alt="Screenshot 2025-08-11 at 2 40 27 PM" src="https://github.com/user-attachments/assets/45203765-589e-4d76-8196-d895a2f2fbf6" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160357
Approved by: https://github.com/eellison
2025-08-14 01:07:53 +00:00
dd21c8a578 refresh expected results (#160537)
Regression introduced by https://github.com/pytorch/pytorch/pull/160314.
Not too worried about it, since it did not affect other inductor benchmarks; could not repro locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160537
Approved by: https://github.com/eellison
2025-08-14 00:56:14 +00:00
a06ec54d40 [MPS] Add API to query GPU core count (#160414)
Use good old IOKit to get the `gpu-core-count` property from the device implementing the `AGXAccelerator` service.
Expose it as `torch.backends.mps.get_core_count()` and make it accessible to the inductor via `MpsInterface`.

Test Plan: Run `python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"` and compare it to `system_profiler SPDisplaysDataType|head -n10`
```
% python3 -c "import torch;print(torch.backends.mps.get_name(), torch.backends.mps.get_core_count())"
Apple M1 Pro 16
% system_profiler SPDisplaysDataType|head -n10
Graphics/Displays:

    Apple M1 Pro:

      Chipset Model: Apple M1 Pro
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 16
      Vendor: Apple (0x106b)
      Metal Support: Metal 3
```

This would significantly improve occupancy for torch.compile generated kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160414
Approved by: https://github.com/dcci
2025-08-14 00:05:17 +00:00
50a8c11875 Add getCurrentDeviceIndex to torch::stable::accelerator (#160453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160453
Approved by: https://github.com/janeyx99
ghstack dependencies: #159679
2025-08-13 23:42:24 +00:00
e4e4dbd2f8 Add beginnings of torch::stable::accelerator (#159679)
Adds
- `torch::stable::accelerator::DeviceGuard`: `std::unique_ptr` to `DeviceGuardOpaque` mostly copied from the below (but made generic)

   50eac811a6/torch/csrc/inductor/aoti_runtime/utils_cuda.h (L30-L46)
    - constructor `DeviceGuard(DeviceIndex)` (**this matches aoti but differs from the actual c10 DeviceGuard constructor that takes in device**)
    - `set_index(DeviceIndex)`
- `torch::stable::accelerator::Stream`: `std::shared_ptr` to `StreamOpaque`
     - constructor `Stream(StreamHandle stream)` (similar to torch::stable::Tensor)
     - `id() -> StreamId`

- `getCurrentStream(DeviceIndex device_index) -> stable::accelerator::Stream`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159679
Approved by: https://github.com/guangyey, https://github.com/janeyx99
2025-08-13 23:42:24 +00:00
d670304001 [ATen][CUDA] Use new CCCL API in v2.8 (#160554)
Silences deprecation warnings like:
```
In file included from tmpxft_003a195d_00000000-6_Nonzero.cudafe1.stub.c:1:
/tmp/tmpxft_003a195d_00000000-6_Nonzero.cudafe1.stub.c: At global scope:
/tmp/tmpxft_003a195d_00000000-6_Nonzero.cudafe1.stub.c:243:219: warning: 'template<class ValueType, class OffsetT> class at_cuda_detail::cub::CountingInputIterator' is deprecated: Use thrust::counting_iterator instead [-Wdeprecated-declarations]
  243 | static void __device_stub__ZN2at6native43_GLOBAL__N__3cee4041_10_Nonzero_cu_cba1aaa011flag_kernelILi512ELi16EhEEvPKT1_PlPKllli( const _ZN3c104impl20ScalarTypeToCPPTypeTILNS_10ScalarTypeE0EEE *__par0,  int64_t *__par1,  const int64_t *__par2,  int64_t __par3,  int64_t __par4,  int __par5) {  __cudaLaunchPrologue(6); __cudaSetupArgSimple(__par0, 0UL); __cudaSetupArgSimple(__par1, 8UL); __cudaSetupArgSimple(__par2, 16UL); __cudaSetupArgSimple(__par3, 24UL); __cudaSetupArgSimple(__par4, 32UL); __cudaSetupArgSimple(__par5, 40UL); __cudaLaunch(((char *)((void ( *)(const _ZN3c104impl20ScalarTypeToCPPTypeTILNS_10ScalarTypeE0EEE *, int64_t *, const int64_t *, int64_t, int64_t, int))at::native::_NV_ANON_NAMESPACE::flag_kernel<(int)512, (int)16, unsigned char> ))); }namespace at{
      |                                                                                                                                                                                                                           ^~~~~~~~~~~~~~~~~~~~~
/usr/local/cuda-12.9/include/cub/iterator/counting_input_iterator.cuh:93:63: note: declared here
   93 | class CCCL_DEPRECATED_BECAUSE("Use thrust::counting_iterator instead") CountingInputIterator
      |                                                               ^~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160554
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/atalman
2025-08-13 23:15:53 +00:00
c5efc5c8a6 Fix unit test test_equivalent_template_code (#160432)
Summary:
Fix unit test test_equivalent_template_code

https://github.com/pytorch/pytorch/pull/159920 treats ReinterpretView as a not-realized node when searching FX origin nodes for a fused triton kernel. In test_equivalent_template_code, there is a transpose node (which is a ReinterpretView) before the matmul. It was not in the FX graph segment before PR 159920. FX origin nodes are used to define the name of the triton kernel, which is why test_equivalent_template_code failed with PR 159920: it uses a hard-coded triton kernel name to check the result. The fix is to update the triton kernel name in the unit test.

Test Plan:
buck2 run mode/opt caffe2/test/inductor:benchmark_fusion -- caffe2.test.inductor.test_benchmark_fusion.BenchmarkMultiTemplateFusionCudaTest

Rollback Plan:

Differential Revision: D80101711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160432
Approved by: https://github.com/clee2000
2025-08-13 23:14:51 +00:00
6da11d9aaf [C10D] Add check_rng_sync util (#160283)
Debugs RNG desync by checking the current state on each rank in the group and summarizing the differences if any are detected.

Notes:
- used allgather instead of gather since it's simpler to do this SPMD rather than add conditional behavior, though I could be convinced we only want to log on rank0.

Usage:
`check_rng_sync(generator, group)`

Prints something like this:

(cuda):
```
[rank0]:E0808 ] Generator desync detected:
[rank0]:E0808 ] Ranks    (Seed, Offset) values
[rank0]:E0808 ] -------  -----------------------
[rank0]:E0808 ] 0        (456, 0)
[rank0]:E0808 ] 1        (123, 4)
[rank0]:E0808 ] 2-3      (123, 0)
```

(cpu):
```
[rank2]:E0810 ] Generator desync detected:
[rank2]:E0810 ] Ranks      Generator State Hash values
[rank2]:E0810 ] -------  -----------------------------
[rank2]:E0810 ] 0                  7633364531954955665
[rank2]:E0810 ] 1                  8807615394212033278
[rank2]:E0810 ] 2-3               -6150027303226666531
```
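The grouped summary above, where ranks that share a state are collapsed into ranges such as `2-3`, can be sketched in plain Python. This is only an illustration, not the code from the PR: `summarize_rank_states` and `format_ranks` are hypothetical names, and the per-rank states are passed in as a list instead of being gathered with a collective.

```python
from collections import defaultdict


def format_ranks(ranks):
    """Format a sorted list of ranks, collapsing contiguous runs (e.g. [2, 3] -> '2-3')."""
    parts = []
    start = prev = ranks[0]
    for r in ranks[1:]:
        if r == prev + 1:
            prev = r
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = r
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def summarize_rank_states(states):
    """Group ranks by identical state; states[i] is the (hashable) state of rank i."""
    by_state = defaultdict(list)
    for rank, state in enumerate(states):
        by_state[state].append(rank)
    # Order rows by the lowest rank in each group so the table reads top to bottom.
    groups = sorted(by_state.items(), key=lambda kv: kv[1][0])
    return [(format_ranks(ranks), state) for state, ranks in groups]
```

A desync is present whenever the returned list has more than one row.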

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160283
Approved by: https://github.com/ezyang
2025-08-13 23:05:29 +00:00
182efe31db [inductor] add lowering for repeat_interleave.Tensor with output size specified (#147160) (#158462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158462
Approved by: https://github.com/eellison
2025-08-13 22:54:18 +00:00
1ea688f9a2 [dynamo] fix EXTENDED_ARG starts_line dropping bug (#160478)
Fixes https://github.com/pytorch/pytorch/issues/160471

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160478
Approved by: https://github.com/Lucaskabela, https://github.com/billmguo
2025-08-13 22:27:40 +00:00
53e3949495 [MTIA-T][CFF] Pass backend parameter into GPU vertical pass file and pattern matcher (#160404)
Summary:
As titled
Please see https://fb.workplace.com/groups/1075192433118967/posts/1735215827116621/?comment_id=1735220747116129&reply_comment_id=1735242997113904

Basically, for MTIA, we want mtia_afg to show up in the counters and backend, instead of Inductor. MTIA is not using inductor yet. Using env var TORCHINDUCTOR_PATTERN_MATCH_BACKEND to pass in the actual backend.

The env var default value is "inductor", so nothing should break for GPU.
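
The fallback described above can be sketched as follows. The helper name `pattern_match_backend` is hypothetical and only illustrates the "unset means inductor" behaviour; how PyTorch itself reads the variable is not shown in this message.

```python
import os


def pattern_match_backend(environ=os.environ):
    """Return the backend name, falling back to "inductor" when the variable is unset."""
    return environ.get("TORCHINDUCTOR_PATTERN_MATCH_BACKEND", "inductor")
```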

Test Plan:
Default is always "inductor", so existing test should not break.

CI tests

Rollback Plan:

Differential Revision: D80069072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160404
Approved by: https://github.com/BoyuanFeng
2025-08-13 22:24:27 +00:00
33d9401866 Revert "[BE][Dynamo] Type improvements in _dynamo/utils to generics (#159824)"
This reverts commit 3ef2e1ef769582a82c6ddf150e9d11bf4bf1c44f.

Reverted https://github.com/pytorch/pytorch/pull/159824 on behalf of https://github.com/clee2000 due to I think this broke dynamo/test_trace_rules.py::TraceRuleTests::test_almost_impossible_missing_name [GH job link](https://github.com/pytorch/pytorch/actions/runs/16948305999/job/48035192324) [HUD commit link](3ef2e1ef76) ([comment](https://github.com/pytorch/pytorch/pull/159824#issuecomment-3186003531))
2025-08-13 22:17:29 +00:00
d1950d4bb5 Change IR node's stack trace to be computed lazily (#160487)
Summary: When an IR node is an inherited class, post_init is called once for each super().__init__() call. To avoid duplicated calls, we make stack trace computation happen lazily.
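
The general lazy pattern can be sketched as below. This is not the inductor implementation: `LazyStackTraceNode` is a hypothetical class, and the sketch only shows that the string formatting runs once, on first access, no matter how often the property is read.

```python
import traceback


class LazyStackTraceNode:
    """Hypothetical node that formats its stack trace on first access only."""

    format_calls = 0  # counts how often the formatting step actually runs

    def __init__(self):
        # Capture the raw frames now; defer turning them into a string.
        self._frames = traceback.extract_stack(limit=8)
        self._formatted = None

    @property
    def stack_trace(self):
        if self._formatted is None:
            type(self).format_calls += 1
            self._formatted = "".join(traceback.format_list(self._frames))
        return self._formatted
```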

Test Plan:
CI

Rollback Plan:

Differential Revision: D80137870

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160487
Approved by: https://github.com/angelayi
2025-08-13 21:41:25 +00:00
1196bb1c2e Add utility to get computed kernel in torch.library (#158393)
Adds `OperatorEntry::getComputedKernelForDispatchKey` which returns the KernelFunction corresponding to `OperatorEntry.dispatchTable_[dispatch_ix]` for a given dispatch key
- Specifically it returns a `SafeKernelFunction` that holds a `KernelToken`. This `KernelToken` is registered to the `KernelFunction` in `OperatorEntry.kernels_` and will be invalidated when the `KernelFunction` is destructed (i.e. when the `AnnotatedKernel` that holds this `KernelFunction` is removed from `kernels_`, which happens when the corresponding impl is deregistered).
- `SafeKernelFunction` can be called via `callBoxed`; the validity of the token will be checked before this happens
- `SafeKernelFunction` is pybinded and `getComputedKernelForDispatchKey` is exposed to the frontend via `torch.library.get_kernel`
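
The token-invalidation idea in the list above can be illustrated with a small Python sketch. The classes below are hypothetical stand-ins for the C++ `KernelToken` and `SafeKernelFunction`, not their actual API; they only show a handle that refuses to call a kernel once its owner has invalidated the token.

```python
class KernelToken:
    """Shared flag that the owner flips when the kernel goes away."""

    def __init__(self):
        self._valid = True

    def invalidate(self):
        self._valid = False

    def is_valid(self):
        return self._valid


class SafeKernel:
    """Callable handle that checks its token before every call."""

    def __init__(self, fn, token):
        self._fn = fn
        self._token = token

    def call_boxed(self, *args):
        if not self._token.is_valid():
            raise RuntimeError("kernel has been deregistered")
        return self._fn(*args)
```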

Related to https://github.com/pytorch/pytorch/issues/155330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158393
Approved by: https://github.com/albanD
2025-08-13 21:00:59 +00:00
e9eb2096a5 [cutlass backend] Allow bmm use cases when batch stride is 0 (#160356)
Differential Revision: [D80035771](https://our.internmc.facebook.com/intern/diff/D80035771/)

The original change reduced the number of parameters we pass into the kernel, which was motivated by aesthetic reasons only.

But seeing the need to use a different batch stride, we should just pass in the batch stride. That would be a good long term fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160356
Approved by: https://github.com/mlazos
2025-08-13 20:52:24 +00:00
3ef2e1ef76 [BE][Dynamo] Type improvements in _dynamo/utils to generics (#159824)
Follow up to #159580

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159824
Approved by: https://github.com/williamwen42
2025-08-13 20:17:01 +00:00
4cde0acc0e Make triton build ROCm library version-agnostic (#158408)
Fixes maintenance of the triton packaging script when library versions change from one ROCm version to the next.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158408
Approved by: https://github.com/jeffdaily

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
2025-08-13 19:49:23 +00:00
70ccdec44b [ROCm] Improve reduction sum performance (#160466)
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128

**Reproducer:**
```
import time
import torch

shapes = [
    (5079670, 128)
]

dims = [
    (1)
]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.float)
    for _ in range(10):
        w = torch.sum(x, dims[i])
    torch.cuda.synchronize()
    print(w.size())

    start_time = time.time()
    for _ in range(50):
        _ = torch.sum(x, dims[i])
    torch.cuda.synchronize()
    end_time = time.time()
    mean_time = (end_time - start_time)/50
    print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us")
```

**Before (MI300X):**
Avg time for shape (5079670, 128): 1629.99 us

**After (MI300X)**
Avg time for shape (5079670, 128): 1008.59 us

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160466
Approved by: https://github.com/petrex, https://github.com/jeffdaily
2025-08-13 18:46:58 +00:00
db0b7f1cc9 [BE][CI] Adjust error_inputs for cat and complex (#160378)
MPS backend does not support double, so errors should be different
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160378
Approved by: https://github.com/dcci
2025-08-13 18:35:06 +00:00
1c26c53851 Fix the Doc of pivot in torch.lu (#159617)
Fixes #159616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159617
Approved by: https://github.com/lezcano, https://github.com/jansel
2025-08-13 18:30:54 +00:00
adcca7d9a1 Do not rpath CUDA stubs folder in JIT generated code (#160179)
`_transform_cuda_paths` intentionally includes the CUDA stubs folder.

However this path must not be added to the rpath as otherwise any CUDA command will fail at runtime with
> CUDA_ERROR_STUB_LIBRARY: "CUDA driver is a stub library"

This results in e.g. non-descriptive errors like
```
cutlass_library/source/tools/util/include/cutlass/util/device_memory.h:67  cutlass::device_memory::allocate: cudaMalloc failed: bytes=4096
terminate called after throwing an instance of 'cutlass::cuda_exception'
  what():  std::exception
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160179
Approved by: https://github.com/jansel
2025-08-13 18:29:24 +00:00
01584d2a7d [ROCm] remove extra transposes in NHWC convolutions on MIOpen (#160435)
remove aten::contiguous for NHWC convolutions on ROCm

Tests:
- nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float32
- nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float16

Before:
<img width="1255" height="228" alt="image"
src="https://github.com/user-attachments/assets/b125ccab-00c2-4d3a-a341-4583e51d8d57" />

After:
<img width="874" height="153" alt="image"
src="https://github.com/user-attachments/assets/ec200754-3622-488e-8762-bff1c2d22818" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160435
Approved by: https://github.com/jeffdaily
2025-08-13 17:58:22 +00:00
87e6c4079d Fix the Doc issue on the description of edge_order in torch.gradient() (#159130)
Fixes #159129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159130
Approved by: https://github.com/soulitzer
2025-08-13 16:48:47 +00:00
7d87e358ac Fix MPS conv3d autocast bias dtype mismatch (#160423)
## Summary
- register conv3d with MPS autocast to ensure bias dtypes match under AMP
- add regression test chaining two Conv3d layers on MPS autocast

Written by Codex, see https://chatgpt.com/codex/tasks/task_e_689b64192df883278648935963d2776d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160423
Approved by: https://github.com/dcci
2025-08-13 16:23:21 +00:00
6ee175195a [DCP][OSS] Rank local checkpointing in DCP without collectives (#147758)
Summary:
DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing which basically saves and loads the checkpoint without any collective. The trade off for now is the dedupe and re-sharding. Support for these would be introduced soon.

Differential Revision: D70112642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147758
Approved by: https://github.com/meetv18
2025-08-13 16:20:28 +00:00
db32b60662 [ci] Add riscv opt-int build (#143979)
Hi, @malfet
Based on the previous discussion:

[RISCV CI support · Issue #141550 · pytorch/pytorch](https://github.com/pytorch/pytorch/issues/141550)

I have cross-compiled PyTorch for the RISC-V architecture on x86_64 Ubuntu 24.04 and created a new PR for it. Could you please help review it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143979
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-13 16:12:02 +00:00
56c828bef9 Followup of #160002, gracefully fail if Triton functions don't contain attributes (#160436)
Summary: Fixes internal test failures of D80037015

Test Plan:
CI

Rollback Plan:

Differential Revision: D80094187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160436
Approved by: https://github.com/clee2000
2025-08-13 16:04:56 +00:00
a2fd106d67 guard cuMulticastUnbind call (#160499)
Fixes builds for old compilers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160499
Approved by: https://github.com/Skylion007
2025-08-13 15:45:51 +00:00
c656334120 Revert "Factor out the strings to templates for better editor integration (#160357)"
This reverts commit cbffde774557752cf20447d42d99ec6102673c31.

Reverted https://github.com/pytorch/pytorch/pull/160357 on behalf of https://github.com/clee2000 due to broke a bunch of internal builds due to not being able to find the file  No such file or directory: torch/_inductor/kernel/flex/templates/flex_decode.py.jinja D80145761, might need a buck targets change? ([comment](https://github.com/pytorch/pytorch/pull/160357#issuecomment-3184435581))
2025-08-13 15:40:50 +00:00
31c9ac4319 [c10d] Fix test test_nccl_user_buffer_registration (#160497)
Fixed `test_nccl_user_buffer_registration`, which broke due to https://github.com/pytorch/pytorch/pull/160145; somehow CI didn't capture it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160497
Approved by: https://github.com/ngimel
2025-08-13 15:29:41 +00:00
deea71a90e [ez][CI] Set timeout for linux-jammy-py3_13-clang12-test from 600min -> default val of 240 (#160500)
10 hours is very long
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160500
Approved by: https://github.com/huydhn
2025-08-13 15:14:24 +00:00
114a6c4043 Add placeholder for the User Guide (#159379)
- Add pytorch_overview.md
- Add pytorch_main_components.md
- Reorganize top nav to have Get Started, User Guide, Reference API, Community, Tutorials
- Move notes under user guide

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159379
Approved by: https://github.com/albanD

Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-13 14:56:04 +00:00
ee1b0412b9 [1/N]Port 3 distributed/_tools test cases to Intel GPU (#159543)
For [#114850](https://github.com/pytorch/pytorch/issues/114850), we will port distributed tests to Intel GPU.

We could enable Intel GPU with the following methods while trying our best to keep the original code style:

1. use "torch.accelerator.current_accelerator()" to determine the accelerator backend
2. enable XPU for some test paths
3. skip some test cases which Intel GPU does not support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159543
Approved by: https://github.com/guangyey, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-08-13 12:49:01 +00:00
42e51cd4b3 Support ddp zero hook XCCL path (#159240)
The XCCL backend does not have the https://github.com/pytorch/pytorch/issues/62300 issue, so add an XCCL path here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159240
Approved by: https://github.com/guangyey, https://github.com/Skylion007, https://github.com/EikanWang
2025-08-13 12:37:33 +00:00
96bd33b2de Fix get_free_symbol_uses for several nodes (#160314)
get_free_symbol_uses is used to determine which unbacked symbols are used by a given node.
Not having get_free_symbol_uses defined correctly leads to:

- elimination of some nodes because no users are detected (see the added unit test).
- incorrect topological sort.

Fix get_free_symbol_uses for NopKernel, ConcatKernel, InputsKernel, and external kernels.
ComputedBuffer with NonOwningLayout is an interesting case:
when the layout is NonOwningLayout, we need to access the base layout of the actual view op and
detect the symbols in it, because we use those symbols when we codegen the ComputedBuffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160314
Approved by: https://github.com/eellison
2025-08-13 12:28:29 +00:00
ecde76c764 [Hierarchical Compile] Sort all regions identically (#158814)
Previously we would topologically sort each region individually. This works well unless some nodes have no arguments, in which case their order may change. To rectify this, we sort the first region as the reference region and use that sort order to sort the remaining regions.
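
The "sort the reference region, then reuse its order" idea can be sketched as below. This is an assumption-laden illustration, not the dynamo code: `sort_regions_identically` is a hypothetical helper, regions are plain lists whose entries correspond by position, and an ordinary sort key stands in for the topological sort.

```python
def sort_regions_identically(regions, key):
    """Sort the first region by `key` and apply the same permutation to every region."""
    reference = regions[0]
    # Permutation that sorts the reference region.
    order = sorted(range(len(reference)), key=lambda i: key(reference[i]))
    # Reuse that permutation so corresponding nodes stay aligned across regions.
    return [[region[i] for i in order] for region in regions]
```

Because only the reference region is consulted for ordering, corresponding nodes end up at the same position in every region.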

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158814
Approved by: https://github.com/williamwen42
2025-08-13 11:55:23 +00:00
34ec5ed275 [Dynamo][Hierarchical Compile] Allow parameters to be propagated to submodules (#157979)
Fixes issue with HF Gen AI models where we mark a param as static and a get_attr node gets put in the region.

The effect of this is lifting get_attr nodes to be inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157979
Approved by: https://github.com/williamwen42
2025-08-13 09:12:10 +00:00
641ee74781 Revert "Add label_smoothing param in nn.BCELoss and nn.BCEWithLogitsLoss (#150282)"
This reverts commit f990490a23815ea6ee27e487c70ba2cf513ba43d.

Reverted https://github.com/pytorch/pytorch/pull/150282 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150282#issuecomment-3182844949))
2025-08-13 09:01:52 +00:00
6e8865fbc1 port 3 distributed test to Intel GPU and unified some common functions (#158533)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU.
We could enable Intel GPU with the following methods while trying our best to keep the original code style:

- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- enable XPU for some test paths
- Unify some common code under torch/testing/_internal for multiple backends, for example:
  - requires_nccl_version
  - _dynamo_dist_per_rank_init
  - DynamoDistributedSingleProcTestCase
  - DistTestCases
  - FSDPTestMultiThread

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158533
Approved by: https://github.com/guangyey, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-08-13 08:13:23 +00:00
9a06e6d031 [claude-code] Add top-level module doc for torch/distributed/tensor/_op_schema.py (#157804)
Not sure how good the description is, seeking insight from maintainers.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157804
Approved by: https://github.com/wanchaol
2025-08-13 07:27:11 +00:00
6ea8376f84 Enable XPU for test_autograd_function.py (#160309)
# Description
Fixes #114850. We will port dynamo tests to Intel GPU.
We could enable Intel GPU with the following methods while trying our best to keep the original code style:

# Changes
1. Get device type from get_devtype() method.
2. Replace the requires_cuda_and_triton with requires_gpu.
3. Add HAS_XPU_AND_TRITON into the scope.

# Notify

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160309
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-08-13 06:38:34 +00:00
8eee08d227 Replace TORCH_INTERNAL_ASSERT with TORCH_CHECK (#160411)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160411
Approved by: https://github.com/ezyang
2025-08-13 06:31:10 +00:00
e497620260 Add compile_id: Optional[CompileID] to torch._logging._internal.trace_structured_artifact (#160440)
Context:
When writing a custom `torch.compile` backend, I quite frequently (ab)use `trace_structured_artifact` because I'm too lazy to customize tlparse (ref: 6d8b13c867).

I recently noticed that some of the artifacts I want to store are generated where the CompileID cannot be correlated, and the `tlparse` HTML says
> Sometimes, logs are made without a compile id. This makes it difficult to correlate related logs. This stack trie shows all places where log entries occurred without compile context; to fix, look an appropriate place in the stack where compile id should have been specified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160440
Approved by: https://github.com/ezyang
2025-08-13 06:28:23 +00:00
199e9abb6a [fx] fix split_module with symint (#160093)
Fixes https://github.com/pytorch/pytorch/issues/155220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160093
Approved by: https://github.com/ezyang
2025-08-13 05:50:15 +00:00
685f15dbea [vllm hash update] update the pinned vllm hash (#160484)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160484
Approved by: https://github.com/pytorchbot
2025-08-13 04:54:03 +00:00
85db508af5 Wrap class definitions in set_fullgraph(False) in test_int/bool/float/complex (#160276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160276
Approved by: https://github.com/zou3519
ghstack dependencies: #160216, #160217
2025-08-13 04:53:03 +00:00
27156ec804 Wrap class definitions in set_fullgraph(False) in test_operator (#160217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160217
Approved by: https://github.com/zou3519
ghstack dependencies: #160216
2025-08-13 04:53:03 +00:00
6746bc59df Wrap class definitions in set_fullgraph(False) in test_set (#160216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160216
Approved by: https://github.com/zou3519
2025-08-13 04:53:03 +00:00
3008d985a8 [CD] Do not build pytorch with nvshmem on ARM (#160465)
As nvshmem binary from 3.3.9 is not compatible with manylinux2_28, and 3.3.20 is not available for download yet
Also, package nvshmem binary into full wheel

Fixes https://github.com/pytorch/pytorch/issues/160425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160465
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-08-13 04:10:43 +00:00
652a6f5954 Revert "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#160403)"
This reverts commit 5a9c4cfce42b9eb87da0de40c5633f083115c307.

Reverted https://github.com/pytorch/pytorch/pull/160403 on behalf of https://github.com/malfet due to It indeed consistently broken inductor, see 118bc97b14/1 ([comment](https://github.com/pytorch/pytorch/pull/160403#issuecomment-3182101130))
2025-08-13 04:05:46 +00:00
118bc97b14 Write full tensors out at once in HF consolidation script (#159394)
Not all storage systems support writing at random offsets. This PR changes the consolidation script to write each tensor to a buffer and then write out the buffer, going sequentially through every tensor in the output file. This also helps when the sharded files weren't sharded only along the row-wise dimension: small writes are expensive, and we were issuing one write per chunk, where a chunk is the largest run of contiguous bytes in the final tensor, which can be only a few bytes for col-wise sharding. Now the full tensor is assembled before the write, which reduces the number of small writes.

Differential Revision: [D78684452](https://our.internmc.facebook.com/intern/diff/D78684452/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159394
Approved by: https://github.com/saumishr
ghstack dependencies: #159392, #159393
2025-08-13 03:51:16 +00:00
305fa22393 [GHF] Remove app { name databaseId} query (#160494)
From the `PRCheckSuites` fragment, as it causes a security exception when used with the new GITHUB_TOKEN, which looks as follows
```
RuntimeError: GraphQL query
fragment PRReviews on PullRequestReviewConnection {
  nodes {
    author {
      login
    }
    bodyText
    createdAt
    authorAssociation
    editor {
      login
    }
    databaseId
    url
    state
  }
  pageInfo {
    startCursor
    hasPreviousPage
  }
}

fragment PRCheckSuites on CheckSuiteConnection {
  edges {
    node {
      app {
        name
        databaseId
      }
      workflowRun {
        workflow {
          name
          databaseId
        }
        databaseId
        url
      }
      checkRuns(first: 50) {
        nodes {
          name
          conclusion
          detailsUrl
          databaseId
          title
          summary
        }
        pageInfo {
          endCursor
          hasNextPage
        }
      }
      conclusion
    }
    cursor
  }
  pageInfo {
    hasNextPage
  }
}

fragment CommitAuthors on PullRequestCommitConnection {
  nodes {
    commit {
      authors(first: 2) {
        nodes {
          user {
            login
          }
          email
          name
        }
      }
      oid
    }
  }
  pageInfo {
    endCursor
    hasNextPage
  }
}

query ($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    pullRequest(number: $number) {
      closed
      isCrossRepository
      author {
        login
      }
      title
      body
      headRefName
      headRepository {
        nameWithOwner
      }
      baseRefName
      baseRefOid
      baseRepository {
        nameWithOwner
        isPrivate
        defaultBranchRef {
          name
        }
      }
      mergeCommit {
        oid
      }
      commits_with_authors: commits(first: 100) {
        ...CommitAuthors
        totalCount
      }
      commits(last: 1) {
        nodes {
          commit {
            checkSuites(first: 10) {
              ...PRCheckSuites
            }
            status {
              contexts {
                context
                state
                targetUrl
              }
            }
            oid
          }
        }
      }
      changedFiles
      files(first: 100) {
        nodes {
          path
        }
        pageInfo {
          endCursor
          hasNextPage
        }
      }
      reviews(last: 100) {
        ...PRReviews
      }
      comments(last: 5) {
        nodes {
          bodyText
          createdAt
          author {
            login
          }
          authorAssociation
          editor {
            login
          }
          databaseId
          url
        }
        pageInfo {
          startCursor
          hasPreviousPage
        }
      }
      labels(first: 100) {
        edges {
          node {
            name
          }
        }
      }
    }
  }
}
, args {'name': 'pytorch', 'owner': 'pytorch', 'number': 159820} failed: [{'type': 'FORBIDDEN', 'path': ['repository', 'pullRequest', 'commits', 'nodes', 0, 'commit', 'checkSuites', 'edges', 4, 'node', 'app'], 'extensions': {'saml_failure': False}, 'locations': [{'line': 26, 'column': 7}], 'message': 'Resource not accessible by integration'}]
```
But the same query works fine if executed using one's Personal Access Token

Updated mocks file by running
```
sed -i -e s/a32a7ca3a2f6e2c9de07aef821b0111539758b4ac254f8a3432af32314f94876/8e262b0495bd934d39dda198d4c09144311c5ddd6cca6a227194bd48dbfe7201/ gql_mocks.json
sed -i -e s/157add81c519f614388f3a67e287bdf4fbb1791e6d0bffe312e169d02ac2813f/28349cb4c891bbf85255fab2c33c770baf77c3e02b29ca9a0e4c6c97bed041db/ gql_mocks.json
sed '/"app": {/,+3d' gql_mocks-orig.json >gql_mocks.json
sed '/"app": null/d' gql_mocks-orig.json >gql_mocks.json
```

Re-enable the offending jobs

Fixes https://github.com/pytorch/pytorch/issues/159894
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160494
Approved by: https://github.com/huydhn
ghstack dependencies: #160490, #160492
2025-08-13 03:46:39 +00:00
1151b40cbf [BE] Filter unused mocks (#160492)
Somebody checked in twice the number of mocks into the archive

Filter them out by running the following script
```python
import json
with open("gql_mocks-orig.json") as f:
    mocks = json.load(f)

keys = list(mocks.keys())
good_shas = {'a32a7ca3a2f6e2c9de07aef821b0111539758b4ac254f8a3432af32314f94876',
             '157add81c519f614388f3a67e287bdf4fbb1791e6d0bffe312e169d02ac2813f',
             '4715ed05b382e572135c049664939f22f9b1249bc0c499ae278d655ad8cb598b',
             'a91ab398f97fb43cbe6e0899980dad8ff7447457ea5a71bbc59f7702a9280eb5',
             'e5130469b5373479776bfbccade8039ce4741b97873bb3bec4e279fed08602be',
             '5dc32efeb8306f03744f6804ef4b500882f2759f7ac17fdc9f123669bfe4805a',
             '0a34acb829d8aca9dd28a8ba388dfa52f6ecdde7e903ace1caabdcfaba87de98',
             '8b50878b010492fe64005cc4b4ed34ac5f6695ce093f06b0d8d5403b7787c2c0',
             '2877b3b1e8630ca4ae797b9d85d5673d25ca8488c01141e11ff55f4a1359fca7'}
for k in keys:
    if any(sha in k for sha in good_shas):
        continue
    del mocks[k]

with open("gql_mocks.json","w") as f:
    json.dump(mocks, f, indent=2)
    f.write("\n")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160492
Approved by: https://github.com/huydhn
ghstack dependencies: #160490
2025-08-13 03:46:39 +00:00
d0f9785af3 [CI] Prevent accidental gql_mocks updates by test_trymerge (#160490)
As they can no longer be fetched from GitHub, see https://github.com/pytorch/pytorch/issues/160489
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160490
Approved by: https://github.com/huydhn
2025-08-13 03:46:32 +00:00
ba47821f52 [ROCm] Set thread_work_size to 16 for vectorized elementwise kernels for MI300X (#160444)
* thread_work_size of 16 is giving better perf with many workloads for MI300X

cherry-pick of fb81400d34

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160444
Approved by: https://github.com/jeffdaily
2025-08-13 03:41:25 +00:00
2c5e10a5fc Add new function consolidate_safetensors_files_on_every_rank for HF consolidation (#159393)
Currently we are only using rank-0 for HF consolidation. But we should be able to use every rank to consolidate the sharded files, which will speed up the consolidation by Nx (where N is the number of ranks). Adding a new method consolidate_safetensors_files_on_every_rank to do this.
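
One way to picture the per-rank split is a round-robin assignment of output files to ranks. The message does not say how the real function partitions work, so the scheme and the name `files_for_rank` below are assumptions made purely for illustration.

```python
def files_for_rank(files, rank, world_size):
    """Return the subset of files that `rank` consolidates (round-robin, hypothetical)."""
    # Sort first so every rank computes the same assignment without communicating.
    return sorted(files)[rank::world_size]
```

Since each rank derives its share from the same sorted list, no collective is needed to agree on the split.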

Differential Revision: [D79000720](https://our.internmc.facebook.com/intern/diff/D79000720/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159393
Approved by: https://github.com/saumishr
ghstack dependencies: #159392
2025-08-13 03:31:36 +00:00
355462e127 Add stable Tensor get_device_index, use more stable DeviceIndex (#160143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160143
Approved by: https://github.com/mikaylagawarecki
2025-08-13 03:27:10 +00:00
41673110cd [inductor] Windows inductor use intel-openmp. (#160258)
After some debugging, I found that PyTorch's torch_cpu.dll uses intel-openmp rather than MSVC openmp.
So, switch Windows inductor to intel-openmp.

It fixed: c8205cb354/test/inductor/test_aot_inductor.py (L2405-L2408)
<img width="896" height="230" alt="image" src="https://github.com/user-attachments/assets/273b00f8-7dc1-43c9-9b7f-752e16355a80" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160258
Approved by: https://github.com/ezyang
2025-08-13 02:36:19 +00:00
6be6d06295 Avoid potential deadlocks in host allocator (#159352)
# Motivation
This PR fixes a potential deadlock in the host allocator.
When calling `event->record(stream)`, the `record_stream` implementation may acquire the Python GIL.
In places such as 842cc77ab9/aten/src/ATen/cuda/CachingHostAllocator.cpp (L145-L151) and 842cc77ab9/aten/src/ATen/xpu/CachingHostAllocator.cpp (L22-L28), `record_stream` is invoked while holding the allocator lock.

To prevent deadlocks, we must ensure the locking order is:
**GIL → Allocator Lock**.
Reversing the order (**Allocator Lock → GIL**) can cause a deadlock.
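
The ordering rule above can be sketched in plain Python. This is only an illustration under stated assumptions: `gil_stand_in`, `allocator_lock`, `record_event` and `process_events` are hypothetical names, and a re-entrant lock stands in for the GIL. It is not the allocator code touched by this PR.

```python
import threading

# Stand-ins only: an RLock models a re-entrant "GIL", a Lock models the allocator lock.
gil_stand_in = threading.RLock()
allocator_lock = threading.Lock()
recorded = []


def record_event(stream):
    # Models a record call that may itself need the "GIL".
    with gil_stand_in:
        recorded.append(stream)


def process_events(stream):
    # Every path takes the "GIL" first and the allocator lock second,
    # so no thread can hold the allocator lock while waiting for the "GIL".
    with gil_stand_in:
        with allocator_lock:
            record_event(stream)


def run(num_threads=4):
    recorded.clear()
    threads = [
        threading.Thread(target=process_events, args=(i,))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(recorded)
```

If `process_events` took `allocator_lock` first, a second thread holding the "GIL" and waiting for the allocator lock could block it forever, which is the inversion described above.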
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159352
Approved by: https://github.com/cyyever, https://github.com/ezyang
2025-08-13 02:30:17 +00:00
f15ada5c6f Enable output padding when only outermost dim is dynamic (#159404)
Summary: When the shape of the output tensor has a dynamic outermost dim, the stride can still be padded to conform to the configured alignment if required.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79146886

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159404
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2025-08-13 01:28:22 +00:00
69a0a9aa7f [Inductor][Triton] Pass GPUTarget param to updated make_ir function (#160422)
Summary: A recent Triton commit changed `ASTSource.make_ir` to a 5-arg signature that includes a `GPUTarget`. We need to pass in this new argument.

Test Plan:
`buck2 test 'fbcode//mode/opt' -m ovr_config//triton:trunk  fbcode//caffe2/test/inductor:test_inductor_cuda -- triton_kernel`

Rollback Plan:

Reviewed By: davidberard98

Differential Revision: D80069909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160422
Approved by: https://github.com/davidberard98, https://github.com/mlazos
2025-08-13 01:27:57 +00:00
32099961d5 [EZ] Delete CircleCI case (#160479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160479
Approved by: https://github.com/izaitsevfb
ghstack dependencies: #160477
2025-08-13 01:19:09 +00:00
8d1cf52922 [EZ][BE] Remove unused conda-env-macOS-ARM64 (#160477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160477
Approved by: https://github.com/atalman
2025-08-12 23:41:25 +00:00
b1f43548ca [c10d] Error out the case when registering symmetric memory without eager init (#160145)
Instead of implicitly creating the nccl comm inside mem pool registration for symmetric memory, we now raise an error, so that only the eager init case, where the nccl comm is already initialized, is supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160145
Approved by: https://github.com/kwen2501
2025-08-12 23:25:04 +00:00
0d71ca2c46 [EZ] Replace pytorch-labs with meta-pytorch (#160459)
This PR replaces all instances of 'pytorch-labs' with 'meta-pytorch' in this repository now that the 'pytorch-labs' org has been renamed to 'meta-pytorch'

## Changes Made
- Replaced all occurrences of 'pytorch-labs' with 'meta-pytorch'
- Only modified files with extensions: .py, .md, .sh, .rst, .cpp, .h, .txt, .yml
- Skipped binary files and files larger than 1MB due to GitHub API payload limits in the script to cover all repos in this org. Will do a more manual second pass later to cover any larger files.

## Files Modified
This PR updates files that contained the target text.

Generated by automated script on 2025-08-12T20:41:29.888681+00:00Z
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160459
Approved by: https://github.com/huydhn, https://github.com/clee2000, https://github.com/atalman, https://github.com/malfet
2025-08-12 22:44:25 +00:00
5737372862 [CI] Switch ROCm MI300 GitHub Actions workflows from 2-GPU to 1-GPU runners (#158882)
Updated .github/actionlint.yaml to replace linux.rocm.gpu.mi300.2 with linux.rocm.gpu.mi300.1 in the supported runner list

Modified all affected workflows (inductor-perf-test-nightly-rocm.yml, inductor-periodic.yml, inductor-rocm-mi300.yml, and rocm-mi300.yml) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners

This should help increase available runners even with same number of CI nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158882
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-12 22:42:40 +00:00
2e4e5ab4be [MPS] Add mps keys to indices and values ops (#160223)
enable indices and values on sparse mps

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160223
Approved by: https://github.com/malfet
2025-08-12 22:08:44 +00:00
16d15445f8 Fullgraph graph capture with dynamo. (#159749)
Summary:
Following up on Avik's doc https://docs.google.com/document/d/11RW0Bbkp1QwFbEu8rCNW5d7wUFaEkxbL0uLyqcc2jTk/edit?tab=t.0

We are experimenting with a new API which utilizes torch.compile(fullgraph=True) and intend to use it to replace the old dynamo.export() API.

This PR adds a prototype for the API described in the doc.

Test Plan:
test_misc -- -k test_aot_capture

Rollback Plan:

Differential Revision: D79534608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159749
Approved by: https://github.com/tugsbayasgalan
2025-08-12 22:06:18 +00:00
101276f81b [BE] Save attributes for CppCompileError for pickling (#160294)
Differential Revision: [D79977408](https://our.internmc.facebook.com/intern/diff/D79977408/)

Context:
When testing the cutlass backend using autotune with subproc, I would sometimes see a C++ compilation error (expected) followed by
```
Traceback (most recent call last):
  File "/torch/_inductor/autotune_process.py", line 175, in get
    result = TuningProcess.recv(self.read_pipe)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/torch/_inductor/autotune_process.py", line 99, in recv
    return pickle.load(read_pipe)
           ^^^^^^^^^^^^^^^^^^^^^^
TypeError: CppCompileError.__init__() missing 1 required positional argument: 'output'
```
which is unexpected. After asking Claude, it seems

> Now I can see the issue. The `CppCompileError` class requires two arguments: `cmd` (a list of strings) and `output` (a string). However, when exceptions are being pickled and unpickled across process boundaries, the pickling process might not be preserving the constructor arguments correctly.
>
> The problem is likely that when a `CppCompileError` is raised in the subprocess and then pickled/unpickled through the `recv` function, the unpickling process is trying to reconstruct the exception but doesn't have the required constructor arguments.
>
> The issue is clear now. The `CppCompileError` class doesn't have custom pickle methods (`__reduce__`, `__getstate__`, `__setstate__`), so when it's pickled and unpickled across process boundaries, Python's default pickling mechanism tries to reconstruct it but fails because it doesn't preserve the constructor arguments properly.
>
> The solution is to add a `__reduce__` method to the `CppCompileError` class to ensure it can be properly pickled and unpickled. Let me implement this fix:

Adding these seems to help.
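
A minimal sketch of the failure mode and of the `__reduce__` approach, using only the standard library. `BrokenCompileError` and `PicklableCompileError` are hypothetical stand-ins with the same two-argument constructor shape as described above; they are not the inductor classes, and the actual change in this PR may differ.

```python
import pickle


class BrokenCompileError(Exception):
    """Hypothetical stand-in: two required arguments, one message passed to the base class."""

    def __init__(self, cmd, output):
        # Exception.args becomes a single formatted string, so the default
        # unpickling path calls BrokenCompileError(<message>) and fails.
        super().__init__(f"command {' '.join(cmd)} failed:\n{output}")
        self.cmd = cmd
        self.output = output


class PicklableCompileError(BrokenCompileError):
    def __reduce__(self):
        # Rebuild from the original constructor arguments rather than from self.args.
        return (self.__class__, (self.cmd, self.output))


def round_trip(exc):
    # Same pickle dump/load cycle an exception goes through when sent over a pipe.
    return pickle.loads(pickle.dumps(exc))
```

Round-tripping a `BrokenCompileError` raises a `TypeError` about the missing `output` argument, while `PicklableCompileError` survives with `cmd` and `output` intact.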

fbcode repro: [D79977541](https://www.internalfb.com/diff/D79977541)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160294
Approved by: https://github.com/masnesral
2025-08-12 22:03:36 +00:00
cbffde7745 Factor out the strings to templates for better editor integration (#160357)
# Summary

More code motion. TL;DR: install 'Better Jinja' in VS Code and you get syntax highlighting for the templates.

Before
<img width="776" height="926" alt="Screenshot 2025-08-11 at 2 41 08 PM" src="https://github.com/user-attachments/assets/10868b31-f8ac-4cf5-99fe-19b8789ce06b" />

After:
<img width="1184" height="1299" alt="Screenshot 2025-08-11 at 2 40 27 PM" src="https://github.com/user-attachments/assets/45203765-589e-4d76-8196-d895a2f2fbf6" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160357
Approved by: https://github.com/eellison
2025-08-12 21:59:54 +00:00
78a2fe1d42 [TorchScript] thread-safe ErrorReport::CallStack (#160386)
Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings.

The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called**. When this happens, it causes a segfault.

This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults.

Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults.

Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160386
Approved by: https://github.com/eellison
2025-08-12 21:59:04 +00:00
f8f0414a59 fix cpp builder to avoid missing-source compile error (#160354)
Summary:
the condition
```
if config.is_fbcode() and (not self._aot_mode or self._use_relative_path):
    sources = [os.path.basename(i) for i in sources]
```
unintentionally (?) stripped paths even when use_relative_path was False (as long as aot_mode was False), breaking local tests that rely on absolute temp-file paths.
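
To see why, the condition can be evaluated for every combination of the flags. This is just the boolean expression quoted above wrapped in a hypothetical helper `strips_paths`; it does not show the replacement condition used by the fix.

```python
from itertools import product


def strips_paths(is_fbcode, aot_mode, use_relative_path):
    # Same boolean expression as the original condition quoted above.
    return is_fbcode and (not aot_mode or use_relative_path)


def truth_table():
    # Only the fbcode case is interesting; outside fbcode the condition is always False.
    return {
        (aot_mode, use_relative_path): strips_paths(True, aot_mode, use_relative_path)
        for aot_mode, use_relative_path in product([False, True], repeat=2)
    }


if __name__ == "__main__":
    for (aot_mode, use_relative_path), result in truth_table().items():
        print(f"aot_mode={aot_mode} use_relative_path={use_relative_path} -> strip={result}")
```

The row with `aot_mode=False` and `use_relative_path=False` evaluates to `True`, which is the case that broke tests relying on absolute temp-file paths.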

Fixes internal issue:
```

FAILED (errors=1)

CppCompileError: C++ compile error

Command:
/mnt/gvfs/third-party2/llvm-fb/0f1f083aa5508772f3db24bf4f697bc118ba0958/17/platform010/72a2ff8/bin/clang-17 czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp -shared -fPIC -O3 -DNDEBUG -fno-trapping-math -funsafe-math-optimizations -ffinite-math-only -fno-signed-zeros -fno-math-errno -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -Werror=ignored-optimization-argument -g -o /re_tmp/tmpsp58ya2h/zy/test_symbol.so

Output:
clang-17: error: no such file or directory: 'czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp'
clang-17: error: no input files
```

Reviewed By: clee2000

Differential Revision: D80025417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160354
Approved by: https://github.com/benjaminglass1, https://github.com/clee2000
2025-08-12 21:36:22 +00:00
4d419a7461 Add pad and narrow to torch/csrc/stable/ops.h (#159328)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159328
Approved by: https://github.com/janeyx99
ghstack dependencies: #159507
2025-08-12 21:29:49 +00:00
655137b678 Update torch::stable::Tensor() default constructor (#159507)
Allows things like

```cpp
Tensor cu_seqlens_q;
if (...) {
   cu_seqlens_q = ...
}
...
```

Also adds `torch::stable::Tensor.defined()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159507
Approved by: https://github.com/janeyx99
2025-08-12 21:29:49 +00:00
f27232a213 [ROCm] Limit number of values per thread for reductions on three dimensions (#159652)
In the current implementation of reductions in three dimensions for AMD GPUs, the number of values per thread is unbounded and can end up in the hundreds of thousands for certain tensors, which is bad for performance. This patch fixes the issue by increasing the parallelism, lowering the number of values per thread to a reasonable limit, i.e. fewer than 2048 values per thread. The performance gains can be between 10x and 17x for certain examples where the number of values per thread was originally very high.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159652
Approved by: https://github.com/jeffdaily
2025-08-12 21:15:56 +00:00
c24ca7f4bf [FSDP][Collectives] skipping allgather when world size is 1 (#160135)
**Summary:** In its current state, FSDP collectives use CUDA synchronizations and communication ops regardless of the world size. However, now that replicate will use FSDP, there will be instances where group size = 1 and these synchronizations and ops are used needlessly. I have updated fsdp_params group to skip the foreach_all_gather and foreach_all_gather_copy_out APIs when world_size = 1. I have created a test that uses CommDebugMode to verify that the all gather comm has been removed. I also edited an affected test which used 1-way FSDP by verifying and changing its assert statements for CommDebugMode. Below, I have included the link to the profile trace verifying these two APIs were skipped and two test commands.

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/anshulsi_f846ac3b-9467-4060-8e36-8cc3bc4449c3_devgpu263.prn2.facebook.com_652183.1753822140871934814.pt.trace.json

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160135
Approved by: https://github.com/weifengpy
2025-08-12 21:13:29 +00:00
b4596895b9 [DTensor] Registers sharding rule for rms_norm (#159692)
Reduces collective calls in the forward pass from 2 to 1

In #158716 I added the sharding rule for the backward pass but didn't add the forward pass as it didn't get dispatched. After #159324 this should get properly dispatched hence I am adding it now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159692
Approved by: https://github.com/tianyu-l
2025-08-12 21:05:24 +00:00
5a9c4cfce4 [Fix XPU CI][Inductor UT] Fix test cases broken by community. (#160403)
Fixes #160243, Fixes #160244, Fixes #160245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160403
Approved by: https://github.com/janeyx99
2025-08-12 21:02:44 +00:00
a354fa91e2 added class or module info for functions blocked by weight-only load (#159935)
Fixes #152985
In #152985, users were confused about why weights-only load failed even though functions were registered in safe_globals.
Because the error message didn't make the critical failure reason clear, they couldn't figure out that only some functions were missing from the safe_globals registration.
This fix makes that point clearer.

Here's the new error message. The blocked function information follows the warning message after a line break to make it stand out.
```
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error:

Trying to call reduce for unrecognized function <built-in method _unpickle of type object at 0x641e8a57d1f0> which belongs to <class 'zoneinfo.ZoneInfo'>

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

To execute this test, run the following from the base repo dir:
    python test/test_serialization.py TestSerialization.test_weights_only_with_safe_zoneinfo_unpickle_registration_success

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159935
Approved by: https://github.com/mikaylagawarecki
2025-08-12 20:52:25 +00:00
f95b58c284 Remove usage of fsspec in HF consolidation script (#159392)
Moving towards just supporting local storage to take advantage of HF APIs such as safe_open. This was already done in the Storage component in https://github.com/pytorch/pytorch/pull/159405. This PR removes fsspec usages in the consolidation script and relies on local storage only.

Differential Revision: [D78997975](https://our.internmc.facebook.com/intern/diff/D78997975/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159392
Approved by: https://github.com/sibuachu
2025-08-12 20:41:06 +00:00
8e6a313858 Add ownership token when needed on GradientEdge (#160098)
We can avoid the token by introducing PyObject preservation for THPFunction. But I think it will be too much complexity given that this kind of issue is very rare.
Happy to be talked into doing it though if someone really wants to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160098
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2025-08-12 20:14:18 +00:00
7e91394955 Support NUMA Binding for Callable Entrypoints (#160163)
# Context
This is an extension of #149334.

# This PR
Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`.

Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary.

Other changes:
* Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).)
* Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints.

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
Using [this benchmark](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), ran

```
$ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt
```

and observed
* 6.6% remote memory accesses with 'node' bindings
* 11.6% remote without bindings

I also ran a similar test with `str` entrypoints, as before, just to be sure they still work.

NOTE: [--run-path triggers the code to be run inside a `Callable`.](017259f9c6/torch/distributed/run.py (L870))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160163
Approved by: https://github.com/d4l3k
2025-08-12 20:08:49 +00:00
89654db1ab [inductor] fix triton bucketize mask propagation (#159961)
See 6b414f56a4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159961
Approved by: https://github.com/eellison
2025-08-12 19:59:32 +00:00
2d0cdee394 move thread-local capture mode guard to include work.isStarted (#160398)
Per title, should fix capture errors that happen because nccl watchdog races with capture start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160398
Approved by: https://github.com/aorenste
2025-08-12 19:25:04 +00:00
eqy
9903ca4f70 [cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)
The native kernel doesn't support batch splitting so the previous check wasn't aggressive enough in dispatching to cuDNN

https://github.com/pytorch/pytorch/issues/155225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156140
Approved by: https://github.com/ngimel, https://github.com/atalman
2025-08-12 18:07:41 +00:00
f341077ce4 Revert "[ROCm] Support large inputs for coalesceValuesKernel (#158281)"
This reverts commit a7abf57aabec0ce686092e2d66e53ba185dbc56b.

Reverted https://github.com/pytorch/pytorch/pull/158281 on behalf of https://github.com/clee2000 due to broke windows cuda build? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16915172288/job/47927141460) [HUD commit link](a7abf57aab).  Not caught b/c PR didn't have ciflow/trunk ([comment](https://github.com/pytorch/pytorch/pull/158281#issuecomment-3180408766))
2025-08-12 17:57:57 +00:00
3cec82a7e9 Ensure outer aliasing on DTensor matches inner aliasing (#158954)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158954
Approved by: https://github.com/albanD, https://github.com/wconstab
2025-08-12 17:47:48 +00:00
ee9f8ba11d [ROCm] Use opportunistic fastatomics based on heuristics (#159430)
* Opportunistic fast atomics works better with small sizes, since there is more chance of lanes doing atomics on the same address

Co-author: @amd-hhashemi

Reproducer:
```
import time
import torch

x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float)
ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda')
src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float)

for _ in range(20):
    x.index_add_(0, ind, src)

start_time = time.time()
for i in range(100):
    x.index_add_(0, ind, src)
torch.cuda.synchronize()
end_time = time.time()
mean_time = (end_time - start_time)/100
print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us")
```

Perf numbers:
```
Before:
Avg time for index_add_: 25652.16 us

After:
Avg time for index_add_: 2675.15 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159430
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-08-12 17:13:54 +00:00
1f4057c11a [inductor] remove no_x_dim (#159810)
no_x_dim is used to indicate that a reduction operates on a single row, and data loaded for the reduction is 1-dimensional.

no_x_dim was introduced in https://github.com/pytorch/pytorch/pull/102444 - in which there was bad perf in some reductions, and using 1D tensors fixed the perf issue.

However, it appears that this perf issue no longer exists in current Triton versions. https://github.com/pytorch/pytorch/pull/118822 checked this, and we can also check this on H100 benchmarks (linked below). And another motivation for removing this behavior is that it enables larger loads, which we observe is necessary for good performance on certain shapes on Blackwell.

H100 inference benchmarks:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

H100 training benchmarks:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

Overall, the benchmarks show minimal change in performance.

Differential Revision: [D79599286](https://our.internmc.facebook.com/intern/diff/D79599286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159810
Approved by: https://github.com/ngimel, https://github.com/eellison
2025-08-12 17:10:31 +00:00
94b91a8763 [redone][pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#160352)
Summary:
Writing torch.compile worker logs to dedicated_log_rank{RANK} if we're running on mast.
ref: D79456310 (got reverted because of linter)

Testing:
Refer to Differential Revision: D79917440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160352
Approved by: https://github.com/masnesral
2025-08-12 16:49:08 +00:00
a7abf57aab [ROCm] Support large inputs for coalesceValuesKernel (#158281)
# Description

`.coalesce` cannot handle large inputs on ROCm due to the maximum grid size limit.

This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for the original `Y` on ROCm to avoid this limitation.

Confirmed the new approach can handle large inputs. Correctness needs validation.
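
The index arithmetic behind such a split can be sketched in Python. This is an illustration under stated assumptions, not the kernel code from this PR: `split_grid` and `linear_block_id` are hypothetical helpers, and `max_dim` is a caller-supplied limit rather than a value taken from the ROCm documentation.

```python
def split_grid(num_blocks, max_dim):
    """Split a 1D block count into (x, y) with x <= max_dim and x * y >= num_blocks."""
    if num_blocks <= 0 or max_dim <= 0:
        raise ValueError("num_blocks and max_dim must be positive")
    x = min(num_blocks, max_dim)
    # Ceiling division without floating point.
    y = -(-num_blocks // x)
    return x, y


def linear_block_id(block_x, block_y, grid_x):
    """Recover the original 1D block index from the 2D block coordinates."""
    return block_y * grid_x + block_x
```

Because `x * y` can exceed `num_blocks`, a kernel using this layout has to skip blocks whose recovered index is out of range. The sketch also does not check that `y` itself fits within a per-axis limit.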

# Testing Command

`python torch_spmv.py 22500000 272500000`

## Script `torch_spmv.py`

``` python
import torch
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch"
    )
    parser.add_argument("n", type=int, help="Size of the NxN matrix")
    parser.add_argument("nnz", type=int, help="Number of non-zero entries")
    return parser.parse_args()

def main():
    args = parse_args()
    n = args.n
    nnz = args.nnz
    dtype = torch.float32
    device = torch.device('cuda')

    # Generate random indices for the sparse matrix in COO format.
    torch.manual_seed(42)
    rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    indices = torch.stack([rows, cols], dim=0)

    # Generate random values.
    values = torch.randn(nnz, dtype=torch.float32, device=device)

    # Create the sparse COO matrix and move it to the target device.
    sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=torch.float32, device=device)
    sparse_matrix = sparse_matrix.coalesce()

    # Generate a random dense vector.
    dense_vector = torch.randn(n, dtype=torch.float32, device=device)

    # Perform sparse matrix - dense vector multiplication.
    # Using torch.sparse.mm which expects a 2D tensor for the vector.
    result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze()
    # result = torch.mv(sparse_matrix, dense_vector)

    # Print the result.
    print("Result of the multiplication:")
    print(torch.sum(result))

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-08-12 16:42:55 +00:00
f7b2f3314c Revert "[triton_heuristics] Optimize the triton launcher in pt2 (#160000)"
This reverts commit d0e2240f680ea2a553f7ee8188f52482e130bfd0.

Reverted https://github.com/pytorch/pytorch/pull/160000 on behalf of https://github.com/davidberard98 due to D80054972 failing with test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1_tdlp_1 ([comment](https://github.com/pytorch/pytorch/pull/160000#issuecomment-3180144676))
2025-08-12 16:33:02 +00:00
9d37c960a4 [ROCm][CI] use new benchmark image for dynamo (#160421)
Follow-up to #160047 that separated the rocm image into default CI and benchmarks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160421
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-12 16:07:19 +00:00
b219ca2a00 Revert "Update triton xpu commit to support python 3.14 (#160183)"
This reverts commit 7fbc22855c17741ae016992803b2e147a13aa22d.

Reverted https://github.com/pytorch/pytorch/pull/160183 on behalf of https://github.com/clee2000 due to I'm not sure how, but it seems to have broken inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration [GH job link](https://github.com/pytorch/pytorch/actions/runs/16911267995/job/47917091939) [HUD commit link](7fbc22855c).  Maybe because the docker build changed?  Note to self: not bad TD ([comment](https://github.com/pytorch/pytorch/pull/160183#issuecomment-3179840160))
2025-08-12 15:29:19 +00:00
b7db86600a Fix Tensor illustration, use permalinks for image embedding in Readme.md (#160416)
Fixes Tensor illustration being broken on pypi.org. Also uses permalinks instead of links to images for embedding as per this suggestion of Alban: https://github.com/pytorch/pytorch/pull/160187#discussion_r2262978006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160416
Approved by: https://github.com/malfet
2025-08-12 15:15:12 +00:00
9708fcf92d Account for triton kernel source code hidden in custom ops properly in AOTAutogradCache (#160120)
This PR fixes a bug where user-defined triton kernels hidden behind `triton_op` do not register source code changes. If a user changes *only* a triton kernel's source code, the change is missed: because triton kernels are hidden under the custom op, dynamo hasn't traced into them yet.

This means at AOTAutograd time, we don't know the list of triton kernels that are defined by custom ops. This is an initial fix for the issue by parsing the AST of the custom op looking for triton kernels. This won't catch more degenerate cases if the custom op calls other custom ops/functions that then call triton kernels, and then the toplevel compiled graph doesn't know about it. To handle that, we'd have to trace through the custom op at dynamo time.

This should handle 99% of cases, though. I added an expectedFailure test to show the limitation.
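
A rough sketch of the AST-scanning idea, using only the standard library. `called_names` is a hypothetical helper, not the code added by this PR; it only collects the names a function body calls directly, including `kernel[grid](...)` style launches, and has the same limitation with nested calls described above.

```python
import ast
import textwrap


def called_names(source):
    """Return the set of plain names that are called anywhere in `source`."""
    tree = ast.parse(textwrap.dedent(source))
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # A launch such as kernel[grid](...) is a call on a subscript expression.
            if isinstance(func, ast.Subscript):
                func = func.value
            if isinstance(func, ast.Name):
                names.add(func.id)
    return names
```

A caller could then look each name up in the function's globals and keep the ones that resolve to kernel objects; how that lookup and the cache-key update work in AOTAutogradCache is not shown here.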

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160120
Approved by: https://github.com/zou3519
2025-08-12 14:11:06 +00:00
a288b15ea9 [CI] Reduce XPU Windows build time (#159763)
Reduce the time cost from 2.5 hours to about 1.5 hours.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159763
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-08-12 14:04:29 +00:00
7fbc22855c Update triton xpu commit to support python 3.14 (#160183)
Follow-up to PR #159725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160183
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-08-12 14:02:36 +00:00
f33ce40bc0 [bucketing] Bucket only adjacent collectives to prevent reordering (#159983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159983
Approved by: https://github.com/wconstab, https://github.com/eellison
2025-08-12 11:57:00 +00:00
4d5b3f2d5a [dynamo][guards] Install dict watchers for recursive dict tag optimization (#159796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159796
Approved by: https://github.com/jansel
2025-08-12 09:49:11 +00:00
f990490a23 Add label_smoothing param in nn.BCELoss and nn.BCEWithLogitsLoss (#150282)
Fixes #91545

## Changes

- Add `label_smoothing` param and docs
- Add test case for `label_smoothing`
- Remove duplicate description in `nn.BCELoss` and `nn.BCEWithLogitsLoss`
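
The commit message does not spell out the smoothing formula, so the following is only a sketch under an assumption: that binary label smoothing follows the common convention of pulling each target toward 0.5, i.e. `t * (1 - eps) + 0.5 * eps`. The function names are hypothetical and plain Python is used instead of `torch` to keep the example self-contained.

```python
import math


def smooth_binary_targets(targets, label_smoothing):
    """Pull each binary target toward 0.5 (assumed convention, not confirmed by the PR)."""
    return [t * (1.0 - label_smoothing) + 0.5 * label_smoothing for t in targets]


def binary_cross_entropy(probs, targets):
    """Mean binary cross entropy for probabilities strictly between 0 and 1."""
    total = 0.0
    for p, t in zip(probs, targets):
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


probs = [0.9, 0.2, 0.7]
hard = [1.0, 0.0, 1.0]
soft = smooth_binary_targets(hard, label_smoothing=0.2)

print(binary_cross_entropy(probs, hard))
print(binary_cross_entropy(probs, soft))
```

With smoothed targets, confident correct predictions are penalized slightly more than with hard targets, which is the usual motivation for the parameter.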

##  Test Result

```bash
pytest -s test/test_nn.py -k test_bce
```

![image](https://github.com/user-attachments/assets/30c0b7fe-fe49-4aa0-9b05-4d70403a7b05)

![image](https://github.com/user-attachments/assets/4fe3fd1c-54b8-4012-afd9-133ce9fb4964)

![image](https://github.com/user-attachments/assets/5cad019a-3a4c-475a-9fde-9c1acad5792d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150282
Approved by: https://github.com/cyyever, https://github.com/mikaylagawarecki
2025-08-12 09:37:03 +00:00
b9003ed3d8 Dynamo Deep Dive Documentation Fix (#158860)
changed SourceBuilder to VariableBuilder

Fixes #158447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158860
Approved by: https://github.com/mlazos
2025-08-12 08:53:33 +00:00
fea7e9dd37 extract shape in _view_has_unbacked_input (#160255)
Summary: We were still getting DDE on reshape. Looking deeper, I found an issue in `_view_has_unbacked_input`: when the input is [[,,]], it needs to be normalized to [..]
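
As an illustration of the normalization described above, the sketch below flattens a shape passed as a single nested sequence into a flat tuple. The helper name `extract_shape` is hypothetical and this is not the code from the PR; it only shows the idea that `view((2, 3))` and `view(2, 3)` should be treated the same way.

```python
def extract_shape(*shape):
    """Normalize varargs so that ([2, 3],) and (2, 3) both become (2, 3)."""
    if len(shape) == 1 and isinstance(shape[0], (list, tuple)):
        return tuple(shape[0])
    return tuple(shape)


print(extract_shape([2, 3]))
print(extract_shape(2, 3))
```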

Test Plan:
existing tests.

Rollback Plan:

Differential Revision: D79951119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160255
Approved by: https://github.com/bobrenjc93
2025-08-12 08:38:19 +00:00
9a0f7a3bb0 [retry-land][pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#160348)
refer: https://github.com/pytorch/pytorch/pull/159655

The earlier PR failed on dynamo/test_utils.py::TestDynamoTimed::test_dynamo_timed.
Updated test_dynamo_timed + re-ran locally to test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160348
Approved by: https://github.com/masnesral
2025-08-12 06:24:54 +00:00
01bcf9a40d Bump transformers pin (#159291)
Trying to update hf pin.

Benchmarking run to figure out issues

<img width="1356" height="123" alt="image" src="https://github.com/user-attachments/assets/fbc435f3-a7cb-4280-9636-2ea6d15d7b6d" />

Retrying - https://github.com/pytorch/pytorch/pull/156118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159291
Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-08-12 05:14:17 +00:00
8d3d1c8443 [dynamo] fixes to propagate tag safeness (#159807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159807
Approved by: https://github.com/jansel
2025-08-12 04:50:13 +00:00
0f3b10b8ee [audio hash update] update the pinned audio hash (#160384)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160384
Approved by: https://github.com/pytorchbot
2025-08-12 04:38:04 +00:00
5f1010fbb3 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Ran the same diff on two different days; both runs show a speedup on average.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-12 04:37:58 +00:00
edaa151d0d [CI] Move CUDA tests to trunk workflow (#160379)
The trunk workflow runs before a PR is merged anyway, but about 3X less
frequently than the pull workflow, according to [Flambeau](https://pytorchci.grafana.net/public-dashboards/1c571e79090443eaaa9811db71f8d23b)
<img width="796" height="573" alt="image" src="https://github.com/user-attachments/assets/0235e610-4e1c-4be5-88bf-ea8278d1c656" />

I.e. this will probably result in a somewhat longer time to signal, but considering that the frequency of changes to eager PyTorch-on-CUDA has slowed down and Inductor changes are decorated with ciflow/inductor, this looks like an acceptable tradeoff to reduce costs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160379
Approved by: https://github.com/izaitsevfb
2025-08-12 04:23:50 +00:00
10bc36fe84 Get tensor subclasses and torch.library.triton_op to dispatch correctly (#160341)
Short-term fix for https://github.com/pytorch/pytorch/issues/160333

The problem is:
1) `triton_op` adds a decomposition for FunctionalTensorMode for this operation
2) Tensor Subclasses rely on FunctionalTensorMode's `__torch_dispatch__` returning NotImplemented.
3) `triton_op`'s FunctionalTensorMode decomposition takes precedence over FunctionalTensorMode's decomposition.

The easy fix is to copy-paste the FunctionalTensorMode's NotImplemented
return logic into the decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160341
Approved by: https://github.com/drisspg
2025-08-12 04:09:37 +00:00
32e5e2f596 [vllm hash update] update the pinned vllm hash (#160259)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160259
Approved by: https://github.com/pytorchbot
2025-08-12 04:04:53 +00:00
bfc873d02e [ROCm][Windows] Revert copying hipblaslt and rocblas dirs. (#159083)
This reverts the changes from b367e5f6a6. This will also close https://github.com/pytorch/pytorch/pull/158922.

Since 30387ab2e4, ROCm is bootstrapped using the 'rocm' Python module which contains these files (see https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md), so they do not need to be bundled into torch/lib.

There was also a bug in here - if `ROCM_DIR` is unset, the code crashes:
```
  File "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\setuptools\_distutils\dist.py", line 1002, in run_command
    cmd_obj.run()
  File "D:\b\pytorch_main\setup.py", line 853, in run
    rocm_dir_path = Path(os.environ["ROCM_DIR"])
                         ~~~~~~~~~~^^^^^^^^^^^^
  File "<frozen os>", line 714, in __getitem__
KeyError: 'ROCM_DIR'
```
The code could have checked for `ROCM_PATH` too.
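
For illustration, a lookup that avoids the `KeyError` above and also falls back to `ROCM_PATH` could look like the sketch below. The function name `find_rocm_dir` is hypothetical; the reverted code in `setup.py` did not contain it.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_rocm_dir(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the ROCm directory from ROCM_DIR or ROCM_PATH, or None if neither is set."""
    for name in ("ROCM_DIR", "ROCM_PATH"):
        value = env.get(name)  # .get() returns None instead of raising KeyError
        if value:
            return Path(value)
    return None


rocm_dir = find_rocm_dir()
if rocm_dir is None:
    print("ROCm directory not configured; skipping ROCm-specific steps")
else:
    print(f"Using ROCm directory: {rocm_dir}")
```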

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159083
Approved by: https://github.com/jeffdaily
2025-08-12 02:45:49 +00:00
eed9dbf70f [ROCm] Add torch/_rocm_init.py to .gitignore. (#159806)
Follow-up to https://github.com/pytorch/pytorch/pull/155285.

Build scripts like https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py generate this file with contents like:

```python
def initialize():
    import rocm_sdk
    rocm_sdk.initialize_process(
        preload_shortnames=['amd_comgr', 'amdhip64', 'hiprtc', 'hipblas', 'hipfft', 'hiprand', 'hipsparse', 'hipsolver', 'hipblaslt', 'miopen'],
        check_version='7.0.0rc20250804')
```

We may also have https://github.com/pytorch/pytorch/blob/main/tools/amd_build/build_amd.py do the same thing as more of that build support moves here into the upstream PyTorch repository itself (see https://github.com/pytorch/pytorch/issues/159520).

This file is then loaded if present here: a7f3bdf550/torch/__init__.py (L145-L157)
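
The "load if present" pattern can be sketched as below. This is a generic illustration rather than the exact code in `torch/__init__.py`; the helper name `run_optional_initializer` is hypothetical, and the sketch assumes the generated module exposes an `initialize()` function as in the example above.

```python
import importlib


def run_optional_initializer(module_name: str) -> bool:
    """Import a generated module if it exists and call its initialize() hook.

    Returns True if the module was found and initialized, False if it is absent.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The file is produced by build scripts, so its absence is not an error.
        return False
    module.initialize()
    return True


print(run_optional_initializer("module_that_was_never_generated"))
```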

Given that the file is generated by build scripts, I think adding it to `.gitignore` makes sense, as that will prevent accidental check-ins and keep local history cleaner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159806
Approved by: https://github.com/jeffdaily
2025-08-12 02:24:21 +00:00
be53f609aa fix retaining multimem in symmetric memory (#160343)
fixes OOM in #160289

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160343
Approved by: https://github.com/eqy
2025-08-12 02:03:20 +00:00
95210cc409 [BE] Isolate pre-push hook dependencies in dedicated virtual environment (#160048)
This adds two changes:
- Isolates pre-push hook dependencies in a dedicated venv, so they no longer affect your system environment
- Lets you manually run the pre-push lintrunner (including with lintrunner -a) by invoking `python scripts/lintrunner.py [-a]` (it's ugly, but better than nothing...for now)

This is a follow up to:
- https://github.com/pytorch/pytorch/pull/158389

## Problem
The current pre-push hook setup installs lintrunner and related dependencies globally, which makes developers nervous about system pollution and can cause version conflicts with existing installations.

Also, if the pre-push lintrunner found errors, you had to hope your normal lintrunner could fix them (which wasn't always the case, e.g. if those errors only manifested in certain python versions)

##  Key Changes:
  - Isolated Environment: Creates .git/hooks/linter/.venv/ with Python 3.9 (the python used in CI) and an isolated lintrunner installation
  - User-Friendly CLI: New python scripts/lintrunner.py wrapper allows developers to run lintrunner (including -a auto-fix) from any environment
  - Simplified Architecture: Eliminates pre-commit dependency entirely - uses direct git hooks

  File Changes:
  - scripts/setup_hooks.py: Rewritten to create isolated uv-managed virtual environment
  - scripts/lintrunner.py: New wrapper script with shared hash management logic
  - scripts/run_lintrunner.py: Removed (functionality merged into lintrunner.py)
  - .pre-commit-config.yaml: Removed (no longer needed)

##  Usage:
```
  # Setup (run once)
  python scripts/setup_hooks.py

  # Manual linting (works from any environment)
  python scripts/lintrunner.py        # Check mode
  python scripts/lintrunner.py -a     # Auto-fix mode

  # Git hooks work automatically
  git push  # Runs lintrunner in isolated environment

  # Need to skip the pre-push hook?
  git push --no-verify
```

##  Benefits:
  -  Zero global dependency installation
  -  Per-repository isolation prevents version conflicts
  -  Full lintrunner functionality is now accessible

##  Implementation Notes:
  - Virtual env is kept in a dedicated dir in .git, to keep per-repo mechanics
  - lintrunner.py does not need to be invoked from a specific venv.  It'll invoke the right venv itself.
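
A wrapper can invoke the right venv without being run from inside it by resolving the executable path directly. The sketch below assumes the `.git/hooks/linter/.venv/` location mentioned above and the standard venv layout (`bin/` on POSIX, `Scripts/` on Windows); the helper names are hypothetical and this is not the contents of `scripts/lintrunner.py`.

```python
import os
from pathlib import Path


def venv_executable(repo_root: Path, name: str, windows: bool = (os.name == "nt")) -> Path:
    """Path to an executable inside the hook's dedicated venv (assumed layout)."""
    venv = repo_root / ".git" / "hooks" / "linter" / ".venv"
    if windows:
        return venv / "Scripts" / f"{name}.exe"
    return venv / "bin" / name


def lintrunner_command(repo_root: Path, extra_args: list[str]) -> list[str]:
    """Build the argv for running lintrunner from the isolated venv."""
    return [str(venv_executable(repo_root, "lintrunner")), *extra_args]


# The command could then be passed to subprocess.run(); here it is only printed.
print(lintrunner_command(Path("."), ["-a"]))
```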

A minor bug: It tends to garble the lintrunner output a bit, like the screenshot below shows, but I haven't found a workaround so far and it remains understandable to users:
<img width="241" height="154" alt="image" src="https://github.com/user-attachments/assets/9496f925-8524-4434-8486-dc579442d688" />

## What's next?
Features that could be added:
- Check for lintrunner updates, auto-update if needed
- Depending on dev response, this could be enabled by default for all pytorch/pytorch environments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160048
Approved by: https://github.com/seemethere
2025-08-12 01:58:46 +00:00
7a974a88f2 [ROCm] Fix resource_strings.h (#159996)
This PR fixes errors like the one below:

```
[rank7]: RuntimeError: /tmp/comgr-c3c81b/input/CompileSourceejOPx6:34:8: error: unknown type name 'uint64_t'; did you mean
'__hip_internal::uint64_t'? [rank7]: 34 | if(((uint64_t) t0.data) % (4 * sizeof(half)) != 0) flag_vec4 = false;
```

The following datatypes need to be defined in `torch/csrc/jit/codegen/fuser/cuda/resource_strings.h` for ROCm versions >= 7.0.

```
typedef unsigned char uint8_t;
typedef signed char int8_t;
typedef short int  int16_t;
typedef long long int int64_t;
typedef unsigned long long int uint64_t;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159996
Approved by: https://github.com/pruthvistony, https://github.com/Skylion007, https://github.com/jeffdaily
2025-08-12 01:58:02 +00:00
f3f159ff8c [BE][cutlass backend] Reduce severity of log message for no cutlass config found (#160148)
This is not really a problem. Sometimes we cannot find a cutlass config due to shape, e.g. when k is odd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160148
Approved by: https://github.com/mlazos, https://github.com/Skylion007
2025-08-12 01:41:58 +00:00
b90feeac86 [BE][cutlass backend] Fix subproc addmm tests (#160295)
Differential Revision: [D79977421](https://our.internmc.facebook.com/intern/diff/D79977421/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160295
Approved by: https://github.com/jingsh
2025-08-12 01:41:06 +00:00
0d40ff3b49 [inductor] fix test_different_file_paths_local_pgo on Windows. (#160382)
fix test_different_file_paths_local_pgo on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160382
Approved by: https://github.com/angelayi
2025-08-12 01:35:39 +00:00
cae2b5e3d2 [ROCm][Windows] Enable USE_ROCM, disable USE_RCCL on Windows. (#159079)
This allows setting `USE_ROCM` on Windows. A few other patches are still required to build (see https://github.com/ROCm/TheRock/issues/589), but we have instructions using open source code and rocm python packages available at https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch#build-pytorch-with-rocm-support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159079
Approved by: https://github.com/jeffdaily
2025-08-12 01:28:20 +00:00
ee89cc7a0a [ROCm][Windows] Fix LoadHIP handling of environment variable paths on Windows. (#159080)
See https://cmake.org/cmake/help/latest/command/file.html#path-conversion. Paths stored in environment variables may use `/` or `\` (e.g. on Windows), while cmake-style paths always use `/`.

This fixes configure errors like:
```
CMake Error at D:/b/pytorch_main/build/CMakeFiles/CMakeScratch/TryCompile-srhq07/CMakeLists.txt:2 (set):
  Syntax error in cmake code at

    D:/b/pytorch_main/build/CMakeFiles/CMakeScratch/TryCompile-srhq07/CMakeLists.txt:2

  when parsing string

    D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\_rocm_sdk_devel/cmake/;D:/b/pytorch_main/cmake/Modules

  Invalid character escape '\p'.

CMake Error at D:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/cmake/data/share/cmake-3.31/Modules/Internal/CheckSourceCompiles.cmake:108 (try_compile):
  Failed to configure test project build system.
```

(note the mixed usage of `\` and `/` in that string)
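
The fix itself is in CMake, but the effect of the path conversion can be illustrated in Python: normalizing a Windows path to forward slashes removes the `\p`-style sequences that CMake would otherwise read as escapes. The helper name `to_forward_slashes` is hypothetical and this sketch is an analogy, not the CMake change.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(path: str) -> str:
    """Convert a Windows path (with `\\` or mixed separators) to `/` separators."""
    return PureWindowsPath(path).as_posix()


mixed = r"D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\_rocm_sdk_devel/cmake"
print(to_forward_slashes(mixed))
```

`PureWindowsPath` works on any host OS, so the conversion behaves the same on Linux and Windows.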

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159080
Approved by: https://github.com/jeffdaily
2025-08-12 00:18:19 +00:00
e63c2b21c1 [PP] Initialize P2P communicators on first step (#160210)
We were hitting hangs in multi-node settings; initializing the NCCL communicators needed for batch P2P ops ahead of time fixes this.

This change adds extra communication since it communicates a dummy tensor to next and previous stage ranks. However, this is only paid on the first step so it is negligible.

Debug history: https://docs.google.com/document/d/1EKVJYmW2hj_VsvDvnSggXhZzJyvMu9dA0iDJWOZAtjY/edit?tab=t.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160210
Approved by: https://github.com/wconstab
2025-08-11 23:46:58 +00:00
3626ba711b [FlexAttention] Swap from and to & for new triton (#160227)
Fixes #158463

On B200 I am getting a bunch of error spew:
```Shell
/tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline
/tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
Triton compilation failed: triton_tem_fused_zeros_1
def triton_tem_fused_zeros_1(arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0):
    PRESCALE_QK : tl.constexpr = False
```
```Shell
74 = arith.subi %170, %166 : i32
          %175 = arith.muli %174, %c128_i32 : i32
          %176 = arith.subi %175, %c64_i32 : i32
          %177 = arith.extui %173 : i1 to i32
          %178 = arith.muli %176, %177 : i32
          %179 = arith.subi %c1_i32, %177 : i32
          %180 = arith.muli %179, %c64_i32 : i32
          %181 = arith.addi %178, %180 : i32
          %182 = arith.muli %181, %c64_i32 : i32
          %183 = tt.splat %182 : i32 -> tensor<64x64xi32>
          %184 = tt.addptr %arg19, %183 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
          %185 = tt.addptr %arg20, %183 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
          %186 = tt.splat %181 : i32 -> tensor<64xi32>
          %187 = arith.addi %arg21, %186 : tensor<64xi32>
          scf.yield %163, %184, %185, %187 : tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>
        }
        %114 = tt.expand_dims %113#3 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
        %115 = arith.cmpi slt, %114, %cst_7 : tensor<1x64xi32>
        %116 = tt.broadcast %115 : tensor<1x64xi1> -> tensor<64x64xi1>
        %117 = tt.load %113#1, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>>
        %118 = tt.dot %46, %117, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
        %119 = arith.mulf %118, %cst_13 : tensor<64x64xf32>
        %120 = arith.mulf %119, %cst_3 : tensor<64x64xf32>
        %121 = arith.select %116, %120, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32>
        %122 = arith.select %115, %cst_4, %cst_5 : tensor<1x64xi1>, tensor<1x64xi1>
        %123 = tt.broadcast %122 : tensor<1x64xi1> -> tensor<64x64xi1>
        %124 = arith.select %123, %121, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32>
        %125 = arith.mulf %124, %cst_2 : tensor<64x64xf32>
        %126 = tt.broadcast %61 : tensor<64x1xf32> -> tensor<64x64xf32>
        %127 = arith.subf %125, %126 : tensor<64x64xf32>
        %128 = math.exp2 %127 : tensor<64x64xf32>
        %129 = tt.load %113#2, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>>
        %130 = tt.dot %51, %129, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
        %131 = tt.expand_dims %55 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
        %132 = tt.broadcast %131 : tensor<64x1xf32> -> tensor<64x64xf32>
        %133 = arith.subf %130, %132 : tensor<64x64xf32>
        %134 = arith.mulf %128, %133 : tensor<64x64xf32>
        %135 = arith.mulf %134, %cst_3 : tensor<64x64xf32>
        %136 = arith.select %116, %135, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32>
        %137 = arith.select %115, %122, %cst_5 : tensor<1x64xi1>, tensor<1x64xi1>
        %138 = tt.broadcast %137 : tensor<1x64xi1> -> tensor<64x64xi1>
        %139 = arith.select %138, %136, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32>
        %140 = arith.truncf %139 : tensor<64x64xf32> to tensor<64x64xf16>
        %141 = tt.trans %117 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
        %142 = tt.dot %140, %141, %113#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
        scf.yield %142 : tensor<64x64xf32>
      } else {
        scf.yield %cst_9 : tensor<64x64xf32>
      }
      %84 = tt.addptr %arg13, %22 : !tt.ptr<i32>, i32
      %85 = tt.load %84 : !tt.ptr<i32>
      %86 = arith.muli %85, %c128_i32 : i32
      %87 = tt.addptr %arg12, %21 : !tt.ptr<i32>, i32
      %88 = tt.load %87 : !tt.ptr<i32>
      %89 = tt.splat %86 : i32 -> tensor<64xi32>
      %90 = arith.addi %89, %14 : tensor<64xi32>
      %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
      %92 = arith.muli %91, %cst_11 : tensor<1x64xi32>
      %93 = tt.addptr %71, %92 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32>
      %94 = tt.broadcast %93 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
      %95 = tt.addptr %94, %74 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %96 = tt.addptr %76, %92 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32>
      %97 = tt.broadcast %96 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
      %98 = tt.addptr %97, %74 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %99 = arith.muli %88, %c2_i32 : i32
      %100 = arith.minsi %99, %c4_i32 : i32
      %101 = arith.cmpi sge, %100, %c1_i32 : i32
      %102 = scf.if %101 -> (tensor<64x64xf32>) {
        %112 = arith.subi %100, %c1_i32 : i32
        %113:4 = scf.for %arg17 = %c0_i32 to %112 step %c1_i32 iter_args(%arg18 = %83, %arg19 = %95, %arg20 = %98, %arg21 = %90) -> (tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>)  : i32 {
          %137 = tt.expand_dims %arg21 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
          %138 = arith.cmpi slt, %137, %cst_7 : tensor<1x64xi32>
          %139 = tt.broadcast %138 : tensor<1x64xi1> -> tensor<64x64xi1>
          %140 = tt.load %arg19, %139, %cst_8 : tensor<64x64x!tt.ptr<f16>>
          %141 = tt.dot %46, %140, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %142 = arith.mulf %141, %cst_13 : tensor<64x64xf32>
          %143 = arith.mulf %142, %cst_3 : tensor<64x64xf32>
          %144 = arith.mulf %143, %cst_2 : tensor<64x64xf32>
          %145 = tt.broadcast %61 : tensor<64x1xf32> -> tensor<64x64xf32>
          %146 = arith.subf %144, %145 : tensor<64x64xf32>
          %147 = math.exp2 %146 : tensor<64x64xf32>
          %148 = tt.load %arg20, %139, %cst_8 : tensor<64x64x!tt.ptr<f16>>
          %149 = tt.dot %51, %148, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %150 = tt.expand_dims %55 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
          %151 = tt.broadcast %150 : tensor<64x1xf32> -> tensor<64x64xf32>
          %152 = arith.subf %149, %151 : tensor<64x64xf32>
          %153 = arith.mulf %147, %152 : tensor<64x64xf32>
          %154 = arith.mulf %153, %cst_3 : tensor<64x64xf32>
          %155 = arith.truncf %154 : tensor<64x64xf32> to tensor<64x64xf16>
          %156 = tt.trans %140 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
          %157 = tt.dot %155, %156, %arg18, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %158 = arith.divsi %arg17, %c2_i32 : i32
          %159 = tt.addptr %84, %158 : !tt.ptr<i32>, i32
          %160 = tt.load %159 evictionPolicy = evict_last : !tt.ptr<i32>
          %161 = arith.addi %158, %c1_i32 : i32
          %162 = arith.cmpi slt, %161, %88 : i32
          %163 = tt.addptr %159, %c1_i32 : !tt.ptr<i32>, i32
          %164 = tt.load %163, %162 evictionPolicy = evict_last : !tt.ptr<i32>
          %165 = arith.addi %arg17, %c1_i32 : i32
          %166 = arith.remsi %165, %c2_i32 : i32
          %167 = arith.cmpi eq, %166, %c0_i32 : i32
          %168 = arith.subi %164, %160 : i32
          %169 = arith.muli %168, %c128_i32 : i32
          %170 = arith.subi %169, %c64_i32 : i32
          %171 = arith.extui %167 : i1 to i32
          %172 = arith.muli %170, %171 : i32
          %173 = arith.subi %c1_i32, %171 : i32
          %174 = arith.muli %173, %c64_i32 : i32
          %175 = arith.addi %172, %174 : i32
          %176 = arith.muli %175, %c64_i32 : i32
          %177 = tt.splat %176 : i32 -> tensor<64x64xi32>
          %178 = tt.addptr %arg19, %177 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
          %179 = tt.addptr %arg20, %177 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
          %180 = tt.splat %175 : i32 -> tensor<64xi32>
          %181 = arith.addi %arg21, %180 : tensor<64xi32>
          scf.yield %157, %178, %179, %181 : tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>
        }
        %114 = tt.expand_dims %113#3 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
        %115 = arith.cmpi slt, %114, %cst_7 : tensor<1x64xi32>
        %116 = tt.broadcast %115 : tensor<1x64xi1> -> tensor<64x64xi1>
        %117 = tt.load %113#1, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>>
        %118 = tt.dot %46, %117, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
        %119 = arith.mulf %118, %cst_13 : tensor<64x64xf32>
        %120 = arith.mulf %119, %cst_3 : tensor<64x64xf32>
        %121 = arith.select %116, %120, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32>
        %122 = arith.mulf %121, %cst_2 : tensor<64x64xf32>
        %123 = tt.broadcast %61 : tensor<64x1xf32> -> tensor<64x64xf32>
        %124 = arith.subf %122, %123 : tensor<64x64xf32>
        %125 = math.exp2 %124 : tensor<64x64xf32>
        %126 = tt.load %113#2, %116, %cst_8 : tensor<64x64x!tt.ptr<f16>>
        %127 = tt.dot %51, %126, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
        %128 = tt.expand_dims %55 {axis = 1 : i32} : tensor<64xf32> -> tensor<64x1xf32>
        %129 = tt.broadcast %128 : tensor<64x1xf32> -> tensor<64x64xf32>
        %130 = arith.subf %127, %129 : tensor<64x64xf32>
        %131 = arith.mulf %125, %130 : tensor<64x64xf32>
        %132 = arith.mulf %131, %cst_3 : tensor<64x64xf32>
        %133 = arith.select %116, %132, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32>
        %134 = arith.truncf %133 : tensor<64x64xf32> to tensor<64x64xf16>
        %135 = tt.trans %117 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
        %136 = tt.dot %134, %135, %113#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
        scf.yield %136 : tensor<64x64xf32>
      } else {
        scf.yield %83 : tensor<64x64xf32>
      }
      %103 = tt.splat %33 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>>
      %104 = tt.addptr %103, %37 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
      %105 = tt.broadcast %104 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
      %106 = tt.addptr %105, %42 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %107 = arith.mulf %102, %cst_13 : tensor<64x64xf32>
      %108 = arith.cmpi slt, %40, %cst_11 : tensor<1x64xi32>
      %109 = tt.broadcast %108 : tensor<1x64xi1> -> tensor<64x64xi1>
      %110 = arith.andi %45, %109 : tensor<64x64xi1>
      %111 = arith.truncf %107 : tensor<64x64xf32> to tensor<64x64xf16>
      tt.store %106, %111, %110 : tensor<64x64x!tt.ptr<f16>>
    } else {
      %16 = arith.divsi %0, %c2_i32 : i32
      %17 = arith.muli %0, %c64_i32 : i32
      %18 = tt.splat %17 : i32 -> tensor<64xi32>
      %19 = arith.addi %18, %14 : tensor<64xi32>
      %20 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
      %21 = arith.muli %20, %cst_14 : tensor<64x1xi32>
      %22 = tt.splat %11 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>>
      %23 = tt.addptr %22, %21 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
      %24 = tt.expand_dims %14 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
      %25 = tt.broadcast %23 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
      %26 = tt.broadcast %24 : tensor<1x64xi32> -> tensor<64x64xi32>
      %27 = tt.addptr %25, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %28 = arith.cmpi slt, %20, %cst_10 : tensor<64x1xi32>
      %29 = tt.broadcast %28 : tensor<64x1xi1> -> tensor<64x64xi1>
      %30 = tt.load %27, %29, %cst_8 : tensor<64x64x!tt.ptr<f16>>
      %31 = tt.splat %12 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>>
      %32 = tt.addptr %31, %21 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
      %33 = tt.broadcast %32 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
      %34 = tt.addptr %33, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %35 = tt.load %34, %29, %cst_8 : tensor<64x64x!tt.ptr<f16>>
      %36:2 = scf.for %arg17 = %c0_i32 to %c4_i32 step %c1_i32 iter_args(%arg18 = %cst_9, %arg19 = %cst_9) -> (tensor<64x64xf32>, tensor<64x64xf32>)  : i32 {
        %55 = arith.muli %2, %c4_i32 : i32
        %56 = arith.addi %55, %arg17 : i32
        %57 = arith.muli %56, %c2048_i32 : i32
        %58 = arith.muli %1, %c32768_i32 : i32
        %59 = arith.addi %57, %58 : i32
        %60 = arith.extsi %59 : i32 to i64
        %61 = arith.muli %1, %c16_i32 : i32
        %62 = arith.addi %61, %56 : i32
        %63 = arith.muli %62, %c32_i32 : i32
        %64 = arith.extsi %63 : i32 to i64
        %65 = tt.addptr %arg0, %60 : !tt.ptr<f16>, i64
        %66 = tt.addptr %arg5, %60 : !tt.ptr<f16>, i64
        %67 = tt.addptr %arg3, %64 : !tt.ptr<f32>, i64
        %68 = tt.addptr %arg4, %64 : !tt.ptr<f32>, i64
        %69 = arith.remsi %56, %c16_i32 : i32
        %70 = arith.muli %3, %c16_i32 : i32
        %71 = arith.addi %70, %69 : i32
        %72 = arith.muli %71, %c2_i32 : i32
        %73 = arith.addi %72, %16 : i32
        %74 = tt.addptr %arg11, %73 : !tt.ptr<i32>, i32
        %75 = tt.load %74 : !tt.ptr<i32>
        %76 = arith.muli %75, %c128_i32 : i32
        %77 = tt.addptr %arg10, %73 : !tt.ptr<i32>, i32
        %78 = tt.load %77 : !tt.ptr<i32>
        %79 = tt.splat %76 : i32 -> tensor<64xi32>
        %80 = arith.addi %79, %14 : tensor<64xi32>
        %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
        %82 = arith.muli %81, %cst_11 : tensor<1x64xi32>
        %83 = tt.splat %65 : !tt.ptr<f16> -> tensor<1x64x!tt.ptr<f16>>
        %84 = tt.addptr %83, %82 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32>
        %85 = tt.expand_dims %14 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
        %86 = tt.broadcast %84 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
        %87 = tt.broadcast %85 : tensor<64x1xi32> -> tensor<64x64xi32>
        %88 = tt.addptr %86, %87 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
        %89 = tt.expand_dims %80 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
        %90 = arith.muli %89, %cst_14 : tensor<64x1xi32>
        %91 = tt.splat %66 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>>
        %92 = tt.addptr %91, %90 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
        %93 = tt.broadcast %92 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
        %94 = tt.addptr %93, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
        %95 = arith.muli %78, %c2_i32 : i32
        %96 = arith.minsi %95, %c1_i32 : i32
        %97 = arith.cmpi sge, %96, %c1_i32 : i32
        %98:2 = scf.if %97 -> (tensor<64x64xf32>, tensor<64x64xf32>) {
          %120 = arith.subi %96, %c1_i32 : i32
          %121:5 = scf.for %arg20 = %c0_i32 to %120 step %c1_i32 iter_args(%arg21 = %arg18, %arg22 = %arg19, %arg23 = %88, %arg24 = %94, %arg25 = %80) -> (tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>)  : i32 {
            %167 = tt.expand_dims %arg25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
            %168 = arith.cmpi slt, %167, %cst_1 : tensor<1x64xi32>
            %169 = tt.broadcast %168 : tensor<1x64xi1> -> tensor<64x64xi1>
            %170 = tt.load %arg23, %169, %cst_8 : tensor<64x64x!tt.ptr<f16>>
            %171 = arith.cmpi slt, %arg25, %cst_17 : tensor<64xi32>
            %172 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
            %173 = tt.addptr %172, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
            %174 = tt.load %173, %171 : tensor<64x!tt.ptr<f32>>
            %175 = arith.cmpf oeq, %174, %cst_16 : tensor<64xf32>
            %176 = arith.select %175, %cst_15, %174 : tensor<64xi1>, tensor<64xf32>
            %177 = tt.dot %30, %170, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %178 = arith.mulf %177, %cst_13 : tensor<64x64xf32>
            %179 = arith.mulf %178, %cst_3 : tensor<64x64xf32>
            %180 = arith.mulf %179, %cst_2 : tensor<64x64xf32>
            %181 = tt.expand_dims %176 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
            %182 = tt.broadcast %181 : tensor<1x64xf32> -> tensor<64x64xf32>
            %183 = arith.subf %180, %182 : tensor<64x64xf32>
            %184 = math.exp2 %183 : tensor<64x64xf32>
            %185 = tt.expand_dims %arg25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
            %186 = arith.cmpi slt, %185, %cst_12 : tensor<64x1xi32>
            %187 = tt.broadcast %186 : tensor<64x1xi1> -> tensor<64x64xi1>
            %188 = tt.load %arg24, %187, %cst_8 : tensor<64x64x!tt.ptr<f16>>
            %189 = arith.truncf %184 : tensor<64x64xf32> to tensor<64x64xf16>
            %190 = tt.dot %189, %188, %arg22, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %191 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
            %192 = tt.addptr %191, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
            %193 = tt.load %192, %171 : tensor<64x!tt.ptr<f32>>
            %194 = tt.trans %188 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
            %195 = tt.dot %35, %194, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %196 = tt.expand_dims %193 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
            %197 = tt.broadcast %196 : tensor<1x64xf32> -> tensor<64x64xf32>
            %198 = arith.subf %195, %197 : tensor<64x64xf32>
            %199 = arith.mulf %184, %198 : tensor<64x64xf32>
            %200 = arith.mulf %199, %cst_3 : tensor<64x64xf32>
            %201 = arith.truncf %200 : tensor<64x64xf32> to tensor<64x64xf16>
            %202 = tt.trans %170 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
            %203 = tt.dot %201, %202, %arg21, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %204 = arith.divsi %arg20, %c2_i32 : i32
            %205 = tt.addptr %74, %204 : !tt.ptr<i32>, i32
            %206 = tt.load %205 evictionPolicy = evict_last : !tt.ptr<i32>
            %207 = arith.addi %204, %c1_i32 : i32
            %208 = arith.cmpi slt, %207, %78 : i32
            %209 = tt.addptr %205, %c1_i32 : !tt.ptr<i32>, i32
            %210 = tt.load %209, %208 evictionPolicy = evict_last : !tt.ptr<i32>
            %211 = arith.addi %arg20, %c1_i32 : i32
            %212 = arith.remsi %211, %c2_i32 : i32
            %213 = arith.cmpi eq, %212, %c0_i32 : i32
            %214 = arith.subi %210, %206 : i32
            %215 = arith.muli %214, %c128_i32 : i32
            %216 = arith.subi %215, %c64_i32 : i32
            %217 = arith.extui %213 : i1 to i32
            %218 = arith.muli %216, %217 : i32
            %219 = arith.subi %c1_i32, %217 : i32
            %220 = arith.muli %219, %c64_i32 : i32
            %221 = arith.addi %218, %220 : i32
            %222 = arith.muli %221, %c64_i32 : i32
            %223 = tt.splat %222 : i32 -> tensor<64x64xi32>
            %224 = tt.addptr %arg23, %223 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
            %225 = tt.addptr %arg24, %223 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
            %226 = tt.splat %221 : i32 -> tensor<64xi32>
            %227 = arith.addi %arg25, %226 : tensor<64xi32>
            scf.yield %203, %190, %224, %225, %227 : tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>
          }
          %122 = tt.expand_dims %121#4 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
          %123 = arith.cmpi slt, %122, %cst_1 : tensor<1x64xi32>
          %124 = tt.broadcast %123 : tensor<1x64xi1> -> tensor<64x64xi1>
          %125 = tt.load %121#2, %124, %cst_8 : tensor<64x64x!tt.ptr<f16>>
          %126 = arith.cmpi slt, %121#4, %cst_17 : tensor<64xi32>
          %127 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
          %128 = tt.addptr %127, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
          %129 = tt.load %128, %126 : tensor<64x!tt.ptr<f32>>
          %130 = arith.cmpf oeq, %129, %cst_16 : tensor<64xf32>
          %131 = arith.select %130, %cst_15, %129 : tensor<64xi1>, tensor<64xf32>
          %132 = tt.dot %30, %125, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %133 = arith.mulf %132, %cst_13 : tensor<64x64xf32>
          %134 = arith.mulf %133, %cst_3 : tensor<64x64xf32>
          %135 = arith.select %29, %134, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32>
          %136 = arith.select %28, %cst, %cst_0 : tensor<64x1xi1>, tensor<64x1xi1>
          %137 = tt.broadcast %136 : tensor<64x1xi1> -> tensor<64x64xi1>
          %138 = arith.select %137, %135, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32>
          %139 = arith.mulf %138, %cst_2 : tensor<64x64xf32>
          %140 = tt.expand_dims %131 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
          %141 = tt.broadcast %140 : tensor<1x64xf32> -> tensor<64x64xf32>
          %142 = arith.subf %139, %141 : tensor<64x64xf32>
          %143 = math.exp2 %142 : tensor<64x64xf32>
          %144 = tt.expand_dims %121#4 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
          %145 = arith.cmpi slt, %144, %cst_12 : tensor<64x1xi32>
          %146 = tt.broadcast %145 : tensor<64x1xi1> -> tensor<64x64xi1>
          %147 = tt.load %121#3, %146, %cst_8 : tensor<64x64x!tt.ptr<f16>>
          %148 = arith.truncf %143 : tensor<64x64xf32> to tensor<64x64xf16>
          %149 = tt.dot %148, %147, %121#1, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %150 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
          %151 = tt.addptr %150, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
          %152 = tt.load %151, %126 : tensor<64x!tt.ptr<f32>>
          %153 = tt.trans %147 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
          %154 = tt.dot %35, %153, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %155 = tt.expand_dims %152 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
          %156 = tt.broadcast %155 : tensor<1x64xf32> -> tensor<64x64xf32>
          %157 = arith.subf %154, %156 : tensor<64x64xf32>
          %158 = arith.mulf %143, %157 : tensor<64x64xf32>
          %159 = arith.mulf %158, %cst_3 : tensor<64x64xf32>
          %160 = arith.select %29, %159, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32>
          %161 = arith.select %28, %136, %cst_0 : tensor<64x1xi1>, tensor<64x1xi1>
          %162 = tt.broadcast %161 : tensor<64x1xi1> -> tensor<64x64xi1>
          %163 = arith.select %162, %160, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32>
          %164 = arith.truncf %163 : tensor<64x64xf32> to tensor<64x64xf16>
          %165 = tt.trans %125 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
          %166 = tt.dot %164, %165, %121#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          scf.yield %166, %149 : tensor<64x64xf32>, tensor<64x64xf32>
        } else {
          scf.yield %arg18, %arg19 : tensor<64x64xf32>, tensor<64x64xf32>
        }
        %99 = tt.addptr %arg15, %73 : !tt.ptr<i32>, i32
        %100 = tt.load %99 : !tt.ptr<i32>
        %101 = arith.muli %100, %c128_i32 : i32
        %102 = tt.addptr %arg14, %73 : !tt.ptr<i32>, i32
        %103 = tt.load %102 : !tt.ptr<i32>
        %104 = tt.splat %101 : i32 -> tensor<64xi32>
        %105 = arith.addi %104, %14 : tensor<64xi32>
        %106 = tt.expand_dims %105 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
        %107 = arith.muli %106, %cst_11 : tensor<1x64xi32>
        %108 = tt.addptr %83, %107 : tensor<1x64x!tt.ptr<f16>>, tensor<1x64xi32>
        %109 = tt.broadcast %108 : tensor<1x64x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
        %110 = tt.addptr %109, %87 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
        %111 = tt.expand_dims %105 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
        %112 = arith.muli %111, %cst_14 : tensor<64x1xi32>
        %113 = tt.addptr %91, %112 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
        %114 = tt.broadcast %113 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
        %115 = tt.addptr %114, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
        %116 = arith.muli %103, %c2_i32 : i32
        %117 = arith.minsi %116, %c1_i32 : i32
        %118 = arith.cmpi sge, %117, %c1_i32 : i32
        %119:2 = scf.if %118 -> (tensor<64x64xf32>, tensor<64x64xf32>) {
          %120 = arith.subi %117, %c1_i32 : i32
          %121:5 = scf.for %arg20 = %c0_i32 to %120 step %c1_i32 iter_args(%arg21 = %98#0, %arg22 = %98#1, %arg23 = %110, %arg24 = %115, %arg25 = %105) -> (tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>)  : i32 {
            %161 = tt.expand_dims %arg25 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
            %162 = arith.cmpi slt, %161, %cst_1 : tensor<1x64xi32>
            %163 = tt.broadcast %162 : tensor<1x64xi1> -> tensor<64x64xi1>
            %164 = tt.load %arg23, %163, %cst_8 : tensor<64x64x!tt.ptr<f16>>
            %165 = arith.cmpi slt, %arg25, %cst_17 : tensor<64xi32>
            %166 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
            %167 = tt.addptr %166, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
            %168 = tt.load %167, %165 : tensor<64x!tt.ptr<f32>>
            %169 = arith.cmpf oeq, %168, %cst_16 : tensor<64xf32>
            %170 = arith.select %169, %cst_15, %168 : tensor<64xi1>, tensor<64xf32>
            %171 = tt.dot %30, %164, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %172 = arith.mulf %171, %cst_13 : tensor<64x64xf32>
            %173 = arith.mulf %172, %cst_3 : tensor<64x64xf32>
            %174 = arith.mulf %173, %cst_2 : tensor<64x64xf32>
            %175 = tt.expand_dims %170 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
            %176 = tt.broadcast %175 : tensor<1x64xf32> -> tensor<64x64xf32>
            %177 = arith.subf %174, %176 : tensor<64x64xf32>
            %178 = math.exp2 %177 : tensor<64x64xf32>
            %179 = tt.expand_dims %arg25 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
            %180 = arith.cmpi slt, %179, %cst_12 : tensor<64x1xi32>
            %181 = tt.broadcast %180 : tensor<64x1xi1> -> tensor<64x64xi1>
            %182 = tt.load %arg24, %181, %cst_8 : tensor<64x64x!tt.ptr<f16>>
            %183 = arith.truncf %178 : tensor<64x64xf32> to tensor<64x64xf16>
            %184 = tt.dot %183, %182, %arg22, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %185 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
            %186 = tt.addptr %185, %arg25 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
            %187 = tt.load %186, %165 : tensor<64x!tt.ptr<f32>>
            %188 = tt.trans %182 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
            %189 = tt.dot %35, %188, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %190 = tt.expand_dims %187 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
            %191 = tt.broadcast %190 : tensor<1x64xf32> -> tensor<64x64xf32>
            %192 = arith.subf %189, %191 : tensor<64x64xf32>
            %193 = arith.mulf %178, %192 : tensor<64x64xf32>
            %194 = arith.mulf %193, %cst_3 : tensor<64x64xf32>
            %195 = arith.truncf %194 : tensor<64x64xf32> to tensor<64x64xf16>
            %196 = tt.trans %164 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
            %197 = tt.dot %195, %196, %arg21, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
            %198 = arith.divsi %arg20, %c2_i32 : i32
            %199 = tt.addptr %99, %198 : !tt.ptr<i32>, i32
            %200 = tt.load %199 evictionPolicy = evict_last : !tt.ptr<i32>
            %201 = arith.addi %198, %c1_i32 : i32
            %202 = arith.cmpi slt, %201, %103 : i32
            %203 = tt.addptr %199, %c1_i32 : !tt.ptr<i32>, i32
            %204 = tt.load %203, %202 evictionPolicy = evict_last : !tt.ptr<i32>
            %205 = arith.addi %arg20, %c1_i32 : i32
            %206 = arith.remsi %205, %c2_i32 : i32
            %207 = arith.cmpi eq, %206, %c0_i32 : i32
            %208 = arith.subi %204, %200 : i32
            %209 = arith.muli %208, %c128_i32 : i32
            %210 = arith.subi %209, %c64_i32 : i32
            %211 = arith.extui %207 : i1 to i32
            %212 = arith.muli %210, %211 : i32
            %213 = arith.subi %c1_i32, %211 : i32
            %214 = arith.muli %213, %c64_i32 : i32
            %215 = arith.addi %212, %214 : i32
            %216 = arith.muli %215, %c64_i32 : i32
            %217 = tt.splat %216 : i32 -> tensor<64x64xi32>
            %218 = tt.addptr %arg23, %217 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
            %219 = tt.addptr %arg24, %217 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
            %220 = tt.splat %215 : i32 -> tensor<64xi32>
            %221 = arith.addi %arg25, %220 : tensor<64xi32>
            scf.yield %197, %184, %218, %219, %221 : tensor<64x64xf32>, tensor<64x64xf32>, tensor<64x64x!tt.ptr<f16>>, tensor<64x64x!tt.ptr<f16>>, tensor<64xi32>
          }
          %122 = tt.expand_dims %121#4 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
          %123 = arith.cmpi slt, %122, %cst_1 : tensor<1x64xi32>
          %124 = tt.broadcast %123 : tensor<1x64xi1> -> tensor<64x64xi1>
          %125 = tt.load %121#2, %124, %cst_8 : tensor<64x64x!tt.ptr<f16>>
          %126 = arith.cmpi slt, %121#4, %cst_17 : tensor<64xi32>
          %127 = tt.splat %67 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
          %128 = tt.addptr %127, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
          %129 = tt.load %128, %126 : tensor<64x!tt.ptr<f32>>
          %130 = arith.cmpf oeq, %129, %cst_16 : tensor<64xf32>
          %131 = arith.select %130, %cst_15, %129 : tensor<64xi1>, tensor<64xf32>
          %132 = tt.dot %30, %125, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %133 = arith.mulf %132, %cst_13 : tensor<64x64xf32>
          %134 = arith.mulf %133, %cst_3 : tensor<64x64xf32>
          %135 = arith.select %29, %134, %cst_6 : tensor<64x64xi1>, tensor<64x64xf32>
          %136 = arith.mulf %135, %cst_2 : tensor<64x64xf32>
          %137 = tt.expand_dims %131 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
          %138 = tt.broadcast %137 : tensor<1x64xf32> -> tensor<64x64xf32>
          %139 = arith.subf %136, %138 : tensor<64x64xf32>
          %140 = math.exp2 %139 : tensor<64x64xf32>
          %141 = tt.expand_dims %121#4 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
          %142 = arith.cmpi slt, %141, %cst_12 : tensor<64x1xi32>
          %143 = tt.broadcast %142 : tensor<64x1xi1> -> tensor<64x64xi1>
          %144 = tt.load %121#3, %143, %cst_8 : tensor<64x64x!tt.ptr<f16>>
          %145 = arith.truncf %140 : tensor<64x64xf32> to tensor<64x64xf16>
          %146 = tt.dot %145, %144, %121#1, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %147 = tt.splat %68 : !tt.ptr<f32> -> tensor<64x!tt.ptr<f32>>
          %148 = tt.addptr %147, %121#4 : tensor<64x!tt.ptr<f32>>, tensor<64xi32>
          %149 = tt.load %148, %126 : tensor<64x!tt.ptr<f32>>
          %150 = tt.trans %144 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
          %151 = tt.dot %35, %150, %cst_9, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          %152 = tt.expand_dims %149 {axis = 0 : i32} : tensor<64xf32> -> tensor<1x64xf32>
          %153 = tt.broadcast %152 : tensor<1x64xf32> -> tensor<64x64xf32>
          %154 = arith.subf %151, %153 : tensor<64x64xf32>
          %155 = arith.mulf %140, %154 : tensor<64x64xf32>
          %156 = arith.mulf %155, %cst_3 : tensor<64x64xf32>
          %157 = arith.select %29, %156, %cst_9 : tensor<64x64xi1>, tensor<64x64xf32>
          %158 = arith.truncf %157 : tensor<64x64xf32> to tensor<64x64xf16>
          %159 = tt.trans %125 {order = array<i32: 1, 0>} : tensor<64x64xf16> -> tensor<64x64xf16>
          %160 = tt.dot %158, %159, %121#0, inputPrecision = tf32 : tensor<64x64xf16> * tensor<64x64xf16> -> tensor<64x64xf32>
          scf.yield %160, %146 : tensor<64x64xf32>, tensor<64x64xf32>
        } else {
          scf.yield %98#0, %98#1 : tensor<64x64xf32>, tensor<64x64xf32>
        }
        scf.yield %119#0, %119#1 : tensor<64x64xf32>, tensor<64x64xf32>
      }
      %37 = tt.splat %13 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>>
      %38 = tt.addptr %37, %21 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
      %39 = tt.broadcast %38 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x64x!tt.ptr<f16>>
      %40 = tt.addptr %39, %26 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %41 = arith.cmpi slt, %24, %cst_11 : tensor<1x64xi32>
      %42 = tt.broadcast %41 : tensor<1x64xi1> -> tensor<64x64xi1>
      %43 = arith.andi %29, %42 : tensor<64x64xi1>
      %44 = arith.truncf %36#1 : tensor<64x64xf32> to tensor<64x64xf16>
      tt.store %40, %44, %43 : tensor<64x64x!tt.ptr<f16>>
      %45 = arith.mulf %36#0, %cst_13 : tensor<64x64xf32>
      %46 = tt.broadcast %21 : tensor<64x1xi32> -> tensor<64x64xi32>
      %47 = arith.addi %26, %46 : tensor<64x64xi32>
      %48 = tt.splat %4 : i32 -> tensor<64x64xi32>
      %49 = arith.addi %47, %48 : tensor<64x64xi32>
      %50 = tt.splat %8 : i32 -> tensor<64x64xi32>
      %51 = arith.addi %49, %50 : tensor<64x64xi32>
      %52 = tt.splat %arg16 : !tt.ptr<f16> -> tensor<64x64x!tt.ptr<f16>>
      %53 = tt.addptr %52, %51 : tensor<64x64x!tt.ptr<f16>>, tensor<64x64xi32>
      %54 = arith.truncf %45 : tensor<64x64xf32> to tensor<64x64xf16>
      tt.store %53, %54, %29 : tensor<64x64x!tt.ptr<f16>>
    }
    tt.return
  }
}

{-#
  external_resources: {
    mlir_reproducer: {
      pipeline: "builtin.module(convert-triton-to-tritongpu{enable-source-remat=false num-ctas=1 num-warps=4 target=cuda:100 threads-per-warp=32}, tritongpu-coalesce, tritongpu-F32DotTC, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, triton-nvidia-optimize-descriptor-encoding, triton-loop-aware-cse, tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, tritongpu-optimize-accumulator-init, tritongpu-hoist-tmem-alloc, tritongpu-promote-lhs-to-tmem, tritongpu-assign-latencies{num-stages=3}, tritongpu-schedule-loops, tritongpu-automatic-warp-specialization{num-stages=3}, tritongpu-pipeline{dump-intermediate-steps=false num-stages=3}, tritongpu-combine-tensor-select-and-if, triton-nvidia-gpu-remove-tmem-tokens, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-loop-aware-cse, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-coalesce-async-copy, triton-nvidia-optimize-tmem-layouts, tritongpu-remove-layout-conversions, triton-nvidia-interleave-tmem, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, triton-loop-aware-cse, symbol-dce, triton-nvidia-tma-lowering, triton-nvidia-gpu-fence-insertion{compute-capability=90}, sccp, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
      disable_threading: false,
      verify_each: true
    }
  }
#-}
/tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline
/tmp/tmp0yiz3c94/p4/cp4ahrfnz4obsvzgftux7dg3aszopks2jljnoaz3eowlooi2scem.py:18:0: note: Pipeline failed while executing [`TritonGPUHoistTMEMAlloc` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
Triton compilation failed: triton_tem_fused_zeros_1
def triton_tem_fused_zeros_1(arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0):
    PRESCALE_QK : tl.constexpr = False
    ROWS_GUARANTEED_SAFE : tl.constexpr = False
    BLOCKS_ARE_CONTIGUOUS : tl.constexpr = False
    WRITE_DQ : tl.constexpr = True
    OUTPUT_LOGSUMEXP : tl.constexpr = True
    FLOAT32_PRECISION : tl.constexpr = 'tf32'
    IS_DIVISIBLE : tl.constexpr = False
    SM_SCALE : tl.constexpr = 0.125
    GQA_SHARED_HEADS : tl.constexpr = 4
    HAS_FULL_BLOCKS : tl.constexpr = True
    QK_HEAD_DIM : tl.constexpr = 64
    QK_HEAD_DIM_ROUNDED : tl.constexpr = 64
    V_HEAD_DIM : tl.constexpr = 64
    V_HEAD_DIM_ROUNDED : tl.constexpr = 64
    SAFE_HEAD_DIM : tl.constexpr = True
    BLOCK_M1 : tl.constexpr = 64
    BLOCK_N1 : tl.constexpr = 64
    BLOCK_M2 : tl.constexpr = 64
    BLOCK_N2 : tl.constexpr = 64
    SPARSE_Q_BLOCK_SIZE : tl.constexpr = 128
    SPARSE_KV_BLOCK_SIZE : tl.constexpr = 128
    Q = arg_Q
    K = arg_K
    V = arg_V
    LSE = arg_LSE
    DELTA = arg_DELTA
    DO = arg_DO
    DQ = arg_DQ
    DV = arg_DV
    KV_NUM_BLKS = arg_KV_NUM_BLKS
    KV_IDX = arg_KV_IDX
    Q_NUM_BLKS = arg_Q_NUM_BLKS
    Q_IDX = arg_Q_IDX
    FULL_KV_NUM_BLKS = arg_FULL_KV_NUM_BLKS
    FULL_KV_IDX = arg_FULL_KV_IDX
    FULL_Q_NUM_BLKS = arg_FULL_Q_NUM_BLKS
    FULL_Q_IDX = arg_FULL_Q_IDX

    # Sub notation for this kernel:
    #
    # Q: Query, K: Key, V: Value
    # LSE: logsumexp (logsumexp is always stored in fp32 regardless of the input dtype)
    # DELTA: Precomputed sum(OUT*DO, axis=-1)
    # DO: Derivative of Output, DQ: Derivative of Query, DV: Derivative of Value
    # DK: Derivative of Key; it is written via the store_output call due to some limitations with
    # inductor codegen
    # M: Number of queries, N: Number of keys/values
    # QK_HEAD_DIM: The dimension of the query and key embeddings
    # V_HEAD_DIM: The dimension of the value embeddings
    # z: Batch size, h: Number of heads, m: Number of queries or keys/values, d: Head dim
    # GQA_SHARED_HEADS: number of query heads sharing one kv head in GQA setups.
    # (Modifiable) Performance tuning options
    # BLOCK_M1: when calculating DK & DV, iterate over BLOCK_M1 across the seqlen dim of Q in each thread block.
    # BLOCK_N1: when calculating DK & DV, the thread block size across the seqlen dim of K/V.
    # BLOCK_M2: when calculating DQ, the thread block size across the seqlen dim of Q.
    # BLOCK_N2: when calculating DQ, iterate over BLOCK_N2 across the seqlen dim of K/V in each thread block.
    #
    # The following FULL_* and PARTIAL_* are defined in the block sparse mask grid, rather than the thread block grid.
    # KV_NUM_BLKS: The number of KV blocks (that may or may not require masking) for each query.
    # KV_IDX: The indices of KV blocks (that may or may not require masking) for each query.
    # Q_NUM_BLKS: The number of Q blocks (that may or may not require masking) for each key/value.
    # Q_IDX: The indices of Q blocks (that may or may not require masking) for each key/value.
    # FULL_KV_NUM_BLKS: The number of fully unmasked KV blocks (so we don't need masking) for each query.
    # FULL_KV_IDX: The indices of fully unmasked KV blocks (so we don't need masking) for each query.
    # FULL_Q_NUM_BLKS: The number of fully unmasked Q blocks (so we don't need masking) for each key/value.
    # FULL_Q_IDX: The indices of fully unmasked Q blocks (so we don't need masking) for each key/value.

    # The below are kernel options that can be applied for certain score_mods,
    # or involve a numerics vs. perf tradeoff
    # PRESCALE_QK: Whether to pre-scale QK by 1/sqrt(d) and change of base. This has
    # about 20% more numerical error, but is slightly faster.

    # Define strides of inputs
    stride_qz, stride_qh, stride_qm, stride_qd = 32768, 2048, 64, 1
    stride_kz, stride_kh, stride_kn, stride_kd = 65536, 16384, 64, 1
    stride_vz, stride_vh, stride_vn, stride_vd = 65536, 16384, 64, 1
    stride_doz, stride_doh, stride_dom, stride_dod = 32768, 2048, 64, 1

    stride_dqz, stride_dqh, stride_dqm, stride_dqd = 32768, 2048, 64, 1
    stride_dvz, stride_dvh, stride_dvm, stride_dvd = 65536, 16384, 64, 1

    ZQ = 2
    HQ = 16
    HKV = 4
    Q_LEN = 32
    ZKV = 2
    KV_LEN = 256

    MATMUL_PRECISION = Q.dtype.element_ty

    pid = tl.program_id(0)
    NUM_KV_BLOCKS = tl.cdiv(KV_LEN, BLOCK_N1)
    NUM_Q_BLOCKS = tl.cdiv(Q_LEN, BLOCK_M2)

    off_zq = tl.program_id(1) # q batch idx
    off_hkv = tl.program_id(2) # kv head idx
    off_zkv = off_zq % ZKV # kv batch idx

    SPARSE_Z = 2
    SPARSE_HQ = 16

    sparse_idx_z = off_zq % SPARSE_Z

    k_adj = (stride_kh * off_hkv + stride_kz * off_zkv).to(tl.int64)
    v_adj = (stride_vh * off_hkv + stride_vz * off_zkv).to(tl.int64)
    # first compute broadcasted dv of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM]
    # then reduce to dv of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM]
    dv_adj = (stride_dvh * off_hkv + stride_dvz * off_zq).to(tl.int64)

    # offset K, V, DV pointers for batch/kv-head
    K += k_adj
    V += v_adj
    DV += dv_adj

    RCP_LN2 = 1.44269504
    offs_k = tl.arange(0, QK_HEAD_DIM_ROUNDED)
    offs_v = tl.arange(0, V_HEAD_DIM_ROUNDED)

    if pid >= NUM_KV_BLOCKS:
        off_pid = pid - NUM_KV_BLOCKS
        # THIS BLOCK DOES DQ
        SPARSE_Q_MULTIPLE = (SPARSE_Q_BLOCK_SIZE // BLOCK_M2)
        SPARSE_KV_MULTIPLE = (SPARSE_KV_BLOCK_SIZE // BLOCK_N2)
        off_hq2 = off_pid // NUM_Q_BLOCKS + off_hkv * GQA_SHARED_HEADS
        start_m2_block = off_pid % NUM_Q_BLOCKS
        off_pid_mask = start_m2_block // SPARSE_Q_MULTIPLE
        stride_kv_num_blks_h = 1
        stride_kv_idx_h = 2
        stride_kv_idx_m = 2

        sparse_idx_hq2 = off_hq2 % SPARSE_HQ
        sparse_hz_offset = sparse_idx_z * SPARSE_HQ + sparse_idx_hq2

        sparse_kv_num_blks_offset = sparse_hz_offset * stride_kv_num_blks_h + off_pid_mask
        sparse_kv_idx_offset = sparse_hz_offset * stride_kv_idx_h + off_pid_mask * stride_kv_idx_m  # noqa: B950

        # Offset Q, DQ, DO, DELTA & LSE. These inputs are offset by query heads.
        q_adj2 = (stride_qh * off_hq2 + stride_qz * off_zq).to(tl.int64)
        do_adj2 = (stride_doh * off_hq2 + stride_doz * off_zq).to(tl.int64)
        dq_adj2 = (stride_dqh * off_hq2 + stride_dqz * off_zq).to(tl.int64)
        off_chz2 = ((off_zq * HQ + off_hq2) * Q_LEN).to(tl.int64)

        Q2 = Q + q_adj2
        DO2 = DO + do_adj2
        # TODO: This does not work if DQ is not the same layout as Q (for example,
        # if Q is broadcasted)
        DQ2 = DQ + dq_adj2
        LSE2 = LSE + off_chz2
        DELTA2 = DELTA + off_chz2

        # dq = tl.zeros([BLOCK_M2, QK_HEAD_DIM], dtype=tl.float32)
        dq = tl.zeros([BLOCK_M2, QK_HEAD_DIM_ROUNDED], dtype=tl.float32)

        start_m2 = start_m2_block * BLOCK_M2
        offs_m2 = start_m2 + tl.arange(0, BLOCK_M2)

        # load Q and do: they stay in SRAM throughout the inner loop.
        q = load_checked_2d(Q2, offs_m2, offs_k, stride_qm, stride_qd, IS_DIVISIBLE, SAFE_HEAD_DIM, Q_LEN, QK_HEAD_DIM)
        do = load_checked_2d(DO2, offs_m2, offs_v, stride_dom, stride_dod, IS_DIVISIBLE, SAFE_HEAD_DIM, Q_LEN, V_HEAD_DIM)

        if PRESCALE_QK:
            q = (q * SM_SCALE * RCP_LN2).to(MATMUL_PRECISION)

        if IS_DIVISIBLE:
            Di = tl.load(DELTA2 + offs_m2)
            lse = tl.load(LSE2 + offs_m2)
        else:
            Di = tl.load(DELTA2 + offs_m2, mask=offs_m2 < Q_LEN)
            lse = tl.load(LSE2 + offs_m2, mask=offs_m2 < Q_LEN)
        lse = tl.where(lse == -float("inf"), 0.0, lse)
        lse = lse[:, None]

        # ~~~~~~~~~~~ partially masked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        # KV_IDX and KV_NUM_BLKS are always contiguous.
        kv_indices = KV_IDX + sparse_kv_idx_offset
        kv_start = tl.load(kv_indices) * SPARSE_KV_BLOCK_SIZE # first kv block we're loading
        sparse_kv_num_blocks = tl.load(KV_NUM_BLKS + sparse_kv_num_blks_offset)

        offs_n2 = kv_start + tl.arange(0, BLOCK_N2)
        dq = bwd_dq_inner(
            arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0,
            K, V,
            dq, q, do, Di, lse,
            off_zq, off_hq2, offs_m2, offs_n2,
            stride_kn, stride_kd, stride_vn, stride_vd,
            kv_indices, sparse_kv_num_blocks,
            MATMUL_PRECISION,
            IS_FULL_BLOCKS=False,
        )

        if HAS_FULL_BLOCKS:
            # ~~~~~~~~~~~ fully unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            # FULL_KV_IDX and FULL_KV_NUM_BLKS are always contiguous.
            kv_indices = FULL_KV_IDX + sparse_kv_idx_offset
            kv_start = tl.load(kv_indices) * SPARSE_KV_BLOCK_SIZE # first kv block we're loading
            sparse_kv_num_blocks = tl.load(FULL_KV_NUM_BLKS + sparse_kv_num_blks_offset)

            offs_n2 = kv_start + tl.arange(0, BLOCK_N2)
            dq = bwd_dq_inner(
                arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0,
                K, V,
                dq, q, do, Di, lse,
                off_zq, off_hq2, offs_m2, offs_n2,
                stride_kn, stride_kd, stride_vn, stride_vd,
                kv_indices, sparse_kv_num_blocks,
                MATMUL_PRECISION,
                IS_FULL_BLOCKS=True,
            )

        # Write back dQ.
        dq_ptrs = DQ2 + offs_m2[:, None] * stride_dqm + offs_k[None, :] * stride_dqd
        dq *= SM_SCALE
        if IS_DIVISIBLE and SAFE_HEAD_DIM:
            tl.store(dq_ptrs, dq)
        else:
            tl.store(dq_ptrs, dq, mask=(offs_m2[:, None] < Q_LEN) & (offs_k[None, :] < QK_HEAD_DIM))
    else:
        # THIS BLOCK DOES DK & DV
        SPARSE_Q_MULTIPLE = (SPARSE_Q_BLOCK_SIZE // BLOCK_M1)
        SPARSE_KV_MULTIPLE = (SPARSE_KV_BLOCK_SIZE // BLOCK_N1)

        pid_mask = pid // SPARSE_KV_MULTIPLE

        stride_q_num_blks_h = 2
        stride_q_idx_h = 2
        stride_q_idx_n = 1

        dv = tl.zeros([BLOCK_N1, V_HEAD_DIM_ROUNDED], dtype=tl.float32)
        dk = tl.zeros([BLOCK_N1, QK_HEAD_DIM_ROUNDED], dtype=tl.float32)

        start_n1 = pid * BLOCK_N1
        offs_n1 = start_n1 + tl.arange(0, BLOCK_N1)

        # load K and V: they stay in SRAM throughout the inner loop.
        k = load_checked_2d(K, offs_n1, offs_k, stride_kn, stride_kd, IS_DIVISIBLE, SAFE_HEAD_DIM, KV_LEN, QK_HEAD_DIM)
        v = load_checked_2d(V, offs_n1, offs_v, stride_vn, stride_vd, IS_DIVISIBLE, SAFE_HEAD_DIM, KV_LEN, V_HEAD_DIM)

        if PRESCALE_QK:
            k = (k * SM_SCALE * RCP_LN2).to(MATMUL_PRECISION)

        for off_g in range(0, GQA_SHARED_HEADS):
            off_hq1 = off_hkv * GQA_SHARED_HEADS + off_g

            # Offset Q, DQ, DO, DELTA & LSE. These inputs are offsetted by query heads.
            q_adj1 = (stride_qh * off_hq1 + stride_qz * off_zq).to(tl.int64)
            do_adj1 = (stride_doh * off_hq1 + stride_doz * off_zq).to(tl.int64)
            dq_adj1 = (stride_dqh * off_hq1 + stride_dqz * off_zq).to(tl.int64)
            off_chz1 = ((off_zq * HQ + off_hq1) * Q_LEN).to(tl.int64)

            Q1 = Q + q_adj1
            DO1 = DO + do_adj1
            # TODO: This does not work if DQ is not the same layout as Q (for example,
            # if Q is broadcasted)
            LSE1 = LSE + off_chz1
            DELTA1 = DELTA + off_chz1

            sparse_idx_hq1 = off_hq1 % SPARSE_HQ
            sparse_hz_offset = sparse_idx_z * SPARSE_HQ + sparse_idx_hq1

            sparse_q_num_blks_offset = sparse_hz_offset * stride_q_num_blks_h + pid_mask
            sparse_q_idx_offset = sparse_hz_offset * stride_q_idx_h + pid_mask * stride_q_idx_n  # noqa: B950

            # ~~~~~~~~~~~~~~~ fully unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            # Q_IDX and Q_NUM_BLKS are always contiguous.
            q_indices = Q_IDX + sparse_q_idx_offset
            q_start = tl.load(q_indices) * SPARSE_Q_BLOCK_SIZE # first q block we're loading
            sparse_q_num_blocks = tl.load(Q_NUM_BLKS + sparse_q_num_blks_offset)

            offs_m1 = q_start + tl.arange(0, BLOCK_M1)
            dk, dv = bwd_dkdv_inner(
                arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0,
                Q1, DO1, DELTA1, LSE1,
                dk, dv, k, v,
                off_zq, off_hq1, offs_n1, offs_m1,
                stride_qm, stride_qd, stride_dom, stride_dod,
                q_indices, sparse_q_num_blocks,
                MATMUL_PRECISION,
                IS_FULL_BLOCKS=False,
            )

            if HAS_FULL_BLOCKS:
                # ~~~~~~~~~~~~~~~ fully unmasked blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                # FULL_Q_IDX and FULL_Q_NUM_BLKS are always contiguous.
                q_indices = FULL_Q_IDX + sparse_q_idx_offset
                q_start = tl.load(q_indices) * SPARSE_Q_BLOCK_SIZE # first q block we're loading
                sparse_q_num_blocks = tl.load(FULL_Q_NUM_BLKS + sparse_q_num_blks_offset)

                offs_m1 = q_start + tl.arange(0, BLOCK_M1)
                dk, dv = bwd_dkdv_inner(
                    arg_Q, arg_K, arg_V, arg_LSE, arg_DELTA, arg_DO, arg_DQ, arg_DV, arg_KV_NUM_BLKS, arg_KV_IDX, arg_Q_NUM_BLKS, arg_Q_IDX, arg_FULL_KV_NUM_BLKS, arg_FULL_KV_IDX, arg_FULL_Q_NUM_BLKS, arg_FULL_Q_IDX, out_ptr0,
                    Q1, DO1, DELTA1, LSE1,
                    dk, dv, k, v,
                    off_zq, off_hq1, offs_n1, offs_m1,
                    stride_qm, stride_qd, stride_dom, stride_dod,
                    q_indices, sparse_q_num_blocks,
                    MATMUL_PRECISION,
                    IS_FULL_BLOCKS=True,
                )

        # Write back dV and dK.
        dv_ptrs = DV + offs_n1[:, None] * stride_dvm + offs_v[None, :] * stride_dvd

        index_n = offs_n1[:, None]
        index_k = offs_k[None, :]
        index_v = offs_v[None, :]

        if IS_DIVISIBLE and SAFE_HEAD_DIM:
            tl.store(dv_ptrs, dv)
        else:
            tl.store(dv_ptrs, dv, mask=(index_n < KV_LEN) & (index_v < V_HEAD_DIM))

        dk *= SM_SCALE

        if SAFE_HEAD_DIM:
            mask = index_n < KV_LEN
        else:
            mask = (index_n < KV_LEN) & (index_k < QK_HEAD_DIM)

        # first compute broadcasted dk of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM]
        # then reduce to dk of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM]
        xindex = index_k + 64*index_n + 16384*off_hkv + 65536*off_zq
        tl.store(out_ptr0 + (tl.broadcast_to(xindex, dk.shape)), dk, mask)

metadata: {'signature': {'arg_Q': '*fp16', 'arg_K': '*fp16', 'arg_V': '*fp16', 'arg_LSE': '*fp32', 'arg_DELTA': '*fp32', 'arg_DO': '*fp16', 'arg_DQ': '*fp16', 'arg_DV': '*fp16', 'arg_KV_NUM_BLKS': '*i32', 'arg_KV_IDX': '*i32', 'arg_Q_NUM_BLKS': '*i32', 'arg_Q_IDX': '*i32', 'arg_FULL_KV_NUM_BLKS': '*i32', 'arg_FULL_KV_IDX': '*i32', 'arg_FULL_Q_NUM_BLKS': '*i32', 'arg_FULL_Q_IDX': '*i32', 'out_ptr0': '*fp16'}, 'device': 0, 'constants': {}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (4,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (9,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (14,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]]}], 'device_type': 'cuda', 'num_warps': 4, 'num_stages': 3, 'debug': True, 'cc': 100}
Traceback (most recent call last):
  File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 748, in _precompile_config
    binary = triton.compile(*compile_args, **compile_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/compiler/compiler.py", line 359, in compile
    next_module = compile_ir(module, metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 456, in <lambda>
    stages["ttgir"] = lambda src, metadata: self.make_ttgir(src, metadata, options, capability)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 298, in make_ttgir
    pm.run(mod)
RuntimeError: PassManager::run failed
frames [('total', 3), ('ok', 3)]
inline_call []
stats [('calls_captured', 8), ('unique_graphs', 3)]
aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('ok', 1)]
inductor [('triton_bundler_save_kernel', 8), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1), ('fxgraph_cache_bypass', 1)]
graph_break []
F

==================================================== FAILURES =====================================================
_____________________________ TestFlexAttentionCUDA.test_GQA_score_mod1_cuda_float16 ______________________________
Traceback (most recent call last):
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper
    method(*args, **kwargs)
  File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper
    method(*args, **kwargs)
  File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 446, in instantiated_test
    raise rte
  File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 1349, in dep_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/testing/_internal/common_device_type.py", line 1215, in dep_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/test/inductor/test_flex_attention.py", line 1430, in test_GQA
    self.run_test(*inputs)
  File "/home/drisspg/meta/pytorch/test/inductor/test_flex_attention.py", line 566, in run_test
    compiled_out.backward(backward_grad)
  File "/home/drisspg/meta/pytorch/torch/_tensor.py", line 625, in backward
    torch.autograd.backward(
  File "/home/drisspg/meta/pytorch/torch/autograd/__init__.py", line 354, in backward
    _engine_run_backward(
  File "/home/drisspg/meta/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/autograd/function.py", line 315, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2303, in backward
    return impl_fn()
           ^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2289, in impl_fn
    out = CompiledFunction._backward_impl(ctx, all_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2394, in _backward_impl
    CompiledFunction.compiled_bw = aot_config.bw_compiler(
                                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_functorch/_aot_autograd/schemas.py", line 1256, in __call__
    return self.compiler_fn(gm, example_inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_dynamo/backends/common.py", line 76, in _wrapped_bw_compiler
    disable(
  File "/home/drisspg/meta/pytorch/torch/_dynamo/eval_frame.py", line 1005, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_utils_internal.py", line 92, in wrapper_function
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 2428, in bw_compiler
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 773, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_dynamo/repro/after_aot.py", line 124, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 952, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 1652, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/compile_fx.py", line 1506, in codegen_and_compile
    compiled_module = graph.compile_to_module()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/graph.py", line 2318, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/graph.py", line 2328, in _compile_to_module
    mod = self._compile_to_module_lines(wrapper_code)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/graph.py", line 2396, in _compile_to_module_lines
    mod = PyCodeCache.load_by_key_path(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/codecache.py", line 3466, in load_by_key_path
    mod = _reload_python_module(key, path, set_sys_modules=in_toplevel)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/compile_tasks.py", line 33, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/tmp0yiz3c94/az/caza2gzmsagyuusmf2ka3oat3na4xv6zudssk244xmlzsbv2knze.py", line 117, in <module>
  File "/home/drisspg/meta/pytorch/torch/_inductor/async_compile.py", line 489, in triton
    kernel.precompile(
  File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 437, in precompile
    self._precompile_worker()
  File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 459, in _precompile_worker
    compile_results.append(self._precompile_config(c))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/meta/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 748, in _precompile_config
    binary = triton.compile(*compile_args, **compile_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/compiler/compiler.py", line 359, in compile
    next_module = compile_ir(module, metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 456, in <lambda>
    stages["ttgir"] = lambda src, metadata: self.make_ttgir(src, metadata, options, capability)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drisspg/.conda/envs/dev/lib/python3.12/site-packages/triton/backends/nvidia/compiler.py", line 298, in make_ttgir
    pm.run(mod)
RuntimeError: PassManager::run failed

To execute this test, run the following from the base repo dir:
    python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_score_mod1_cuda_float16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
============================================= short test summary info =============================================
FAILED [5.1441s] test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_GQA_score_mod1_cuda_float16 - RuntimeError: PassManager::run failed
================================== 1 failed, 1 passed, 1404 deselected in 18.10s ==================================
~/meta/pytorch flex-warning !1 ❯
```
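
The final store in the kernel dump above writes dk through a hard-coded flat index, `xindex = index_k + 64*index_n + 16384*off_hkv + 65536*off_zq`. As a rough sanity check, and assuming the output buffer is contiguous, those constants are consistent with a `[B, 4, 256, 64]` layout. The shape values are inferred from the strides and are not stated in the log, and the helper names `contiguous_strides` and `strided_offset` are made up for this sketch:

```python
def contiguous_strides(shape):
    """Return row-major (C-contiguous) strides, in elements, for `shape`."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


# Assumed shape [B, Hkv, KV_LEN, HEAD_DIM]; B is arbitrary because the
# outermost dimension does not influence any stride.
B, HKV, KV_LEN, HEAD_DIM = 2, 4, 256, 64
strides = contiguous_strides((B, HKV, KV_LEN, HEAD_DIM))


def xindex(off_zq, off_hkv, index_n, index_k):
    """Flat offset exactly as written in the generated kernel."""
    return index_k + 64 * index_n + 16384 * off_hkv + 65536 * off_zq


def strided_offset(off_zq, off_hkv, index_n, index_k):
    """The same offset computed from the inferred strides."""
    stride_z, stride_h, stride_n, stride_k = strides
    return (
        off_zq * stride_z
        + off_hkv * stride_h
        + index_n * stride_n
        + index_k * stride_k
    )
```

If the real buffer were not contiguous, the constants in `xindex` would differ, so this only checks that the inferred shape is self-consistent.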

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160227
Approved by: https://github.com/Skylion007, https://github.com/Chillee
2025-08-11 23:30:20 +00:00
99bc2f94c1 Update export/schema.py (#160220)
Summary:
A Model can have multiple ExportedPrograms:
- for different methods. They can have different weights.
- for different delegates. They can also have different weights.

For this reason, we store weights per ExportedProgram.

We also clean up Model and Program. IIUC, Model and Program are not used anywhere, so it's OK to make a BC-breaking change.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79917395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160220
Approved by: https://github.com/angelayi, https://github.com/dolpm, https://github.com/jingsh
2025-08-11 23:14:08 +00:00
fc25c68f20 [hop][exc] make UncapturedHigherOrderOpError print user code and avoid re-raise (#159296)
After this change, the error stack trace includes the user code stack and is collapsed into a single traceback (without the "scroll up" message). For example:
```python
    class Test(torch.nn.Module):
        def forward(self, c, x):
            def cond_fn(c, x):
                return c > 0 and x.size(0) < 20

            def body_fn(c, x):
                return c - 1, x.sin()

            return torch._higher_order_ops.while_loop(cond_fn, body_fn, (c, x))
```

Now gives the following error message:
```python
Traceback (most recent call last):
  File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1705, in test_while_loop_size_mismatch_tensor_expansion
    self._run_test(
    ~~~~~~~~~~~~~~^
        model=WhileLoopModels.SizeMismatchTensorExpansion(),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        dynamic=dynamic,
        ^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1417, in _run_test
    result = model(*inputs_with_counters)
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1053, in forward
    return torch._higher_order_ops.while_loop(cond_fn, body_fn, (c, x))
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 176, in while_loop
    return torch.compile(
           ~~~~~~~~~~~~~~
        _while_loop_op_wrapper, backend=backend, fullgraph=True
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    )(flat_cond_fn, flat_body_fn, tuple(flat_inputs), tuple())
    ~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1595, in __call__
    result = self._torchdynamo_orig_backend(
        frame, cache_entry, self.hooks, frame_state, skip=1
    )
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1353, in __call__
    result = self._inner_convert(
        frame, cache_entry, hooks, frame_state, skip=skip + 1
    )
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 682, in __call__
    result = _compile(
        frame.f_code,
    ...<16 lines>...
        convert_frame_box=self._box,
    )
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 1172, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/yidi/local/pytorch/torch/_utils_internal.py", line 98, in wrapper_function
    return function(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 858, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 897, in _compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1461, in transform_code_object
    transformations(instructions, code_options)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 300, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 818, in transform
    tracer.run()
    ~~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3528, in run
    super().run()
    ~~~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run
    while self.step():
          ~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step
    self.dispatch_table[inst.opcode](self, inst)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 852, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2240, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1200, in call_function
    self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
              ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward
    return getattr(self.realize(), name)(*args, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 91, in graph_break_as_hard_error
    raise exc.with_traceback(sys.exc_info()[2]) from None
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 77, in graph_break_as_hard_error
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1287, in call_function
    ) = speculate_subgraph(
        ~~~~~~~~~~~~~~~~~~^
        tx,
        ^^^
    ...<33 lines>...
        supports_aliasing=self.supports_aliasing,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 877, in speculate_subgraph
    raise ex
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 718, in speculate_subgraph
    output = f.call_function(tx, args, sub_kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 580, in call_function
    return super().call_function(tx, args, kwargs)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 334, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1217, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3733, in inline_call
    return tracer.inline_call_()
           ~~~~~~~~~~~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3936, in inline_call_
    self.run()
    ~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run
    while self.step():
          ~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step
    self.dispatch_table[inst.opcode](self, inst)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 852, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2240, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1200, in call_function
    self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
              ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward
    return getattr(self.realize(), name)(*args, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 580, in call_function
    return super().call_function(tx, args, kwargs)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 334, in call_function
    return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1217, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3733, in inline_call
    return tracer.inline_call_()
           ~~~~~~~~~~~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 3936, in inline_call_
    self.run()
    ~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1372, in run
    while self.step():
          ~~~~~~~~~^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1276, in step
    self.dispatch_table[inst.opcode](self, inst)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 830, in inner
    unimplemented_v2(
    ~~~~~~~~~~~~~~~~^
        gb_type="Data-dependent branching",
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<5 lines>...
        ],
        ^^
    )
    ^
  File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 580, in unimplemented_v2
    raise Unsupported(msg)
torch._dynamo.exc.UncapturedHigherOrderOpError: while_loop doesn't work unless it is captured completely with torch.compile. Got Data-dependent branching
  Explanation: Detected data-dependent branching (e.g. `if my_tensor.sum() > 0:`). Dynamo does not support tracing dynamic control flow.
  Hint: This graph break is fundamental - it is unlikely that Dynamo will ever be able to trace through your code. Consider finding a workaround.
  Hint: Use `torch.cond` to express dynamic control flow.

  Developer debug context: attempted to jump with TensorVariable()

 For more details about this graph break, please visit: https://pytorch-labs.github.io/compile-graph-break-site/gb/gb0170.html

from user code:
   File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 167, in _while_loop_op_wrapper
    return while_loop_op(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_higher_order_ops/while_loop.py", line 137, in flat_cond_fn
    return cond_fn(*carried, *additional)
  File "/home/yidi/local/pytorch/test/inductor/test_control_flow.py", line 1047, in cond_fn
    return c > 0 and x.size(0) < 20

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

To execute this test, run the following from the base repo dir:
    python test/inductor/test_control_flow.py WhileLoopTests.test_while_loop_size_mismatch_tensor_expansion_device_cpu_dynamic_False

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
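
The failing line is `return c > 0 and x.size(0) < 20`. Python's `and` has to decide the truthiness of its left operand, so it calls `bool()` on the result of `c > 0`; when that result is a traced tensor, this is the data-dependent branch that Dynamo reports. The pure-Python sketch below imitates the mechanism with a hypothetical `FakeTensor` stand-in (not a PyTorch class) that records every implicit coercion to `bool`:

```python
coercions = []  # names of FakeTensors that were implicitly converted to bool


class FakeTensor:
    """Hypothetical stand-in for a traced tensor; not a real PyTorch type."""

    def __init__(self, value, name):
        self.value = value
        self.name = name

    def __gt__(self, other):
        # Like a real tensor comparison, this returns a tensor-like object.
        return FakeTensor(self.value > other, f"({self.name} > {other})")

    def __bool__(self):
        # `and`, `or`, `if` and `while` all end up here.
        coercions.append(self.name)
        return bool(self.value)


def cond_fn(c, size0):
    # Same structure as the cond_fn in the commit message.
    return c > 0 and size0 < 20


c = FakeTensor(3, "c")
comparison_only = c > 0  # the comparison alone does not coerce anything
coercions_after_compare = list(coercions)
result = cond_fn(c, 5)  # `and` forces bool(c > 0)
```

In the real trace the value of `c > 0` is not known at compile time, which is presumably why the `bool()` call cannot be evaluated and the error points at `cond_fn`.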

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159296
Approved by: https://github.com/zou3519
2025-08-11 22:48:10 +00:00
5a40c57844 [MTIA] Implement isAvailable() for MTIA hooks (#160304)
Summary: MTIA is missing the `isAvailable()` override, which is necessary for some of the device agnostic methods.

Test Plan:
`torch._C._get_accelerator()`

Rollback Plan:

Differential Revision: D79981115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160304
Approved by: https://github.com/nautsimon
2025-08-11 21:45:11 +00:00
7d2ec704e4 Fix MPS autocast for ConvTranspose3d (#160345)
## Summary
- ensure ConvTranspose3d uses fp32 under MPS autocast
- add MPS autocast test for ConvTranspose3d

Generated by Codex, see https://chatgpt.com/codex/tasks/task_e_689a360388288327a2cac6f55bbfc42c

Fixes https://github.com/pytorch/pytorch/issues/160332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160345
Approved by: https://github.com/dcci
2025-08-11 21:01:52 +00:00
fc80f6859e Fix collective schedule logging and runtime tests (#160260)
Summary:

- Fix collective schedule logging so that it only logs when collectives are present
- Fix runtime estimate test to check if each op has a number value

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160260
Approved by: https://github.com/Skylion007
2025-08-11 20:58:52 +00:00
cf0a0dcb0a Make user defined Triton kernels serializable for fx_graph_runnable (#160002)
Resolves issue https://github.com/pytorch/pytorch/issues/153475 where `fx_graph_runnable` didn't work with user defined triton kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160002
Approved by: https://github.com/eellison
2025-08-11 20:54:33 +00:00
b149c7204c Revert "port distributed pipeline test files for Intel GPU (#159033)"
This reverts commit 76a0609b6bddb2bc40f1eb4ade12885023653d59.

Reverted https://github.com/pytorch/pytorch/pull/159033 on behalf of https://github.com/clee2000 due to broke test_cpp_extensions_stream_and_event.py::TestCppExtensionStreamAndEvent::test_stream_event [GH job link](https://github.com/pytorch/pytorch/actions/runs/16890370216/job/47849586456) [HUD commit link](76a0609b6b) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/159033#issuecomment-3176833314))
2025-08-11 20:44:45 +00:00
09381f5dac Revert "[Graph Partition] Pass all OSS unit tests (#154667)"
This reverts commit ca7315c17162ea21b1ca5ba23f4bf6168766c7b9.

Reverted https://github.com/pytorch/pytorch/pull/154667 on behalf of https://github.com/clee2000 due to broke inductor/test_memory.py::TestOperatorReorderForPeakMemory::test_reorder_peak_memory_lpmf [GH job link](https://github.com/pytorch/pytorch/actions/runs/16885961204/job/47836769279) [HUD commit link](ca7315c171) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/154667#issuecomment-3176805477))
2025-08-11 20:34:27 +00:00
9eedd2a20b [PGO] no counterfactual suggestions for dynamic allowlist (#160231)
Be more conservative with allowlist suggestions as we roll them out; we now only suggest sources that were dynamic in previous runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160231
Approved by: https://github.com/bobrenjc93
2025-08-11 20:13:25 +00:00
c3dc8dc412 159965 is merged, no need to patch it in (#160275)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160275
Approved by: https://github.com/albanD, https://github.com/ZainRizvi
2025-08-11 19:55:04 +00:00
76a0609b6b port distributed pipeline test files for Intel GPU (#159033)
This PR ports all distributed pipeline test files.
We enable Intel GPU with the following methods while trying to keep the original code style:

1. instantiate_device_type_tests()
2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend
3. use "requires_accelerator_dist_backend()" to replace requires_nccl()
4. use "get_default_backend_for_device()" to get the backend
5. enable XPU for some test paths
6. add TEST_MULTIACCELERATOR in common_utils for all backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159033
Approved by: https://github.com/guangyey, https://github.com/d4l3k

Co-authored-by: Daisy Deng <daisy.deng@intel.com>
2025-08-11 19:43:15 +00:00
c8205cb354 [autograd] match 0-dim gradients device type regardless of subclassness (#160165)
Not sure if there are some subclasses where outer.dim() == 0 but you wouldn't want to move it?

FIXES https://github.com/pytorch/pytorch/issues/160084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160165
Approved by: https://github.com/ezyang, https://github.com/albanD
2025-08-11 17:57:32 +00:00
d25c4f954d [MPS] Type-promote tensor-iterator common dtype (#160334)
Otherwise, `torch.add(FloatTensor, IntTensor, alpha=2)` and `torch.add(FloatTensor, IntTensor, alpha=2)` were dispatched to different kernels

Fixes https://github.com/pytorch/pytorch/issues/160208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160334
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-08-11 17:53:56 +00:00
d0e2240f68 [triton_heuristics] Optimize the triton launcher in pt2 (#160000)
Summary:

(Original author: Xu Zhao. Commandeered by David to land this since it is relatively urgent)

We observed ~10us PT2-Triton launch overhead regression after pin update.

Before Triton pin-update:
 {F1980557238}

After Triton pin-update:
 {F1980557240}

The root cause is that https://github.com/pytorch/pytorch/pull/145051 adds `_get_args_with_constexprs` to the cubin launcher caller function, which is on the critical path.

The motivation for `_get_args_with_constexprs` was that between triton 3.2 and triton 3.3, the convention for calling Triton kernels (at the level that non-static-cuda-launcher inductor integrates) changed. Previously, the callable did not take constexpr arguments as parameters; after 3.3, it does. With pointwise/reduction kernels, we don't know the constexpr values until after autotuning occurs; so `_get_args_with_constexprs` would inject constexprs into the arguments list before calling the Triton kernel. The fix (in this PR) is to instead inject the constexpr args into the launcher string - this avoids the cost of sorting/reordering arguments which previously occurred upon execution of each kernel.
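
The difference between the two approaches can be sketched in plain Python. This is a minimal illustration, not the inductor implementation: `kernel`, `make_launcher_reordering`, and `make_launcher_baked` are hypothetical names, and the real generated launcher carries many more arguments.

```python
def kernel(x, y, BLOCK, n):
    """Toy stand-in for a compiled kernel whose signature includes a constexpr (BLOCK)."""
    return (x, y, BLOCK, n)


def make_launcher_reordering(fn, arg_names, constexprs):
    """Old-style sketch: merge constexprs into the argument list on every call."""
    runtime_names = [n for n in arg_names if n not in constexprs]

    def launcher(*args):
        values = dict(zip(runtime_names, args))
        values.update(constexprs)
        # Per-call reordering work that sits on the critical path.
        return fn(*[values[n] for n in arg_names])

    return launcher


def make_launcher_baked(fn, arg_names, constexprs):
    """New-style sketch: bake constexpr values into the launcher source once."""
    params = [n for n in arg_names if n not in constexprs]
    call_args = [repr(constexprs[n]) if n in constexprs else n for n in arg_names]
    src = (
        f"def launcher({', '.join(params)}):\n"
        f"    return fn({', '.join(call_args)})\n"
    )
    scope = {"fn": fn}
    exec(src, scope)  # code generation happens once, after the values are known
    return scope["launcher"]


names = ["x", "y", "BLOCK", "n"]
slow = make_launcher_reordering(kernel, names, {"BLOCK": 128})
fast = make_launcher_baked(kernel, names, {"BLOCK": 128})
```

Both launchers produce the same call, but the baked version does no dictionary building or reordering when the kernel is executed.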

Note that static_cuda_launcher.py does not require constants to be passed to the cubin launcher (e96c7c4bb0/torch/_inductor/runtime/static_cuda_launcher.py (L220)), so there is no need to pass constexprs to the generated launcher code.

The new launcher code needs to work in three cases:
- StaticallyLaunchedCudaKernel
- triton.compile.CompiledKernel
- AOTInductor

Analysis: https://docs.google.com/document/d/1PHaSmx2w59K8qpjw5_qzKWShfEgptf_Zpv_DL7YxiWU/edit?tab=t.0

Test Plan:
Before:
```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs

1.893x
```

```

$ buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00760921                       1.80298                                   0.623282                         5.25024                                  0.203722
     19                      0.00799885                       4.78223                                   1.00226                          5.8213                                   0.239084
average                      0.00780403                       3.29261                                   0.812769                         5.53577                                  0.221403
```

After:

```
buck2 run mode/opt //pytorch/tritonbench:run -- --op launch_latency
  x_val    nop_python_function-walltime    nop_triton_kernel-walltime    nop_triton_compiled_kernel_run-walltime    nop_inductor_kernel-walltime    nop_inductor_kernel_cudagraph-walltime
-------  ------------------------------  ----------------------------  -----------------------------------------  ------------------------------  ----------------------------------------
      0                      0.00747067                       1.92589                                   0.726509                         4.35459                                  0.204205
     19                      0.00747823                       7.36852                                   1.26241                          6.28208                                  0.239278
average                      0.00747445                       4.6472                                    0.994459                         5.31834                                  0.221741
```

```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --performance --backend=inductor --training --amp --disable-cudagraphs

1.985x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160000
Approved by: https://github.com/jansel

Co-authored-by: Xu Zhao <xzhao9@meta.com>
2025-08-11 17:22:40 +00:00
9ccd0f5e31 Fix unbacked symint and memory leak in inductor memory planning (#159839)
Summary:

In memory planning, some allocation sizes involve unbacked symints. These unbacked symints are not known until they are computed at run time, so **allocation pools that involve unbacked symints cannot be allocated until we have the values of the unbacked symints**.

So we add a notion of `earliest_available` to Allocation nodes. If an allocation node has an unbacked symint, it is available only when its live range begins.

Then in AllocationPool, if a pool involves an Allocation node that has an earliest available time, we restrict its live range.

If a block's earliest available time is later than the start of a pool's live range, we cannot allocate it from the pool.
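
The rule above can be sketched as a small check. This is a simplified model under stated assumptions: `Block`, `Pool`, and `can_allocate` are hypothetical names for illustration and do not mirror the actual classes in inductor's memory planning.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    """A hypothetical allocation request with a live range [begin, end]."""
    begin: int
    end: int
    has_unbacked_symint: bool = False

    @property
    def earliest_available(self) -> Optional[int]:
        # A size that depends on an unbacked symint is only known once the
        # block's live range begins.
        return self.begin if self.has_unbacked_symint else None


@dataclass
class Pool:
    """A hypothetical pool that is allocated at `start` and freed at `end`."""
    start: int
    end: int

    def can_allocate(self, block: Block) -> bool:
        ea = block.earliest_available
        # The pool cannot be created before the block's size is known.
        if ea is not None and ea > self.start:
            return False
        return self.start <= block.begin and block.end <= self.end


pool = Pool(start=2, end=10)
static_block = Block(begin=4, end=8)
dynamic_block = Block(begin=4, end=8, has_unbacked_symint=True)
early_dynamic_block = Block(begin=2, end=8, has_unbacked_symint=True)
```

Here `dynamic_block` is rejected because its size only becomes known at step 4, after the pool would already have been allocated at step 2.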

We also fix a memory leak that's caused by allocating a tensor without wrapping it in a RAIIAtenTensorHandle.

In the python wrapper for JIT inductor, `codegen_alloc_from_pool` doesn't actually write the alloc lines to the wrapper; it just returns the alloc string. However, in cpp_wrapper, `codegen_alloc_from_pool` actually writes to the wrapper. Specifically, it writes the following and returns the string `RAIIAtenTensorHandle`.

```
AtenTensorHandle handle_name;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__alloc_from_pool(....);
```

This is bug-prone. **If you write aoti_torch__alloc_from_pool lines, you must write the RAIIAtenTensorHandle as well**, otherwise you get memory leaks.

We remove the alloc_from_pool call from codegen_create, because this doesn't work for AOTI. In python wrapper, we can generate the same alloc_from_pool variable name for the same block, but cpp_wrapper will generate a different variable name for each call to alloc_from_pool.

Test Plan:
```
 python test/inductor/test_memory_planning.py
```

Rollback Plan:

Differential Revision: D79603119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159839
Approved by: https://github.com/jansel
2025-08-11 17:16:15 +00:00
ca7315c171 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

We ran the same diff on two days, and both runs show a speedup on average.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-11 16:25:12 +00:00
68a4b4b2e3 [codemod] Fix unreachable-break issue in caffe2/c10/cuda/CUDAFunctions.cpp +2 (#160257)
Summary:
LLVM has a warning `-Wunreachable-code-break` which identifies `break` statements that cannot be reached. These compromise readability, are misleading, and may identify bugs. This diff removes such statements.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan:
Sandcastle

Rollback Plan:

Differential Revision: D79835614

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160257
Approved by: https://github.com/Skylion007
2025-08-11 16:09:24 +00:00
80cca83079 [inductor] Skip some AOTI UTs on Windows. (#160287)
Skip some AOTI UTs on Windows, since AOTI is not fully ready there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160287
Approved by: https://github.com/ezyang
2025-08-11 13:50:43 +00:00
515cb70367 [inductor] normalize_path_separator for test_different_file_paths_local_pgo (#160286)
`normalize_path_separator` for test_different_file_paths_local_pgo
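
Why the test needs this can be shown with a tiny sketch. `normalize_separators_sketch` is a hypothetical stand-in written for illustration; it is not the helper used by the test, and it assumes that normalizing means converting backslashes to forward slashes.

```python
def normalize_separators_sketch(path: str) -> str:
    """Use forward slashes so the same path compares equal on every platform."""
    return path.replace("\\", "/")


windows_style = "C:\\tmp\\cache\\model.py"
posix_style = "C:/tmp/cache/model.py"
same_after_normalizing = (
    normalize_separators_sketch(windows_style)
    == normalize_separators_sketch(posix_style)
)
```

Without normalization, a test that compares file paths as strings fails on Windows even though both strings name the same file.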

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160286
Approved by: https://github.com/ezyang
2025-08-11 13:50:18 +00:00
cyy
c184cb3852 [submodule] Bump fbgemm to latest (#158210)
Merge the recent commits of FBGEMM and remove unnecessary CMake code.
Specifically, we
1. enable `fbgemm_autovec` since the target is now correctly handled.
2. remove option `USE_FAKELOWP` which is not used.
3. remove `CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS` check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158210
Approved by: https://github.com/q10
2025-08-11 13:48:02 +00:00
2259dbed4e Update slow tests (#158222)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158222
Approved by: https://github.com/pytorchbot
2025-08-11 12:00:13 +00:00
05029ad1c3 [xla hash update] update the pinned xla hash (#160306)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160306
Approved by: https://github.com/pytorchbot
2025-08-11 11:28:49 +00:00
cyy
cf4964be68 Remove unnecessary CMake checks for glog (#158185)
With the update to CMake 3.27, some old scripts can be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158185
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-08-11 10:14:47 +00:00
ecea81117b Fix clang builds by adding headers (#160252)
Clang compiler from llvm-14 fails to build full torch from source with the message
```
no template named 'unordered_map' in namespace 'std'
  std::unordered_map<std::string, HandlerFunc> handlers_{};
 ~~~~~^
```
A similar issue is reported at https://github.com/intel/llvm/issues/5264
The fix is to add the correct headers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160252
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-08-11 09:03:14 +00:00
1c2cba17ea [FR] Add stack_id and an optional print of stack_id to stack_trace mapping (#160119)
To better help users debug with FR, we want to add stack_id and print a map between stack_id and stack_trace (optional)
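
One way to picture the stack_id idea is an interning table: identical stack traces share one small id, and the id-to-trace map is only printed on request. The sketch below is illustrative; `StackRegistry` and its methods are hypothetical names, not the FR implementation.

```python
from typing import Dict, List, Tuple


class StackRegistry:
    """Hypothetical helper that interns stack traces and hands out small ids."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, ...], int] = {}

    def stack_id(self, frames: List[str]) -> int:
        key = tuple(frames)
        # Reuse the id if this exact stack has been seen before.
        return self._ids.setdefault(key, len(self._ids))

    def mapping(self) -> Dict[int, List[str]]:
        """Return the optional stack_id -> stack_trace table for printing."""
        return {sid: list(frames) for frames, sid in self._ids.items()}


registry = StackRegistry()
a = registry.stack_id(["train.py:10", "ddp.py:42"])
b = registry.stack_id(["train.py:20", "ddp.py:42"])
c = registry.stack_id(["train.py:10", "ddp.py:42"])
```

Each record then only needs to carry the integer id, which keeps per-entry output short while the full trace stays recoverable from the map.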

Screenshot:

<img width="1029" height="529" alt="image" src="https://github.com/user-attachments/assets/8404a1d3-cc33-4f5f-971b-29609ec316c1" />

<img width="1620" height="358" alt="image" src="https://github.com/user-attachments/assets/3dd29c8c-ff68-41a2-acfd-e770036cfeb1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160119
Approved by: https://github.com/H-Huang, https://github.com/wconstab
2025-08-11 07:27:10 +00:00
ff0d56d035 [Inductor] [Triton] Enable Configuration warmup/rep iterations when benchmarking in inductor (#159982)
Summary:
When benchmarking on B200 Max Autotune, I discovered that the estimations from the autotune logs consistently produced a better ATEN result by > 20% on an example shape. Here is an example of the output:

```
Autotune Choices Stats:
{"num_choices": 20, "num_triton_choices": 19, "best_kernel": "mm", "best_time": 0.3081120103597641, "best_triton_pos": 1, "best_triton_time": 0.6589759886264801, "best_triton_kernel": "triton_mm_16", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"}
AUTOTUNE mm(3840x1152, 1152x49136)
strides: [1, 3840], [49152, 1]
dtypes: torch.bfloat16, torch.bfloat16
  mm 0.3081 ms 100.0%
  triton_mm_16 0.6590 ms 46.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_17 0.6830 ms 45.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_13 0.7015 ms 43.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_9 0.8487 ms 36.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_11 0.8695 ms 35.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_10 0.8797 ms 35.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_18 0.9089 ms 33.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_14 0.9718 ms 31.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_15 1.0169 ms 30.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
SingleProcess AUTOTUNE benchmarking takes 2.8574 seconds and 0.1032 seconds precompiling for 20 choices
Removed 3483 outliers from 28645 samples
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:20<00:00, 20.00s/it]
          (M, N, K)    pt2_matmul_maxautotune-latency    pt2_matmul_maxautotune-speedup    pt2_matmul_maxautotune-tflops
-------------------  --------------------------------  --------------------------------  -------------------------------
(3840, 49136, 1152)                 0.359392 (±8.27%)                                                            1209.61
            average                                                                                              1209.61
```

Based on my reading about B200 power usage, I believe this is due to the need for power-aware benchmarking, as a kernel may perform better in short bursts. This PR adds environment variables to expand autotuning iterations so we can get more consistent results between the estimation and the actual runtime. I did not update the default yet, even for B200, because I'm not sure how this is used in practice.
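
A rough sketch of how environment-controlled warmup and repetition counts can feed a benchmark loop is shown below. Only `TORCH_AUTOTUNE_REP` appears in this commit's test plan; the warmup variable name, both default values, and the helper names `read_iterations` and `benchmark` are assumptions made for illustration.

```python
import os
import time
from typing import Callable, Mapping


def read_iterations(env: Mapping[str, str], name: str, default: int) -> int:
    """Read a positive integer setting from the environment, else use the default."""
    raw = env.get(name)
    if raw is None:
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")
    return value


def benchmark(fn: Callable[[], object], env: Mapping[str, str] = os.environ) -> float:
    """Return the mean wall time per call in milliseconds."""
    # The warmup variable name and both defaults are assumptions of this sketch.
    warmup = read_iterations(env, "TORCH_AUTOTUNE_WARMUP", 25)
    rep = read_iterations(env, "TORCH_AUTOTUNE_REP", 100)
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(rep):
        fn()
    return (time.perf_counter() - start) * 1e3 / rep
```

Raising the repetition count lengthens each measurement, which is the lever the description relies on to make the autotune estimate track sustained runtime more closely.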

This is the new output:

```
Autotune Choices Stats:
{"num_choices": 20, "num_triton_choices": 19, "best_kernel": "mm", "best_time": 0.3848319947719574, "best_triton_pos": 1, "best_triton_time": 0.6287680268287659, "best_triton_kernel": "triton_mm_16", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"}
AUTOTUNE mm(3840x1152, 1152x49136)
strides: [1, 3840], [49152, 1]
dtypes: torch.bfloat16, torch.bfloat16
  mm 0.3848 ms 100.0%
  triton_mm_16 0.6288 ms 61.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_13 0.6299 ms 61.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_17 0.6728 ms 57.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_9 0.7189 ms 53.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_18 0.8566 ms 44.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_11 0.8693 ms 44.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_14 0.9298 ms 41.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_10 0.9524 ms 40.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_15 1.0216 ms 37.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
SingleProcess AUTOTUNE benchmarking takes 3.9245 seconds and 0.0965 seconds precompiling for 20 choices
Removed 3537 outliers from 29530 samples
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.70s/it]
          (M, N, K)    pt2_matmul_maxautotune-latency    pt2_matmul_maxautotune-speedup    pt2_matmul_maxautotune-tflops
-------------------  --------------------------------  --------------------------------  -------------------------------
(3840, 49136, 1152)                 0.359328 (±9.71%)                                                            1209.82
            average                                                                                              1209.82
```

Test Plan:
`TORCH_AUTOTUNE_REP=1000 CUDA_VISIBLE_DEVICES=2 ENABLE_MMA_V5_ATT_PIPELINE=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 buck2 run mode/opt  //pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op gemm --iter $NUM_ITERS --input-loader /home/njriasan/parsed_shapes.json --only pt2_matmul_maxautotune`

Rollback Plan:

Reviewed By: NikhilAPatel

Differential Revision: D79737929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159982
Approved by: https://github.com/NikhilAPatel
2025-08-11 05:27:51 +00:00
334b38ccc4 Fix typo in README.md (#160160)
The "Get the PyTorch Source" section is now located before the "Install Dependencies/Common" section, so "... using the “Get the PyTorch Source“ section below" should be "... using the “Get the PyTorch Source“ section above".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160160
Approved by: https://github.com/BoyuanFeng
2025-08-11 05:09:59 +00:00
dc0d18e023 [CUDA] Remove the uncessary CUDA_GUARD (#160249)
`CUDA_GUARD` is unnecessary in `initDeviceStreamState`, because
the `initSingleStream` has already done it.

29712314dd/c10/cuda/CUDAStream.cpp (L202-L203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160249
Approved by: https://github.com/Skylion007
2025-08-11 05:08:05 +00:00
cyy
8ae4d2652f Tidy torch/csrc/jit/passes/onnx code (#160262)
Apply clang-tidy fixes to torch/csrc/jit/passes/onnx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160262
Approved by: https://github.com/justinchuby
2025-08-11 04:50:38 +00:00
8088cfa592 Add type assert for tensor_meta, based on real bug in autoparallel. (#157927)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157927
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/wconstab
2025-08-11 04:22:02 +00:00
d8cb3db533 Add unsigned support to IValue (#160102)
- Moved repeated logic of saving int64/uint64 into a polymorphic container into `THPUtils_unpackInteger`
- Added `TestPythonDispatch.test_dispatch_uint64` regression test
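
The shared logic can be pictured as one range check that decides which 64-bit container a Python integer fits into. The sketch below is a Python model of that decision only; `unpack_integer` is a hypothetical name and does not reproduce the C++ signature or return type of `THPUtils_unpackInteger`.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
UINT64_MAX = 2**64 - 1


def unpack_integer(value: int):
    """Return a (kind, value) pair, preferring int64 and falling back to uint64."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"expected an integer, got {type(value).__name__}")
    if INT64_MIN <= value <= INT64_MAX:
        return ("int64", value)
    if INT64_MAX < value <= UINT64_MAX:
        return ("uint64", value)
    raise OverflowError(f"{value} does not fit in 64 bits")
```

Values above the signed maximum take the unsigned path instead of overflowing, which is the class of input the regression test exercises.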

Fixes https://github.com/pytorch/pytorch/issues/159168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160102
Approved by: https://github.com/ezyang
2025-08-11 03:57:18 +00:00
e7152ff8a6 [inductor] fix some windows inductor UTs (#160292)
This PR is the UT part of https://github.com/pytorch/pytorch/pull/160161. Following @malfet's comment (https://github.com/pytorch/pytorch/pull/160161#pullrequestreview-3103812178), this PR does not land the turn-on change and only lands the UT part.

changes:
1. Fixed `test_invalid_artifact_flag_error_msg`.
2. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`.
3. Skipped the whole UT `test_cpu_select_algorithm.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160292
Approved by: https://github.com/malfet
2025-08-11 02:55:37 +00:00
842cc77ab9 [MPS] Extend addmm to integral types (#160270)
By adding an `addmm` kernel, which is a logical continuation of the `mm` one. The only tricky part is how the alpha and beta constants are handled: they are passed as `optmath_t`, i.e. they can be int64, int32, or float.
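
For reference, `addmm` computes `beta * input + alpha * (mat1 @ mat2)`. The pure-Python sketch below shows that formula on nested lists with integer scalars; it is a reference model for illustration only, unrelated to the Metal kernel code.

```python
def addmm(inp, mat1, mat2, *, beta=1, alpha=1):
    """Reference addmm on nested lists: beta * inp + alpha * (mat1 @ mat2)."""
    rows, inner, cols = len(mat1), len(mat2), len(mat2[0])
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Dot product of row i of mat1 with column j of mat2.
            acc = sum(mat1[i][k] * mat2[k][j] for k in range(inner))
            row.append(beta * inp[i][j] + alpha * acc)
        out.append(row)
    return out


result = addmm(
    [[1, 1], [1, 1]],
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    beta=2,
    alpha=3,
)
```

With integer inputs and integer `alpha` and `beta`, every intermediate stays an integer, which is why the scalar type has to follow the tensor dtype.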

Unified all MM flavor instantiations through `INSTANTIATE_MM_OPS` and verified that the `addmm` Metal kernel also works as expected for floating types by running
```
 PYTORCH_MPS_PREFER_METAL=1 python test/test_mps.py -v -k test_output_match_addmm_mps_
```

Fixes https://github.com/pytorch/pytorch/issues/154901
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160270
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #160228, #160234
2025-08-11 00:54:17 +00:00
b602ea9cab Revert "[inductor] turn on windows inductor UTs (#160161)"
This reverts commit 4416433c7c625127b7f975c92f8ec98ea4c67fd3.

Reverted https://github.com/pytorch/pytorch/pull/160161 on behalf of https://github.com/xuhancn due to auto merged with two related issue ([comment](https://github.com/pytorch/pytorch/pull/160161#issuecomment-3172982125))
2025-08-11 00:04:25 +00:00
4416433c7c [inductor] turn on windows inductor UTs (#160161)
With this PR, we can turn on the inductor UTs on Windows CPU.

changes:
1. Turn on inductor UTs on Windows CPU.
2. Add a shard to balance the added UTs; otherwise the job would time out.
3. Fixed `test_invalid_artifact_flag_error_msg`.
4. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`.
5. Skipped the whole UT `test_cpu_select_algorithm.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160161
Approved by: https://github.com/jansel
2025-08-10 23:18:35 +00:00
05c19d1ace [Inductor] Add back the revert part (#160054)
Add back the reverted code (https://github.com/pytorch/pytorch/pull/159809) as we've figured out the actual root cause of the internal test failures. More details in the internal diff.
Rollback Plan:

Differential Revision: D79776691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160054
Approved by: https://github.com/blaine-rister
2025-08-10 19:20:30 +00:00
d6786741a7 [inductor] slow test some Windows UTs. (#160267)
After Windows inductor UTs were enabled by https://github.com/pytorch/pytorch/pull/160161/,
the main branch CI started hitting timeouts. Let's move some UTs to the slow tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160267
Approved by: https://github.com/ezyang
2025-08-10 18:35:42 +00:00
7ae0629d64 Revert "[inductor] turn on windows inductor UTs (#160161)"
This reverts commit f0980fc0bbd656d6c02d23ad97e945353b314f35.

Reverted https://github.com/pytorch/pytorch/pull/160161 on behalf of https://github.com/clee2000 due to broke some inductor tests on windows inductor\test_codecache.py::TestStandaloneCompile::test_different_process [GH job link](https://github.com/pytorch/pytorch/actions/runs/16853706010/job/47748778757) [HUD commit link](f0980fc0bb).  note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/160161#issuecomment-3172784292))
2025-08-10 17:33:19 +00:00
0e3e377bd5 [inductor] fix CompiledArtifact.load path on Windows. (#160268)
fix CompiledArtifact.load path on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160268
Approved by: https://github.com/ezyang
2025-08-10 14:22:52 +00:00
a84b60c0c4 [MPS] Sparse coalesce more dtypes to match cpu (#160254)
More dtypes to match the cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160254
Approved by: https://github.com/malfet
2025-08-10 12:25:18 +00:00
3ac86e728d Add Alban and Piotr to list of maintainers (#160187)
Add Alban and Piotr to list of maintainers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160187
Approved by: https://github.com/albanD
2025-08-10 12:00:16 +00:00
c9671dc865 Delete Python reference implementation from torchdim, as it is untested (#160115)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160115
Approved by: https://github.com/albanD
2025-08-10 11:21:33 +00:00
af10f1f86c Fix requires_cuda to requires_cuda_and_triton (#160222)
Fixes #159399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160222
Approved by: https://github.com/janeyx99
2025-08-10 07:05:52 +00:00
5dddcd5b07 Correctly copy self.module_stack in ModuleStackTracer (#159956)
There is a bigger cluster of issues which this does not completely fix, but I think this is a matter of good hygiene, especially because we immediately mutate the dict after assigning it.
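
The hazard being addressed is ordinary dict aliasing, sketched below. `Tracer`, `snapshot_aliased`, and `snapshot_copied` are hypothetical names for illustration and are not the ModuleStackTracer code.

```python
import copy


class Tracer:
    """Hypothetical tracer that snapshots a module stack for each created node."""

    def __init__(self) -> None:
        self.module_stack = {}

    def snapshot_aliased(self):
        # Bug pattern: the "snapshot" is the same dict object as the live stack.
        return self.module_stack

    def snapshot_copied(self):
        # Fix pattern: a shallow copy is decoupled from later pushes and pops.
        return copy.copy(self.module_stack)


tracer = Tracer()
tracer.module_stack["layer1"] = "Linear"
aliased = tracer.snapshot_aliased()
copied = tracer.snapshot_copied()
tracer.module_stack["layer2"] = "ReLU"  # mutation right after the assignment
```

After the mutation, `aliased` has silently gained `layer2` while `copied` still reflects the stack at the moment it was taken.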

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159956
Approved by: https://github.com/pianpwk
2025-08-10 03:33:59 +00:00
d3d359dbaf Revert "Fix get_free_symbol_uses for several nodes. (#160134)"
This reverts commit db78943a1ca13a32a3d6045eb15e2b719ee13a2f.

Reverted https://github.com/pytorch/pytorch/pull/160134 on behalf of https://github.com/malfet due to No, those are not pre-existing, see df55ec7d4b/1 ([comment](https://github.com/pytorch/pytorch/pull/160134#issuecomment-3172314322))
2025-08-10 02:37:40 +00:00
df55ec7d4b [OpInfo][BE] Better inputs for addmm (#160234)
Right now alpha and beta are both less than one, which makes them useless for all addmm samples for integral types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160234
Approved by: https://github.com/Skylion007
ghstack dependencies: #160228
2025-08-10 01:26:48 +00:00
f0980fc0bb [inductor] turn on windows inductor UTs (#160161)
With this PR, we can turn on the inductor UTs on Windows CPU.

changes:
1. Turn on inductor UTs on Windows CPU.
2. Add a shard to balance the added UTs; otherwise the job would time out.
3. Fixed `test_invalid_artifact_flag_error_msg`.
4. Skipped `test_distributed_rank_logging` and `test_disable_recursive_false`.
5. Skipped the whole UT `test_cpu_select_algorithm.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160161
Approved by: https://github.com/jansel
2025-08-09 21:06:00 +00:00
db78943a1c Fix get_free_symbol_uses for several nodes. (#160134)
get_free_symbol_uses is used to determine which unbacked symbols are used by a given node.
Not having get_free_symbol_uses defined correctly leads to:
1. elimination of some nodes because no users are detected. (See the added unit test)
2. Incorrect topological sort.

Fix get_free_symbol_uses for NopKernel, ConcatKernel, InputsKernel, and external kernels.
ComputedBuffer with NonOwningLayout is an interesting case:
when the layout is NonOwningLayout we need to access the layout of the actual view op base and
detect the symbols in it, because we use those symbols when we codegen the ComputedBuffer.
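
The first failure mode can be reproduced in a toy model: if a node does not report a symbol it reads, the node that defines that symbol looks unused and is dropped. The sketch below is a simplified illustration; `Node` and `eliminate_dead_nodes` are hypothetical and much simpler than inductor's IR and scheduler.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Node:
    """Hypothetical IR node: the symbols it defines and the ones it reads."""
    name: str
    defines: FrozenSet[str] = frozenset()
    free_symbol_uses: FrozenSet[str] = frozenset()
    is_output: bool = False


def eliminate_dead_nodes(nodes: List[Node]) -> List[str]:
    """Keep outputs and any node that defines a symbol a kept node reads."""
    kept = []
    needed = set()
    # Walk in reverse so users are visited before the nodes that feed them.
    for node in reversed(nodes):
        if node.is_output or node.defines & needed:
            kept.append(node.name)
            needed |= node.free_symbol_uses
    return kept[::-1]


producer = Node("nonzero", defines=frozenset({"u0"}))
good_user = Node("view", free_symbol_uses=frozenset({"u0"}), is_output=True)
bad_user = Node("view", is_output=True)  # forgets to report its use of u0
```

With `good_user` the producer of `u0` survives; with `bad_user` it is eliminated even though the generated code would still reference `u0`.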

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160134
Approved by: https://github.com/bobrenjc93
2025-08-09 18:15:46 +00:00
29712314dd [fx][pass] Support converting a float32 tensor to a scalar in FX trace. (#158216)
Fixes https://github.com/pytorch/pytorch/issues/158083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158216
Approved by: https://github.com/laithsakka
2025-08-09 15:13:13 +00:00
cyy
01f66d08d9 Remove outdated CMAKE_CUDA_COMPILER_VERSION branch (#160075)
Remove the condition `if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.0)` in cmake/Codegen.cmake, because we now default to CUDA >= 12.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160075
Approved by: https://github.com/Skylion007
2025-08-09 14:23:17 +00:00
2f4c222617 Revert "Make user defined Triton kernels serializable for fx_graph_runnable (#160002)"
This reverts commit 4183d4ff3dcc1d87400326a9a7998c3f9e966f60.

Reverted https://github.com/pytorch/pytorch/pull/160002 on behalf of https://github.com/albanD due to Breaks inductor tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/160002#issuecomment-3170855866))
2025-08-09 14:01:58 +00:00
8047421fbb [Linter] Expanding the scope of detecting device-bias code. (#159949)
Currently, the device-bias linter only targets functions decorated with @requires_gpu. This PR adds support for two new detection scenarios:
1. Detect device-bias code in functions decorated with @requires_triton.
2. Detect device-bias code for entire test suites that are defined as shared across GPUs. For example:
```
if __name__ == "__main__":
    if HAS_GPU:
        run_tests()

```
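
The first scenario can be approximated with a short AST walk. This is a sketch under stated assumptions, not the linter in this PR: it only looks for hard-coded `"cuda"` string literals inside functions carrying one of the two decorators, and `find_device_bias` and `decorator_name` are hypothetical names.

```python
import ast
from typing import List, Tuple

GUARD_DECORATORS = {"requires_gpu", "requires_triton"}


def decorator_name(node: ast.expr) -> str:
    """Return the bare name of a decorator such as @foo, @foo(), or @mod.foo."""
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return ""


def find_device_bias(source: str) -> List[Tuple[str, int]]:
    """Report (function name, line) for hard-coded "cuda" strings in guarded tests."""
    findings = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        if not any(decorator_name(d) in GUARD_DECORATORS for d in func.decorator_list):
            continue
        for child in ast.walk(func):
            if isinstance(child, ast.Constant) and isinstance(child.value, str):
                if child.value == "cuda" or child.value.startswith("cuda:"):
                    findings.append((func.name, child.lineno))
    return findings


SAMPLE = '''
@requires_triton()
def test_add(self):
    x = torch.ones(4, device="cuda")

def test_unguarded(self):
    y = torch.ones(4, device="cuda")
'''
```

Covering the second scenario would additionally require recognizing the module-level `if HAS_GPU: run_tests()` pattern and then treating every test in the file as guarded.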

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159949
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-08-09 09:41:16 +00:00
4183d4ff3d Make user defined Triton kernels serializable for fx_graph_runnable (#160002)
Resolves issue https://github.com/pytorch/pytorch/issues/153475 where `fx_graph_runnable` didn't work with user defined triton kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160002
Approved by: https://github.com/eellison
2025-08-09 09:26:05 +00:00
fb887c3bb5 Add Sherlock and Zhengxu as codeowner for schema.py (#160233)
Test Plan:
CI

Rollback Plan:

Differential Revision: D79933462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160233
Approved by: https://github.com/zhxchen17
2025-08-09 04:44:12 +00:00
bcf23ecc47 [vllm hash update] update the pinned vllm hash (#160235)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160235
Approved by: https://github.com/pytorchbot
2025-08-09 04:17:32 +00:00
303c614f3d [dynamo] Be consistent with UserMethodVariable source (#160155)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160155
Approved by: https://github.com/StrongerXi
2025-08-09 04:16:14 +00:00
0d88593dd8 [audio hash update] update the pinned audio hash (#160153)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160153
Approved by: https://github.com/pytorchbot
2025-08-09 04:01:31 +00:00
5ed4f91779 [dynamo] support itertools.permutations (#159694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159694
Approved by: https://github.com/guilhermeleobas
ghstack dependencies: #159693
2025-08-09 03:01:58 +00:00
e07c52b2c0 [dynamo] Improve support for itertools.product (#159693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159693
Approved by: https://github.com/guilhermeleobas, https://github.com/mlazos
2025-08-09 03:01:58 +00:00
cyy
10e3514c96 Remove tensorexpr tests (#158928)
The tests are not maintained.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD, https://github.com/malfet
2025-08-09 02:21:22 +00:00
11a3565f18 [Torch Native] Add test for packaging weight (#158750)
Add a test that requires weights to be packaged for torch native.

For now, we need `package_weights_in_so=True` for compile standalone. The constants are in a `.o` file and will be added as a source to the CMakeLists.txt of the model.

Once weight deduping is added, we should be able to set this config to False.

```
python test/inductor/test_aot_inductor_package.py  -k test_compile_with_exporter_weights
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158750
Approved by: https://github.com/desertfire
2025-08-09 01:04:21 +00:00
e96c7c4bb0 [dcp][hf] Improve HF consolidation algorithm (#158648)
Before we had a bunch of if-else cases based on sharding strategy to decide how to save the tensor with different logic for different strategies. This can be consolidated into one function that uses an algorithm to handle all cases by finding the max possible contiguous bytes that can be written
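To illustrate the idea of "max possible contiguous bytes", here is a rough sketch under the assumption that both the full tensor and the shard are stored row-major and the shard is non-empty. The helper names `max_contiguous_elements` and `write_plan` are hypothetical and are not taken from the DCP code:

```python
from math import prod


def max_contiguous_elements(full_shape, shard_shape):
    """Largest run of shard elements that is also contiguous in the
    row-major full tensor (hypothetical helper)."""
    run = 1
    # Walk from the innermost dimension outwards.
    for full_dim, shard_dim in zip(reversed(full_shape), reversed(shard_shape)):
        run *= shard_dim
        if shard_dim != full_dim:
            # The shard covers only part of this dimension, so the run
            # cannot extend into the next outer dimension.
            break
    return run


def write_plan(full_shape, shard_shape, itemsize):
    """Return (bytes_per_write, number_of_writes) for one shard."""
    run = max_contiguous_elements(full_shape, shard_shape)
    return run * itemsize, prod(shard_shape) // run
```

For a `(4, 6)` tensor of 4-byte elements, a row shard of shape `(2, 6)` can be written in one 48-byte write, while a column shard of shape `(4, 3)` needs four 12-byte writes. The same rule covers both cases without branching on the sharding strategy.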

Differential Revision: [D78489438](https://our.internmc.facebook.com/intern/diff/D78489438/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158648
Approved by: https://github.com/saumishr
2025-08-09 00:11:22 +00:00
9b803cdbe2 [BE] Remove more optim entries from docs coverage ignore list (#160194)
This PR privatizes `ReduceLROnPlateau.is_better` -> `ReduceLROnPlateau._is_better` because that API was never meant to be public. A GitHub search also shows that the API is not commonly used: https://github.com/search?q=.is_better%28&type=code&p=2

If you do use this API and you rely on it for some reason, please file an issue. In the meantime, you can access it through `_is_better(...)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160194
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-08-09 00:09:45 +00:00
8c41cb800a [MPS][BE] Combine all pre-MacOS14 xfail lists (#160228)
It does not matter whether a test started to fail after 13.1 or 13.3;
what matters is that it still fails on the latest macOS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160228
Approved by: https://github.com/dcci
2025-08-09 00:00:46 +00:00
731ee31f7b [TorchScript, PT2] Add torch._check compatibility support (#159988)
Summary:
Add support for torch._check() in TorchScript jit.script frontend.

* It will be special cased to behave like torch._assert, turned into an if + raise exception.
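
As a rough illustration of what "an if + raise exception" means, the sketch below shows the desugared shape in plain Python. The `check` function is a hypothetical stand-in, and the exception type used here is an assumption, not the behavior confirmed by the PR:

```python
def check(cond: bool, message: str = "Expected cond to be True") -> None:
    """Hypothetical stand-in for the desugared form: an `if` plus a `raise`."""
    if not cond:
        # Assumed exception type, chosen only for illustration.
        raise RuntimeError(message)


def safe_div(a: int, b: int) -> float:
    # The call site reads like an assertion on the inputs.
    check(b != 0, "b must be non-zero")
    return a / b
```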

Test Plan:
Unit tests

Rollback Plan:

Differential Revision: D79744604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159988
Approved by: https://github.com/davidberard98
2025-08-08 23:14:13 +00:00
566c6d52ef [ONNX] Fix the export of the model having none as output (#160200)
Fixes #160150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160200
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-08-08 23:09:34 +00:00
4e2ddb5db6 [Inductor][CUTLASS] Copy cutlass_mock_imports directory (#159724)
Pip wheels of PyTorch nightly and 2.8 release candidates do not contain `cutlass_mock_imports`.

This is the path to the source code:
```
root@8120d02fd9c5:$ tree ./torch/_inductor/codegen/cuda/cutlass_lib_extensions/
./torch/_inductor/codegen/cuda/cutlass_lib_extensions/
├── cutlass_mock_imports
│   ├── cuda
│   │   ├── __init__.py
│   │   ├── cuda.py
│   │   └── cudart.py
│   ├── pydot
│   │   └── __init__.py
│   └── scipy
│       ├── __init__.py
│       └── special.py
├── evt_extensions.py
└── gemm_operation_extensions.py

5 directories, 8 files
```

And this is what the installed wheel has:
```
root@8120d02fd9c5:$ tree /usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/cuda/cutlass_lib_extensions/
/usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/cuda/cutlass_lib_extensions/
├── __init__.py
├── evt_extensions.py
└── gemm_operation_extensions.py

1 directory, 3 files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159724
Approved by: https://github.com/henrylhtsang
2025-08-08 22:56:05 +00:00
9e07673deb Fix test_fsdp_ep.py due to _MeshEnv API change (#158695)
#132339 changed the parent/child mesh APIs of `_MeshEnv`. The unit test `TestFSDPWithEP.test_e2e` still uses the old APIs and fails:
```
File "/home/kanya/pytorch/test/distributed/checkpoint/e2e/test_fsdp_ep.py", line 77, in test_e2e
    mesh_fsdp_ep = _mesh_resources.create_child_mesh(mesh_fsdp_tp, ("dp",))
AttributeError: '_MeshEnv' object has no attribute 'create_child_mesh'

To execute this test, run the following from the base repo dir:
    python test/distributed/checkpoint/e2e/test_fsdp_ep.py TestFSDPWithEP.test_e2e

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0. Did you mean: 'create_sub_mesh'?
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158695
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia
2025-08-08 22:36:47 +00:00
1128f4c2a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
Clean up tuple/tensor boilerplate in cuDNN SDPA, in preparation for nested/ragged tensor backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-08-08 22:22:48 +00:00
334ecbd4ff Add torchao to install_inductor_benchmark_deps cleanup stage (#160191)
It looks like `torchao` was missed from the cleanup during torchbench setup.

Fixes #160188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160191
Approved by: https://github.com/huydhn
2025-08-08 22:18:41 +00:00
206c1eef65 Revert "[pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#159655)"
This reverts commit 2ee22e435131369a7e4f8cc4732579acc29a941b.

Reverted https://github.com/pytorch/pytorch/pull/159655 on behalf of https://github.com/clee2000 due to broke dynamo/test_utils.py::TestDynamoTimed::test_dynamo_timed [GH job link](https://github.com/pytorch/pytorch/actions/runs/16839294394/job/47711078667) [HUD commit link](2ee22e4351).  Probably a landrace since it did run on the PR ([comment](https://github.com/pytorch/pytorch/pull/159655#issuecomment-3169400889))
2025-08-08 22:04:22 +00:00
28ccc9e724 [MPS] Extend index_put to complex types (#160159)
And delete confusing supported types check.
Move all pseudo atomic (but eventually consistent) ops to `c10/metal/atomic.h` header

Fixes https://github.com/pytorch/pytorch/issues/160034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160159
Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/Skylion007
2025-08-08 21:54:30 +00:00
2247aa6d1d Documents tuning NVLink performance on H100/H200 (#159792)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159792
Approved by: https://github.com/ngimel
2025-08-08 20:28:24 +00:00
1febab2a89 Do not treat ReinterpretView as a realized node (#159920)
Summary:
Do not treat ReinterpretView as a realized node

Function [gather_origins](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L888) calls `is_realized_node` to decide whether an FX node should be included in the origins of an IR node. ReinterpretView is considered a realized node, so it is not included in the origins, which leads to an incomplete graph. For example:

```
import torch
import torch._dynamo as torchdynamo

device = "cuda"  # assumed: the graphs below show cuda:0 tensors

@torchdynamo.optimize("inductor")
def fn(input_data, weight):
    normalized_input = input_data * weight.unsqueeze(0)
    return normalized_input

input_data = torch.randn(4272, 192, requires_grad=True).to(device)
weight = torch.randn(192, requires_grad=True).to(device)
fn(input_data, weight)
```

The original FX graph returned in [get_kernel_metadata](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L723) is the following:
%primals_2 : Tensor "f32[4272, 192][192, 1]cuda:0" = PlaceHolder[target=primals_2]
%primals_1 : Tensor "f32[192][1]cuda:0" = PlaceHolder[target=primals_1]
%mul : Tensor "f32[4272, 192][192, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %unsqueeze), kwargs = {})
return %mul
The unsqueeze op is missing.

With this DIFF, the new FX graph is the following:
%primals_2 : Tensor "f32[4272, 192][192, 1]cuda:0" = PlaceHolder[target=primals_2]
%primals_1 : Tensor "f32[192][1]cuda:0" = PlaceHolder[target=primals_1]
%unsqueeze : Tensor "f32[1, 192][192, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.unsqueeze.default](args = (%primals_1, 0), kwargs = {})
%mul : Tensor "f32[4272, 192][192, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %unsqueeze), kwargs = {})
return %mul

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159920
Approved by: https://github.com/mlazos
2025-08-08 20:13:35 +00:00
2ee22e4351 [pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#159655)
This change logs the stack trace of the code being compiled by Dynamo, improving visibility into what is compiled. It adds a stack_trace field to compilation metrics. This helps with debugging and analysis of Dynamo compilation behavior.
 Ref [D79287964](https://www.internalfb.com/diff/D79287964)

Test Plan:
$ python -m test_utils
Internal: ref [D79372519](https://www.internalfb.com/diff/D79372519)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159655
Approved by: https://github.com/c00w
2025-08-08 19:53:47 +00:00
c86040a8e6 [torch.export] Fix test_export_api_with_dynamic_shapes (#160164)
Summary: Update test KJT's dynamic_shapes to match the newly exported fields.

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test:test_export -- --exact 'caffe2/test:test_export - test_export_api_with_dynamic_shapes_cpp_runtime_nonstrict (caffe2.test.export.test_nativert.NativeRTTestExport)'
File changed: fbcode//caffe2/test/export/test_export.py
Buck UI:
https://www.internalfb.com/buck2/8247eaf8-eaf9-4876-95cb-7b4263d15ef2
Test UI:
https://www.internalfb.com/intern/testinfra/testrun/2533275093345198
Network: Up: 100KiB  Down: 0B  (reSessionID-72a2579f-df3f-4262-9aa3-de0db9687
Executing actions. Remaining 0/2
Command: test.
Time elapsed: 2:20.5s
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Rollback Plan:

Reviewed By: malaybag

Differential Revision: D79862872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160164
Approved by: https://github.com/angelayi, https://github.com/ezyang
2025-08-08 19:45:30 +00:00
72009ec6be [replicate][be] improved readability and cleaned up remaining DDP code (#160133)
**Summary**
As much of the ReplicateState functionality is copied from FSDPState, I fixed any remaining comments that incorrectly used FSDP instead of Replicate. In addition, instead of labeling modules FSDPModule or FSDPLinear, I have changed it so that it now uses Replicate____. Finally, I have removed some leftover code from the DDP implementation. I have included test cases to verify correctness.

**Test Case**
1. pytest test/distributed/_composable/test_replicate_with_fsdp.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160133
Approved by: https://github.com/mori360
ghstack dependencies: #160128
2025-08-08 19:42:23 +00:00
5f5f508aa8 [ROCm] Ck backend UX refactor (#152951)
Refactors how the enablement/disablement of CK Gemms and SDPA works.

- Adds USE_ROCM_CK_GEMM compile flag for enabling CK gemms.
- USE_ROCM_CK_GEMM is set to True by default on Linux
- Updates USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA.
- USE_ROCM_CK_SDPA is set to False by default
- (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release)
- Prevents these CK libraries from being used unless pytorch has been built specifically with the functionality AND is running on a system architecture that supports it.
- The getters for these library backends also perform validity checks in case the user changed the backend through an environment variable. If the choice is invalid (i.e. one of the conditions above does not hold), the backend is set to the current non-CK default.
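
The fallback rule in the last two bullets can be sketched as follows. This is only an illustration: `CkSupport`, `resolve_backend`, and the string backend names are hypothetical and do not correspond to the actual PyTorch getters:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CkSupport:
    built_with_ck: bool   # the build enabled the relevant CK flag
    arch_supported: bool  # the running GPU architecture can execute CK kernels


def resolve_backend(requested: str, default: str, support: CkSupport) -> str:
    """Return the backend to use, falling back to the non-CK default when
    CK was requested but is not usable (hypothetical helper)."""
    if requested != "ck":
        return requested
    if support.built_with_ck and support.arch_supported:
        return "ck"
    return default
```

Both conditions must hold for CK to be selected, so a request made through an environment variable on an unsupported build or architecture quietly resolves to the default.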

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/m-gallus

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-08-08 18:40:17 +00:00
da1f608ca3 Add UT for torch.accelerator memory-related API (#155200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155200
Approved by: https://github.com/albanD
ghstack dependencies: #138222, #152932
2025-08-08 17:41:22 +00:00
84f7e88aef Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following APIs will be added under torch.accelerator:
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-08-08 17:41:22 +00:00
d7114f05b1 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with the RFC [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace's accelerate, which contains [many if-else conditionals](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code) to select the device-specific call. We would like to introduce a generic API set under the torch.accelerator namespace to generalize these use cases.

<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>

```python
torch.xxx.empty_cache
```

</td>
<td>

```python
torch.accelerator.empty_cache
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_peak_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_peak_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_accumulated_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_accumulated_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_stats
```

</td>
<td>

```python
torch.accelerator.memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_allocated
```

</td>
<td>

```python
torch.accelerator.memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_allocated
```

</td>
<td>

```python
torch.accelerator.max_memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_reserved
```

</td>
<td>

```python
torch.accelerator.memory_reserved
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_reserved
```

</td>
<td>

```python
torch.accelerator.max_memory_reserved
```

</td>
</tr>

</table>
</div>

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.
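
The call path above can be mirrored in a small Python sketch. The real classes are C++; everything below (`FakeCudaAllocator`, the registry, and the module-level functions) is a hypothetical illustration of the base-class-plus-lookup pattern, not PyTorch code:

```python
from abc import ABC, abstractmethod


class DeviceAllocator(ABC):
    """Hypothetical Python mirror of the base class described above."""

    @abstractmethod
    def empty_cache(self) -> None: ...

    @abstractmethod
    def memory_allocated(self) -> int: ...


class FakeCudaAllocator(DeviceAllocator):
    """Toy backend that only tracks byte counters."""

    def __init__(self) -> None:
        self.cached = 1024
        self.allocated = 256

    def empty_cache(self) -> None:
        self.cached = 0

    def memory_allocated(self) -> int:
        return self.allocated


# One allocator per device type, looked up at call time.
_ALLOCATORS: dict[str, DeviceAllocator] = {}


def register_allocator(device_type: str, allocator: DeviceAllocator) -> None:
    _ALLOCATORS[device_type] = allocator


def get_device_allocator(device_type: str) -> DeviceAllocator:
    return _ALLOCATORS[device_type]


def empty_cache(device_type: str) -> None:
    # Device-agnostic entry point: dispatch through the base-class interface.
    get_device_allocator(device_type).empty_cache()


def memory_allocated(device_type: str) -> int:
    return get_device_allocator(device_type).memory_allocated()
```

Callers only see the device-agnostic functions, so adding a new backend means registering one more `DeviceAllocator` subclass rather than adding another if-else branch at every call site.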

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD, https://github.com/Camyll
2025-08-08 17:41:10 +00:00
c5ec5458a5 Don't build nccl when distributed is disabled (#160086)
Distributed does not build on recent compilers, so I have to disable it, but the build still fails because NCCL is built regardless.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160086
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-08-08 17:19:16 +00:00
86eb65f7f0 [MPS] Move max_pool2d to Metal for stride != 1 (#157876)
This PR updates `max_pool2d` to use a Metal kernel instead of the old MPS graph impl. However, when the `stride` argument is 1 in all dimensions, the old implementation gives significantly better performance, so we fall back to it in that case. Below is a performance comparison of `max_pool2d` before and after this PR, obtained from this script: 2f02f2bf7a/max_pool_mps/perf.py
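
The dispatch rule described above can be sketched as below. The helper names are hypothetical; the only outside fact relied on is that `max_pool2d` uses `kernel_size` as the stride when `stride` is `None`:

```python
def _as_pair(value):
    """Expand an int to a 2-tuple, or pass a sequence through as a tuple."""
    return (value, value) if isinstance(value, int) else tuple(value)


def use_metal_kernel(kernel_size, stride=None) -> bool:
    """Hypothetical predicate: True when the new Metal kernel would be used,
    False when the old MPS graph implementation is kept (stride 1 everywhere)."""
    # stride=None means "use kernel_size as the stride".
    effective = _as_pair(kernel_size if stride is None else stride)
    return any(s != 1 for s in effective)
```

This matches the pattern in the table: cases with `'stride': 1` show a speedup close to 1.0 because they still take the old path, while `'stride': None`, `2`, and `4` go through the new kernel.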

<details><summary>Click to expand</summary>

case | before PR | after PR | speedup |   | case info
-- | -- | -- | -- | -- | --
0 | 0.014264 | 0.004473 | 3.188911245 |   | (3, 2, 2), {'kernel_size': 2, 'return_indices': True}
1 | 0.010752 | 0.00421 | 2.55391924 |   | (3, 2, 2), {'kernel_size': 2, 'return_indices': False}
2 | 0.020777 | 0.006123 | 3.393271272 |   | (3, 10, 10), {'kernel_size': 5, 'return_indices': True}
3 | 0.011065 | 0.005759 | 1.921340511 |   | (3, 10, 10), {'kernel_size': 5, 'return_indices': False}
4 | 0.01452 | 0.007829 | 1.854642994 |   | (3, 100, 100), {'kernel_size': 5, 'return_indices': True}
5 | 0.009258 | 0.007075 | 1.308551237 |   | (3, 100, 100), {'kernel_size': 5, 'return_indices': False}
6 | 0.188137 | 0.168688 | 1.115295694 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': True}
7 | 0.161362 | 0.154746 | 1.042753932 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': False}
8 | 0.182883 | 0.16945 | 1.079274122 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': True}
9 | 0.156875 | 0.163346 | 0.9603847049 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': False}
10 | 0.193433 | 0.167396 | 1.155541351 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': True}
11 | 0.158967 | 0.151246 | 1.051049284 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': False}
12 | 0.931071 | 0.932883 | 0.9980576342 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': True}
13 | 0.324496 | 0.3252 | 0.9978351784 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': False}
14 | 0.944071 | 0.936246 | 1.008357846 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': True}
15 | 0.322171 | 0.314854 | 1.023239343 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': False}
16 | 0.894158 | 0.886408 | 1.008743152 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': True}
17 | 0.309338 | 0.304146 | 1.017070749 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': False}
18 | 0.606 | 0.260546 | 2.325884873 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': True}
19 | 0.30445 | 0.231054 | 1.317657344 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': False}
20 | 0.474708 | 0.261925 | 1.812381407 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': True}
21 | 0.23175 | 0.231883 | 0.9994264349 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': False}
22 | 0.434475 | 0.266246 | 1.631855502 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': True}
23 | 0.236942 | 0.231792 | 1.022218196 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': False}
24 | 0.202396 | 0.174888 | 1.157289237 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': True}
25 | 0.160679 | 0.158246 | 1.015374796 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': False}
26 | 0.200354 | 0.184133 | 1.088093932 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': True}
27 | 0.160779 | 0.160679 | 1.000622359 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': False}
28 | 0.199175 | 0.178625 | 1.115045486 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': True}
29 | 0.159458 | 0.160883 | 0.9911426316 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': False}
30 | 0.199021 | 0.165329 | 1.203787599 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': True}
31 | 0.156337 | 0.158213 | 0.9881425673 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': False}
32 | 0.180146 | 0.174483 | 1.032455884 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': True}
33 | 0.156988 | 0.158167 | 0.9925458534 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': False}
34 | 0.182133 | 0.176521 | 1.031792251 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': True}
35 | 0.169042 | 0.156483 | 1.080257919 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': False}
36 | 1.767821 | 1.766254 | 1.000887188 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': True}
37 | 1.059346 | 1.058775 | 1.000539302 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': False}
38 | 1.85755 | 1.859429 | 0.9989894747 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': True}
39 | 1.100417 | 1.097683 | 1.002490701 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': False}
40 | 1.843167 | 1.847558 | 0.9976233493 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': True}
41 | 1.090142 | 1.093163 | 0.9972364597 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': False}
42 | 0.480867 | 0.251733 | 1.910226311 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': True}
43 | 0.319246 | 0.236479 | 1.349997251 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': False}
44 | 0.49315 | 0.256408 | 1.923301925 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': True}
45 | 0.316746 | 0.227854 | 1.390127011 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': False}
46 | 0.4912 | 0.257762 | 1.905633879 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': True}
47 | 0.324771 | 0.229371 | 1.41592006 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': False}
48 | 0.152904 | 0.095079 | 1.608178462 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': True}
49 | 0.102963 | 0.089217 | 1.154073775 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': False}
50 | 0.155158 | 0.095429 | 1.625899884 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': True}
51 | 0.104338 | 0.089979 | 1.15958168 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': False}
52 | 0.153121 | 0.096429 | 1.587914424 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': True}
53 | 0.103642 | 0.090254 | 1.148336916 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': False}
54 | 0.191071 | 0.165125 | 1.157129447 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': True}
55 | 0.153971 | 0.149021 | 1.033216795 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': False}
56 | 0.193192 | 0.166892 | 1.157586942 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': True}
57 | 0.156617 | 0.15215 | 1.029359185 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': False}
58 | 0.178033 | 0.167308 | 1.06410333 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': True}
59 | 0.157425 | 0.164404 | 0.9575496947 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': False}
60 | 1.757638 | 1.750896 | 1.0038506 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': True}
61 | 1.048471 | 1.047967 | 1.000480931 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': False}
62 | 1.790708 | 1.789767 | 1.000525767 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': True}
63 | 1.054575 | 1.054796 | 0.9997904808 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': False}
64 | 1.785837 | 1.784192 | 1.000921986 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': True}
65 | 1.054713 | 1.054492 | 1.00020958 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': False}
66 | 0.478267 | 0.261017 | 1.832321266 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': True}
67 | 0.32005 | 0.226654 | 1.412064204 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': False}
68 | 0.484008 | 0.254721 | 1.900149575 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': True}
69 | 0.321 | 0.218842 | 1.466811672 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': False}
70 | 0.482087 | 0.248771 | 1.937874591 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': True}
71 | 0.316558 | 0.230533 | 1.373156988 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': False}
72 | 0.137842 | 0.085088 | 1.619993419 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': True}
73 | 0.100671 | 0.0769 | 1.309115735 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': False}
74 | 0.148321 | 0.086967 | 1.705485989 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': True}
75 | 0.101392 | 0.075454 | 1.343759112 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': False}
76 | 0.150208 | 0.083742 | 1.793699697 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': True}
77 | 0.099587 | 0.075825 | 1.313379492 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': False}
78 | 0.622546 | 0.602729 | 1.03287879 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': True}
79 | 0.531696 | 0.5067 | 1.049330965 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': False}
80 | 0.626646 | 0.617038 | 1.015571164 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': True}
81 | 0.530354 | 0.525367 | 1.009492412 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': False}
82 | 0.633933 | 0.577775 | 1.097197006 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': True}
83 | 0.533067 | 0.526954 | 1.011600633 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': False}
84 | 3.372867 | 3.386412 | 0.9960001914 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': True}
85 | 1.155975 | 1.156604 | 0.9994561665 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': False}
86 | 3.401921 | 3.39755 | 1.001286515 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': True}
87 | 1.202829 | 1.192538 | 1.008629494 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': False}
88 | 3.23675 | 3.220238 | 1.005127571 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': True}
89 | 1.077067 | 1.085613 | 0.9921279498 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': False}
90 | 1.572925 | 0.925625 | 1.699311276 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': True}
91 | 0.791204 | 0.793454 | 0.9971642969 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': False}
92 | 1.572742 | 0.922729 | 1.704446268 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': True}
93 | 0.784292 | 0.788871 | 0.9941955022 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': False}
94 | 1.526546 | 0.925708 | 1.649057802 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': True}
95 | 0.769321 | 0.787675 | 0.9766985114 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': False}
96 | 0.736033 | 0.612808 | 1.201082558 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': True}
97 | 0.574625 | 0.530925 | 1.082309177 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': False}
98 | 0.722021 | 0.614488 | 1.174996094 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': True}
99 | 0.563171 | 0.533721 | 1.055178642 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': False}
100 | 0.735725 | 0.613992 | 1.198264798 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': True}
101 | 0.583487 | 0.532513 | 1.095723485 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': False}
102 | 0.656383 | 0.575313 | 1.140914598 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': True}
103 | 0.559796 | 0.509079 | 1.099625009 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': False}
104 | 0.662046 | 0.572362 | 1.156691045 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': True}
105 | 0.552633 | 0.508671 | 1.086425214 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': False}
106 | 0.634108 | 0.574629 | 1.103508525 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': True}
107 | 0.534013 | 0.510996 | 1.045043405 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': False}
108 | 7.056642 | 7.066717 | 0.9985743026 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': True}
109 | 4.144275 | 4.142658 | 1.000390329 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': False}
110 | 7.172683 | 7.189867 | 0.9976099697 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': True}
111 | 4.162538 | 4.158875 | 1.000880767 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': False}
112 | 7.194233 | 7.181837 | 1.001726021 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': True}
113 | 4.294083 | 4.196062 | 1.023360236 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': False}
114 | 1.875692 | 0.891071 | 2.104986022 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': True}
115 | 1.097479 | 0.781175 | 1.404907991 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': False}
116 | 1.8883 | 0.89015 | 2.121327866 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': True}
117 | 1.101329 | 0.778542 | 1.414604479 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': False}
118 | 1.872833 | 0.893654 | 2.095702587 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': True}
119 | 1.096712 | 0.784579 | 1.397835017 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': False}
120 | 0.513029 | 0.374417 | 1.370207549 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': True}
121 | 0.349546 | 0.305763 | 1.143192603 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': False}
122 | 0.518929 | 0.377487 | 1.374693698 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': True}
123 | 0.364662 | 0.3145 | 1.159497615 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': False}
124 | 0.521275 | 0.375242 | 1.389170189 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': True}
125 | 0.367488 | 0.308354 | 1.191773092 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': False}
126 | 0.652342 | 0.569308 | 1.145850752 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': True}
127 | 0.555696 | 0.506892 | 1.096280865 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': False}
128 | 0.654333 | 0.570367 | 1.147213987 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': True}
129 | 0.548925 | 0.505825 | 1.085207335 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': False}
130 | 0.655908 | 0.571904 | 1.146884792 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': True}
131 | 0.560808 | 0.508238 | 1.103435792 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': False}
132 | 6.949462 | 6.949112 | 1.000050366 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': True}
133 | 4.072913 | 4.065013 | 1.001943413 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': False}
134 | 7.200896 | 7.197792 | 1.000431243 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': True}
135 | 4.291367 | 4.218538 | 1.017264038 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': False}
136 | 7.1823 | 7.306933 | 0.9829431856 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': True}
137 | 4.151175 | 4.149592 | 1.000381483 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': False}
138 | 1.781279 | 0.884288 | 2.014365229 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': True}
139 | 1.050804 | 0.774362 | 1.356993241 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': False}
140 | 1.860758 | 0.884637 | 2.103414169 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': True}
141 | 1.099908 | 0.775887 | 1.417613647 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': False}
142 | 1.857387 | 0.885738 | 2.096993693 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': True}
143 | 1.105279 | 0.77365 | 1.428655077 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': False}
144 | 0.489408 | 0.269583 | 1.815426047 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': True}
145 | 0.322525 | 0.236979 | 1.360985573 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': False}
146 | 0.515475 | 0.265813 | 1.93923924 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': True}
147 | 0.315525 | 0.228146 | 1.382995976 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': False}
148 | 0.503438 | 0.277204 | 1.816128194 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': True}
149 | 0.335421 | 0.228275 | 1.469372467 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': False}
150 | 5.72495 | 4.909554 | 1.166083518 |   | (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': None, 'return_indices': True}
151 | 4.45215 | 4.251333 | 1.047236243 |   | (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': None, 'return_indices': False}
152 | 29.953021 | 29.879879 | 1.002447868 |   | (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': True}
153 | 9.854683 | 9.839517 | 1.001541336 |   | (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': False}
154 | 6.178033 | 5.697375 | 1.084364817 |   | (10, 10, 1000, 1000), {'kernel_size': 100, 'padding': 50, 'return_indices': True}
155 | 6.280317 | 5.712525 | 1.099394226 |   | (10, 10, 1000, 1000), {'kernel_size': 100, 'padding': 50, 'return_indices': False}
156 | 10.256062 | 11.336527 | 0.9046917103 |   | (10, 10, 1000, 1000), {'kernel_size': 250, 'padding': 50, 'return_indices': True}
157 | 9.469546 | 11.33705 | 0.8352742556 |   | (10, 10, 1000, 1000), {'kernel_size': 250, 'padding': 50, 'return_indices': False}
158 | 0.119087 | 0.0797 | 1.494190715 |   | (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': True}
159 | 0.098713 | 0.047173 | 2.092574142 |   | (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': False}
160 | 0.960812 | 0.675762 | 1.421820108 |   | (10, 10, 300, 300), {'kernel_size': 2, 'return_indices': True}
161 | 0.536546 | 0.485958 | 1.104099531 |   | (10, 10, 300, 300), {'kernel_size': 2, 'return_indices': False}
162 | 2.555225 | 1.791567 | 1.426251432 |   | (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': True}
163 | 1.419087 | 1.305137 | 1.087308842 |   | (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': False}
164 | 5.182008 | 3.48085 | 1.488719135 |   | (10, 10, 700, 700), {'kernel_size': 2, 'return_indices': True}
165 | 2.831779 | 2.498537 | 1.133374851 |   | (10, 10, 700, 700), {'kernel_size': 2, 'return_indices': False}
166 | 8.546038 | 5.7783 | 1.478988284 |   | (10, 10, 900, 900), {'kernel_size': 2, 'return_indices': True}
167 | 4.731004 | 4.161975 | 1.136720908 |   | (10, 10, 900, 900), {'kernel_size': 2, 'return_indices': False}
168 | 0.084754 | 0.07435 | 1.139932751 |   | (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': True}
169 | 0.057933 | 0.043096 | 1.344277891 |   | (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': False}
170 | 2.568592 | 1.802117 | 1.425319222 |   | (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': True}
171 | 1.433054 | 1.307342 | 1.096158465 |   | (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': False}
172 | 10.3213 | 7.111604 | 1.451332217 |   | (10, 10, 1000, 1000), {'kernel_size': 2, 'return_indices': True}
173 | 5.680525 | 5.168129 | 1.099145358 |   | (10, 10, 1000, 1000), {'kernel_size': 2, 'return_indices': False}
174 | 1.02255 | 1.01375 | 1.008680641 |   | (10, 1000, 1000), {'kernel_size': 2, 'padding': 1, 'stride': 1, 'return_indices': False}
175 | 3.074233 | 3.094383 | 0.993488201 |   | (10, 1000, 1000), {'kernel_size': 2, 'padding': 1, 'stride': 1, 'return_indices': True}
176 | 1.016812 | 1.030575 | 0.9866453194 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': False}
177 | 3.053658 | 3.089504 | 0.9883974903 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': True}
178 | 1.025863 | 1.032088 | 0.9939685376 |   | (10, 1000, 1000), {'kernel_size': 8, 'padding': 1, 'stride': 1, 'return_indices': False}
179 | 3.798942 | 3.799213 | 0.9999286694 |   | (10, 1000, 1000), {'kernel_size': 8, 'padding': 1, 'stride': 1, 'return_indices': True}
180 | 4.492979 | 4.493421 | 0.999901634 |   | (10, 1000, 1000), {'kernel_size': 16, 'padding': 1, 'stride': 1, 'return_indices': False}
181 | 51.543363 | 51.266204 | 1.005406271 |   | (10, 1000, 1000), {'kernel_size': 16, 'padding': 1, 'stride': 1, 'return_indices': True}
182 | 1.018008 | 1.001587 | 1.016394981 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 1), 'return_indices': False}
183 | 3.035404 | 3.003113 | 1.010752509 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 1), 'return_indices': True}
184 | 0.610421 | 0.56 | 1.0900375 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 4), 'return_indices': False}
185 | 1.138983 | 0.757296 | 1.504012962 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 4), 'return_indices': True}
186 | 0.641558 | 0.557808 | 1.150141267 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (4, 1), 'return_indices': False}
187 | 1.181475 | 0.754725 | 1.565437742 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (4, 1), 'return_indices': True}
188 | 1.03045 | 1.026904 | 1.003453098 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 1), 'return_indices': False}
189 | 3.041421 | 3.0263 | 1.00499653 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 1), 'return_indices': True}
190 | 0.609929 | 0.572304 | 1.065743032 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 4), 'return_indices': False}
191 | 1.146875 | 0.756446 | 1.516135983 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 4), 'return_indices': True}
192 | 0.645187 | 0.561708 | 1.148616363 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (4, 1), 'return_indices': False}
193 | 1.181721 | 0.758054 | 1.558887625 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (4, 1), 'return_indices': True}
194 | 0.927654 | 0.925946 | 1.0018446 |   | (10, 1000, 1000), {'kernel_size': 1, 'return_indices': False}
195 | 2.749983 | 2.740354 | 1.00351378 |   | (10, 1000, 1000), {'kernel_size': 1, 'return_indices': True}

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157876
Approved by: https://github.com/malfet
2025-08-08 16:40:10 +00:00
a4f69a5da0 [dynamo][guards] Remove guards on stdlib modules (#159913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159913
Approved by: https://github.com/StrongerXi
2025-08-08 16:26:04 +00:00
231c72240d CMake build: preserve PYTHONPATH (#160144)
Fixes #160092

I'm very new to CMake, so let me know if there's a fancier way to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160144
Approved by: https://github.com/malfet

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2025-08-08 16:03:49 +00:00
50f23ff6f8 rename-HAS_CUDA-to-HAS_CUDA_AND_TRITON (#159883)
Fixes #159399
"Modified torch.testing._internal.inductor_utils and test/inductor"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159883
Approved by: https://github.com/janeyx99
2025-08-08 15:44:52 +00:00
8a37f0c903 improve gather and scatter_add strategy (#160140)
As title.

This PR made a small fix on top of https://github.com/meta-pytorch/autoparallel/pull/81.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160140
Approved by: https://github.com/fmassa
2025-08-08 15:06:24 +00:00
b5fd7223b1 Improve pin_memory error message on CPU-only systems (#159994)
## Summary
- clarify pin_memory error message when no accelerator backend is available

## Testing
- `python repro_pin_memory.py` (fails: Need to provide pin_memory allocator to use pin memory)
- `lintrunner -a`

------
https://chatgpt.com/codex/tasks/task_e_6893ba92c93483238a9bdfdd6c52812b
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159994
Approved by: https://github.com/albanD
2025-08-08 14:36:45 +00:00
9fa8ce26cf Working setup with runnable PyTorch on Codex. (#159968)
Sample transcript: https://chatgpt.com/s/cd_68938effc1a88191ae78bc82a8cefe94

This makes use of https://github.com/pytorch/pytorch/pull/159965 to bypass doing an actual build and use nightly.

Things to improve:
- Once USE_NIGHTLY is in main can remove the patching
- We should just keep using the latest nightly, instead of a hard coded one

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159968
Approved by: https://github.com/wdvr
2025-08-08 14:34:15 +00:00
62bac07981 [inductor][triton] support profile_scratch launcher arg (#159772)
This adds support for Triton after https://github.com/triton-lang/triton/pull/7258 landed. https://github.com/triton-lang/triton/pull/7258 adds a new argument to all the Triton kernels - a profile_scratch argument, similar to global_scratch. This PR updates the static cuda launcher and the AOTI kernel callers to pass in these arguments when calling the Triton kernel.

Tests: https://github.com/pytorch/pytorch/pull/159158. I also verified these tests locally with Triton 3.2, 3.3, and 3.4.

Fixes:
* static_cuda_launcher (test/repro: `python tools/dynamo/verify_dynamo.py`)
* AOTI calling logic (test/repro: `TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_linalg_vander_cuda_float32`)

Differential Revision: [D79825121](https://our.internmc.facebook.com/intern/diff/D79825121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159772
Approved by: https://github.com/NikhilAPatel, https://github.com/eellison
2025-08-08 14:27:38 +00:00
7f4cb4a3e0 [MPS] coalesce for sparse tensors (#159729)
MPS coalesce function for sparse tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159729
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-08 13:49:55 +00:00
556e2a73f4 [Test][Easy] Use float16 dtype in test_sort_large (#159939)
The test fails with:
>RuntimeError: var_mean only support floating point and complex dtypes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159939
Approved by: https://github.com/eqy
2025-08-08 09:56:44 +00:00
178515d0ff [BE][PYFMT] remove black: finish black -> ruff format migration (#144557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144557
Approved by: https://github.com/ezyang
2025-08-08 07:46:10 +00:00
3a56237440 [SymmMem] Send tensors with unerased type information to NVSHMEM Triton kernels (#159788)
This PR introduces a small `@triton.jit` wrapper function over our core NVSHMEM extern functions for users to send tensors as inputs to their NVSHMEM Triton kernels (rather than pointers).

The goal is to abstract away tedious details from the developer, like manual byte-size calculations and handling of raw `int64` pointers. This lets developers work directly with typed Triton tensors and element counts, which is also useful for doing local math on the data, for instance.

-----

**TODO:**
This is almost complete. One pending item is a tensor-aware implementation of `nvshmem.putmem_signal_block` and `nvshmem.signal_wait_until`.

From my investigation, the root cause is that this specific tensor API uses local addresses instead of remote addresses for the peer:

```
Pointer-Based Version:

  Rank 0 → Rank 1:
    Local buffer:   0x430300a00  (src)
    Remote buffer:  0x2430300c00 (dst) ← Rank 1's memory
    Remote signal:  0x2430301600 (sig) ← Rank 1's signal

  Rank 1 (waiting):
    Local signal:   0x430301600 (waits here)

Tensor-Based Version:

  Rank 0 → Rank 1:
    Local buffer:   0x430300a00  (src)
    Local buffer:   0x430300c00  (dst) ← this is wrong
    Local signal:   0x430300e00  (sig) ← this is wrong

  Rank 1 (waiting):
    Local signal:   0x430300e00 (waits here)

```

Next steps: we need a mechanism to resolve a local tensor to its remote PE address, equivalent to the `handle.buffer_ptrs[peer]` lookup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159788
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755, #159756
2025-08-08 05:20:42 +00:00
e0d8a315c5 [SymmMem] Add helpful docstrings for all NVSHMEM APIs (#159756)
Fed Claude Code NVSHMEM Documentation and asked it to generate helpful docstrings. Verified for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159756
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734, #159755
2025-08-08 05:20:42 +00:00
bfff2e3592 [SymmMem] Refactor NVSHMEM Reduction API to be more ergonomic with automatic dtype‐based dispatch (#159755)
This change introduces a single, generic Triton‐extern wrapper for NVSHMEM team‐based reductions. We now expose one function, `nvshmem.reduce(team, dest, source, nreduce, operation, dtype_id)`, that covers all supported ops (sum, max, min, prod) and dtypes (int8…int64, uint8…uint64, float16, bfloat16, float32, float64).

It accepts real dtype objects (torch.dtype or tl.dtype) directly in the Triton kernel launch. Internally, we normalize dtype_id (handling tl.dtype, torch.dtype, str, or constexpr) into the canonical NVSHMEM typename and assemble the proper function name, e.g. `nvshmem_float_sum_reduce` or `nvshmem_bfloat16_prod_reduce`.
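
The name assembly described above can be sketched roughly as follows. This is only an illustration, not the code from the PR: the helper `reduce_function_name`, the `_TYPENAMES` table, and the string-based normalization are all assumptions, and the table covers just a subset of the supported dtypes.

```python
# Assumed mapping from dtype names to NVSHMEM typenames (illustrative subset).
_TYPENAMES = {
    "float16": "half",
    "bfloat16": "bfloat16",
    "float32": "float",
    "float64": "double",
    "int32": "int32",
    "int64": "int64",
}
_OPS = {"sum", "max", "min", "prod"}


def reduce_function_name(dtype, operation: str) -> str:
    """Build an NVSHMEM reduction symbol name (hypothetical helper)."""
    # Accept "torch.float32"-style reprs as well as plain "float32" strings.
    key = str(dtype).rsplit(".", maxsplit=1)[-1]
    if key not in _TYPENAMES:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    if operation not in _OPS:
        raise ValueError(f"unsupported operation: {operation!r}")
    return f"nvshmem_{_TYPENAMES[key]}_{operation}_reduce"


print(reduce_function_name("torch.float32", "sum"))
print(reduce_function_name("bfloat16", "prod"))
```

The two calls produce the two names quoted in the description; validating the operation up front turns a typo into a Python error rather than an unresolved extern symbol.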

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159755
Approved by: https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701, #159734
2025-08-08 05:20:36 +00:00
1c881440f4 [SymmMem] Initialize NVSHMEM module only for kernels that have nvshmem in their name (#159734)
Previously, a global post-compile hook initialized the NVSHMEM module for all Triton kernels, which was inefficient. This change calls `_nvshmemx_cumodule_init(kernel.module)` only for Triton kernels containing "nvshmem" in their name. Also updated the names of all of our NVSHMEM kernels to align with this.
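
A minimal sketch of the name-based filtering, assuming a hook that receives a kernel name and a module handle. The factory `make_post_compile_hook` and this hook signature are hypothetical, and a plain callable stands in for `_nvshmemx_cumodule_init`.

```python
from typing import Callable


def make_post_compile_hook(init_module: Callable[[object], None]) -> Callable[[str, object], None]:
    """Return a hook that initializes only NVSHMEM kernels (hypothetical helper)."""

    def hook(kernel_name: str, module: object) -> None:
        # Skip the initialization for kernels that do not use NVSHMEM.
        if "nvshmem" in kernel_name:
            init_module(module)

    return hook


initialized = []  # records which modules the stand-in initializer saw
hook = make_post_compile_hook(initialized.append)
hook("nvshmem_put_kernel", "module_a")
hook("add_kernel", "module_b")
print(initialized)
```

Only `module_a` is recorded, which is why the kernel names had to be updated to carry the "nvshmem" substring.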

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159734
Approved by: https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215, #159701
2025-08-08 05:20:29 +00:00
7c4f7b9340 [SymmMem] Add Triton 3.4 support to NVSHMEM Triton and fix CI tests (make device library discoverable + fix peer calculation bug) (#159701)
This PR introduces support for Triton 3.4 and resolves several CI and test-related issues.

**Triton 3.4 Compatibility**
- The JIT post-compile hook has been updated from the legacy JITFunction.compiled_hook to the new API path at triton.knobs.runtime.jit_post_compile_hook.
- The internal parameter for kernel semantics in extern function definitions has been updated from _semantic to _builder to align with API changes.

**Fix CI Errors**
- The new logic inspects the RPATH of libtorch_nvshmem.so to find the NVSHMEM device library, preventing CI tests from being skipped.
- Added a decorator to run NVSHMEM tests only on H100s (compatible hardware).

**Peer Rank Calculation Fix**
- The peer calculation in test_nvshmem_triton.py was changed from peer = (world_size - 1) - rank to peer = 1 - rank.
Reasoning: The previous logic was only valid for a 2-rank setup. In the 8-rank CI environment, it incorrectly mapped peers (e.g., rank 0 to 7), breaking tests that assume a 0↔1 communication pattern. This was reproduced and validated on an 8-rank dev setup.
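
To make the difference concrete, the two formulas can be compared side by side. The function names below are illustrative only, and the new formula is meaningful only for ranks 0 and 1, which is the pairing the tests assume.

```python
def old_peer(rank: int, world_size: int) -> int:
    # Previous formula: mirrors the rank across the whole world.
    return (world_size - 1) - rank


def new_peer(rank: int) -> int:
    # Updated formula: pairs rank 0 with rank 1 regardless of world size.
    return 1 - rank


for world_size in (2, 8):
    old = [old_peer(r, world_size) for r in (0, 1)]
    new = [new_peer(r) for r in (0, 1)]
    print(f"world_size={world_size}: old={old} new={new}")
```

With two ranks both formulas agree; with eight ranks the old formula sends rank 0 to rank 7 and rank 1 to rank 6, while the new one keeps the 0↔1 pairing.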

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159701
Approved by: https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136, #159215
2025-08-08 05:20:22 +00:00
1783d6e966 [SymmMem] Fix flaky wait_until test (#159215)
When playing around with it, I noticed some flakiness in this test across sessions.

After debugging, it turns out that the heavy sync primitives I was calling from inside Triton kernels (like `nvshmem_quiet()` or `nvshmem_fence()`) were causing deadlocks. The original test tried to guarantee ordering: `put(data) -> fence/quiet -> put(flag)`. But the GPU thread got stuck in `quiet()` waiting for network confirmation while holding the SM, creating a deadlock.

The fix was realizing `wait_until` already provides all the sync you need. Just do:
- PE A: `nvshmem_wait_until(&ivar, ...)`
- PE B: `nvshmem_put(&ivar_on_PE_A, ...)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159215
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718, #159136
2025-08-08 05:20:16 +00:00
ea7fe0ecf6 [SymmMem] Standardize NVSHMEM Triton wrappers on byte-based APIs + improve code clarity (#159136)
Quick refactor for consistency and clarity.

1. We now standardize all NVSHMEM data-moving collectives (put, get, alltoall, broadcast) to use their byte-based *_mem_block variants. This makes the API behavior more predictable and avoids mixing paradigms.

2. Previously, some functions operated on element counts (nelems), while others expected byte sizes but still used `nelems` as the param name. That inconsistency was easy to miss and could lead to bugs, especially for devs not familiar with the NVSHMEM internals.

To clean this up:
- All byte-based APIs now use nbytes or nbytes_per_pe to make the units explicit.
- Typed APIs consistently use nelems for element counts.
- Docstrings were added or updated to clarify expected units.
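
As a small illustration of why the unit matters, here is how an element count converts to a byte count. The helper `nbytes_for` and the `ITEMSIZE` table are illustrative and not part of the wrappers.

```python
# Size in bytes of one element for a few common dtypes.
ITEMSIZE = {"int8": 1, "float16": 2, "float32": 4, "int64": 8}


def nbytes_for(nelems: int, dtype: str) -> int:
    """Convert an element count into the byte count a byte-based API expects."""
    return nelems * ITEMSIZE[dtype]


# The same 4 elements are 16 bytes as float32 but 32 bytes as int64,
# so passing nelems where nbytes is expected would under-copy.
print(nbytes_for(4, "float32"), nbytes_for(4, "int64"))
```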

Also did some code cleanup — removed unused functions, fixed typos in comments, and did some general housekeeping.

This should make the API more intuitive and reduce friction for developers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159136
Approved by: https://github.com/mandroid6, https://github.com/ngimel
ghstack dependencies: #158515, #158718
2025-08-08 05:20:09 +00:00
b0b229b197 [SymmMem] Use _get_default_group() instead of group.WORLD for group_name access (#158718)
Both approaches functionally return the default process group created by `init_process_group()` but `_get_default_group()` is a dedicated function with [better error handling and type safety](4869f71170/torch/distributed/distributed_c10d.py (L1300-L1310)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158718
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
ghstack dependencies: #158515
2025-08-08 05:20:02 +00:00
b5c937259b [SymmMem] Add NVSHMEM Reduction support (sum, min, max) into Triton (#158515)
Implements sum_reduce, min_reduce, and max_reduce collective operations for NVSHMEM Triton kernels. Enables parallel reduction computations across PE teams for int64 data types.

Tests: `python test/distributed/test_nvshmem_triton.py`

<details>
<summary> Quick debug print for sanity check </summary>

```markdown
============================================================
[Rank 1] Starting min/max reduction test with world_size=2
============================================================
============================================================
[Rank 0] Starting min/max reduction test with world_size=2
============================================================
[Rank 0] Source data for min/max: [10, 20]
[Rank 1] Source data for min/max: [15, 5]
[Rank 1] All values across PEs:
[Rank 0] All values across PEs:
  - Position 0: [10, 15]
  - Position 0: [10, 15]
  - Position 1: [20, 5]
  - Position 1: [20, 5]
[Rank 1] Expected min: [10, 5]
[Rank 0] Expected min: [10, 5]
[Rank 1] Expected max: [15, 20]
[Rank 0] Expected max: [15, 20]
[Rank 0] Executing MIN reduction...
[Rank 1] Executing MIN reduction...
[Rank 0] Executing MAX reduction...
[Rank 1] Executing MAX reduction...
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 1] Results:
[Rank 0] Results:
[Rank 1] MIN reduction result: [10, 5]
[Rank 1] MAX reduction result: [15, 20]
[Rank 0] MIN reduction result: [10, 5]
[Rank 0] MAX reduction result: [15, 20]
[Rank 1] ============================================================
[Rank 1] Min/Max reduction test PASSED ✓
[Rank 1] ============================================================
[Rank 0] ============================================================
[Rank 0] Min/Max reduction test PASSED ✓
[Rank 0] ============================================================
......
============================================================
============================================================
[Rank 0] Starting sum reduction test with world_size=2
[Rank 1] Starting sum reduction test with world_size=2
============================================================
============================================================
[Rank 0] Configuration:
[Rank 1] Configuration:
  - nreduce: 3 (number of separate reductions)
  - nreduce: 3 (number of separate reductions)
  - dtype: torch.int64
  - dtype: torch.int64
[Rank 1] Source data: [2, 4, 6]
[Rank 1] Contribution explanation:
[Rank 0] Source data: [1, 2, 3]
[Rank 0] Contribution explanation:
  - Element 0: 2 = (rank=1+1) * (index=0+1)
  - Element 0: 1 = (rank=0+1) * (index=0+1)
  - Element 1: 4 = (rank=1+1) * (index=1+1)
  - Element 1: 2 = (rank=0+1) * (index=1+1)
  - Element 2: 6 = (rank=1+1) * (index=2+1)
  - Element 2: 3 = (rank=0+1) * (index=2+1)
[Rank 1] Initial destination: [-1, -1, -1]
[Rank 0] Initial destination: [-1, -1, -1]
[Rank 0] Expected results after reduction: [3, 6, 9]
[Rank 1] Expected results after reduction: [3, 6, 9]
[Rank 0] Executing sum reduction...
[Rank 1] Executing sum reduction...
[Rank 1] Sum reduction completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 0] Sum reduction completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 0] Results after reduction:
[Rank 0] Destination buffer: [3, 6, 9]
[Rank 1] Results after reduction:
[Rank 0] Verification:
  - Reduction 0: PE0: 1 + PE1: 2 = 3
    Result: 3, Match: ✓
  - Reduction 1: PE0: 2 + PE1: 4 = 6
    Result: 6, Match: ✓
[Rank 1] Destination buffer: [3, 6, 9]
  - Reduction 2: PE0: 3 + PE1: 6 = 9
[Rank 1] Verification:
  - Reduction 0: PE0: 1 + PE1: 2 = 3
    Result: 9, Match: ✓
    Result: 3, Match: ✓
  - Reduction 1: PE0: 2 + PE1: 4 = 6
    Result: 6, Match: ✓
  - Reduction 2: PE0: 3 + PE1: 6 = 9
    Result: 9, Match: ✓
[Rank 0] ============================================================
[Rank 0] Sum reduction test PASSED ✓
[Rank 0] All 3 reductions computed correctly across 2 PEs
[Rank 0] ============================================================
[Rank 1] ============================================================
[Rank 1] Sum reduction test PASSED ✓
[Rank 1] All 3 reductions computed correctly across 2 PEs
[Rank 1] ============================================================
```

</details>
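
For reference, the expected values in the printout above can be reproduced on the host with a plain element-wise reduction. This is only a CPU-side sketch of the semantics, not the NVSHMEM kernels, and `reduce_across_pes` is an illustrative name.

```python
import operator
from functools import reduce


def reduce_across_pes(per_pe, op):
    """Combine position i of every PE's source buffer with the binary op."""
    return [reduce(op, values) for values in zip(*per_pe)]


min_max_sources = [[10, 20], [15, 5]]    # rank 0, rank 1 (from the log)
sum_sources = [[1, 2, 3], [2, 4, 6]]     # rank 0, rank 1 (from the log)

print(reduce_across_pes(min_max_sources, min))       # expected min in the log: [10, 5]
print(reduce_across_pes(min_max_sources, max))       # expected max in the log: [15, 20]
print(reduce_across_pes(sum_sources, operator.add))  # expected sum in the log: [3, 6, 9]
```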

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158515
Approved by: https://github.com/mandroid6, https://github.com/ngimel
2025-08-08 05:19:55 +00:00
24257f5bfa [vllm hash update] update the pinned vllm hash (#159822)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159822
Approved by: https://github.com/pytorchbot
2025-08-08 04:13:48 +00:00
017259f9c6 [benchmarks] Add nativert benchmark (#159922)
Add NativeRT as an option in the PT2 OSS benchmark

```
python ./benchmarks/dynamo/huggingface.py --performance --inference --export-nativert

python ./benchmarks/dynamo/timm_models.py --performance --inference --export-nativert

python ./benchmarks/dynamo/torchbench.py --performance --inference --export-nativert
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159922
Approved by: https://github.com/angelayi
2025-08-08 03:38:32 +00:00
2ea40fba84 [Linter] Improve device-bias linter by adding detection for with torch.device("cuda"). (#159926)
```
For example, detect the following situation:
>>>Lint for test/dynamo/test_modes.py:
  Error (TEST_DEVICE_BIAS) [device-bias]
    `@requires_gpu` function should not hardcode `with torch.device('cuda')`,
    suggest to use torch.device(GPU_TYPE)

        687  |            flex_attention as flex_attention_eager,
        688  |        )
        689  |
    >>> 690  |        with torch.device("cuda"):
        691  |            flex_attention = torch.compile(flex_attention_eager, dynamic=False)
        692  |
        693  |            with self.assertRaisesRegex(
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159926
Approved by: https://github.com/EikanWang, https://github.com/jansel
ghstack dependencies: #159759
2025-08-08 03:20:42 +00:00
beb4d7816d [BE]: ruff PLC0207 - use maxsplit kwarg (#160107)
Automatically replaces `split` with `rsplit` where relevant and passes `maxsplit` so that the split stops at the first (or last) separator. This lets the split return early and improves efficiency.
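
As a rough illustration of the pattern this rule targets, the sketch below contrasts an unbounded split with a bounded one. The helper names `first_field` and `module_and_attr` are made up for this example and do not come from the PR.

```python
def first_field(line: str) -> str:
    """Return the text before the first comma."""
    # Before: line.split(",")[0] splits the whole string and discards the rest.
    # After: maxsplit=1 stops scanning at the first separator.
    return line.split(",", maxsplit=1)[0]


def module_and_attr(qualified_name: str) -> tuple[str, str]:
    """Split 'pkg.mod.attr' into ('pkg.mod', 'attr')."""
    # Before: qualified_name.split(".")[-1] needs every piece just to get the last.
    # After: rsplit with maxsplit=1 only looks for the last separator.
    module, attr = qualified_name.rsplit(".", maxsplit=1)
    return module, attr


if __name__ == "__main__":
    print(first_field("a,b,c"))
    print(module_and_attr("torch.nn.functional"))
```

Both forms return the same value as the unbounded version; only the amount of work changes.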

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160107
Approved by: https://github.com/albanD
2025-08-08 03:14:59 +00:00
3fcd79e023 Fix infinite loop when iterating over an empty zip (#159673)
Dynamo would enter infinite recursion when
`ZipVariable.next_variable(tx)` was called and there were no iterables to
iterate over.
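
For reference, the eager behavior that tracing has to reproduce is that an empty `zip()` is exhausted immediately. The class below is a toy stand-in written for this note, not Dynamo's `ZipVariable`; it only sketches how an explicit empty-case guard avoids looping forever.

```python
class ToyZip:
    """Toy zip-like iterator (hypothetical; for illustration only)."""

    def __init__(self, *iterables):
        self.iterators = [iter(it) for it in iterables]

    def __iter__(self):
        return self

    def __next__(self):
        # With no underlying iterators there is nothing to advance, so the
        # iterator must report exhaustion instead of yielding empty tuples.
        if not self.iterators:
            raise StopIteration
        items = []
        for it in self.iterators:
            items.append(next(it))  # StopIteration from any input ends the zip
        return tuple(items)


if __name__ == "__main__":
    print(list(zip()))
    print(list(ToyZip()))
    print(list(ToyZip([1, 2], "ab")))
```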

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159673
Approved by: https://github.com/williamwen42
2025-08-08 02:50:21 +00:00
05c417715f integrate kernacle into inductor (#160121)
This integrates kernacle into inductor in two parts:

1) It kicks off the best config lookup at lowering time within mm.py
2) It awaits the future at scheduling time in select_algorithm.py

Notably, this does not do the following:

1) Support for enumerating between mm, addmm and bmm
2) Support for enumerating between exhaustive/max
3) Enumerating different hardware SKUs, e.g. H100, A100, etc.

Those will come in the next diffs.
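
The two-part shape described above (start a lookup early, wait for it later) can be sketched with the standard library. Everything here is an assumption made for illustration: `lookup_best_config`, `lower_mm`, and `schedule` are hypothetical names, and the real integration in mm.py and select_algorithm.py may look quite different.

```python
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)


def lookup_best_config(m: int, n: int, k: int) -> dict:
    # Hypothetical stand-in for a (possibly slow) best-config lookup.
    return {"BLOCK_M": 64 if m >= 64 else 16, "BLOCK_N": 64, "BLOCK_K": 32}


def lower_mm(m: int, n: int, k: int) -> Future:
    # "Lowering time": kick off the lookup and return immediately.
    return _executor.submit(lookup_best_config, m, n, k)


def schedule(pending: Future, timeout_s: float = 5.0) -> dict:
    # "Scheduling time": block only now, when the result is actually needed.
    return pending.result(timeout=timeout_s)


if __name__ == "__main__":
    fut = lower_mm(128, 128, 64)
    # ... other lowering work could overlap with the lookup here ...
    print(schedule(fut))
```

Separating submission from the wait lets the lookup latency overlap with the work done between lowering and scheduling.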

Differential Revision: [D79824921](https://our.internmc.facebook.com/intern/diff/D79824921/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160121
Approved by: https://github.com/izaitsevfb
2025-08-08 02:14:44 +00:00
ba4ccf5d67 turn on execution frame cleanup by default (#160110)
Summary: Turning execution frame cleanup back on since D78621408 is done

Test Plan:
See D78621408

Rollback Plan:

Differential Revision: D79730674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160110
Approved by: https://github.com/jingsh
2025-08-08 02:13:48 +00:00
d68c323692 Log max_autotune exceptions (#159687) (#159688)
Summary:

Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures.

Currently, exceptions are dumped to the console in the following format:
```
[0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
[0/0] Runtime error during autotuning:
[0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help..
[0/0] Ignoring this choice.
```

The exception tracebacks:
```
# inner exception
traceback:
  File "/torch/_inductor/runtime/triton_heuristics.py", line 603, in _make_launchers
    launchers.append(result.make_launcher())
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/torch/_inductor/runtime/triton_heuristics.py", line 1503, in make_launcher
    self.kernel.load_kernel(device)
  File "/torch/_inductor/runtime/static_cuda_launcher.py", line 113, in load_kernel
    (self.function, self.n_regs, self.n_spills) = _StaticCudaLauncher._load_kernel(

# wrapped exception
traceback:
  File "/usr/local/fbcode/platform010/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 2596, in precompile_with_captured_stdout
    choice.precompile()
  File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 1881, in precompile
    self.bmreq.precompile()
  File "<trimmed>#link-tree/torch/_inductor/autotune_process.py", line 660, in precompile
    getattr(mod, self.kernel_name).precompile()
  File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 440, in precompile
    self._make_launchers()
  File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 608, in _make_launchers
    raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
```

With this change, the exception details will also be logged in the metadata of the `{name}_template_precompiling` event.

The format:
```
{
  "exceptions": [
    {
      "choice_type": "triton",
      "choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0",
      "exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.",
      "exception": "OutOfMemoryError",
      "required_memory": "262144",
      "hardware_limit": "232448"
    }
  ]
}
```
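
To make the relationship between the console message and the structured fields concrete, here is a minimal sketch that derives `exception`, `required_memory`, and `hardware_limit` from the message text. The regular expression and the `summarize_exception` helper are assumptions for this example; the PR's actual parsing logic is not shown in the description.

```python
import json
import re

# Assumed pattern, written against the sample message quoted above.
_OUT_OF_RESOURCE = re.compile(
    r"(?P<exc>\w+): out of resource: \S+ "
    r"Required: (?P<required>\d+) Hardware limit:\s*(?P<limit>\d+)"
)


def summarize_exception(choice_type: str, choice: str, message: str) -> dict:
    """Build one entry of the "exceptions" list (hypothetical helper)."""
    entry = {
        "choice_type": choice_type,
        "choice": choice,
        "exception_message": message,
    }
    match = _OUT_OF_RESOURCE.search(message)
    if match:
        entry["exception"] = match.group("exc")
        entry["required_memory"] = match.group("required")
        entry["hardware_limit"] = match.group("limit")
    return entry


if __name__ == "__main__":
    msg = (
        "No valid triton configs. OutOfMemoryError: out of resource: triton_mm "
        "Required: 262144 Hardware limit:232448 Reducing block sizes or "
        "`num_stages` may help."
    )
    print(json.dumps({"exceptions": [summarize_exception("triton", "BLOCK_K=128", msg)]}, indent=2))
```

Messages that do not match the pattern keep only the raw `exception_message`, so nothing is lost for other failure types.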

Test Plan:
buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt

Rollback Plan:

Differential Revision: D79420953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159688
Approved by: https://github.com/stashuk-olek
2025-08-08 01:30:08 +00:00
03b254e49f Extend torch function support to ALL arguments, not just scalar type (but not insides of list) (#145089)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145089
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-08-07 23:43:53 +00:00
195b5c2e27 Revert "dynamo: Remove passing or deleted dynamo_expected_failures (#159691)"
This reverts commit 36f46d082a4954921cb8493223f000f2aab79ed7.

Reverted https://github.com/pytorch/pytorch/pull/159691 on behalf of https://github.com/izaitsevfb due to breaking dynamo tests ([comment](https://github.com/pytorch/pytorch/pull/159691#issuecomment-3166067241))
2025-08-07 22:55:51 +00:00
f077c2402e [replicate][be] improved readability of test case description (#160128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160128
Approved by: https://github.com/mori360
2025-08-07 22:51:58 +00:00
d46768db04 [MTIA] Allow users who know what they are doing to ignore all device mismatches in tracing and take a preferred device. (#159931)
Summary:
Device mismatches in tracing can most often be ignored. These are only logical mismatches, not physical ones.

Take any intermediate computation: it will not actually materialize in a compiled binary execution, so a device mismatch in the middle of the program is not real. The runtime will never materialize those tensors on the CPU device during execution, as they are temporary allocations.

If users know that their tensors at graph input are all on the correct device, they can ignore all tracing errors.

Users who know what they are doing should have an escape hatch to ignore any device mismatch in tracing.

Users can set
```
  torch._functorch.config.fake_tensor_prefer_device_type = 'mtia'
```
to forcefully override any mismatch and prefer the non-CPU device. This unblocks vLLM graph mode for MTIA.

Test Plan:
Added two unit tests.

Rollback Plan:

Differential Revision: D79698438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159931
Approved by: https://github.com/jansel
2025-08-07 22:37:15 +00:00
clr
36f46d082a dynamo: Remove passing or deleted dynamo_expected_failures (#159691)
partially generated with
```
for TESTCASE in $(ls | cut -f1 -d'.' | grep -v CPython | uniq); do if grep "$TESTCASE" -m 1 .. -r; then echo; else   sl rm "$TESTCASE"* ; fi; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159691
Approved by: https://github.com/xmfan
2025-08-07 21:41:50 +00:00
8147370733 Fix qembeddingbag_byte_prepack_meta to use sym_sizes (#159985)
Summary: In qembeddingbag_byte_prepack_meta, `weight.sizes()` would return concrete ints. We should use `.sym_sizes()` to return SymInts instead.

Test Plan:
CI

Rollback Plan:

Reviewed By: kqfu, henryoier

Differential Revision: D79744512

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159985
Approved by: https://github.com/jerryzh168, https://github.com/henryoier
2025-08-07 21:22:29 +00:00
e619c6bb90 [export] Apply move_to_device_pass to all submodules (#159992)
Previously we only applied `move_to_device_pass` to the top-level graph. However, if the program contains higher-order ops (HOO), the pass was not applied to the HOO submodules. This PR modifies the pass to run on all submodules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159992
Approved by: https://github.com/yiming0416
2025-08-07 18:51:15 +00:00
3cf7b4024e [DTensor] Support user-supplied Generator for random ops (#159933)
If the user provides a generator kwarg to a random op (e.g.
nn.init.uniform_(..., generator=my_generator)), we can still advance
that generator's state in a SPMD-global way so that each local-tensor
gets appropriate values and the generator advances to the same state as
if it had operated on the full tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159933
Approved by: https://github.com/fduwjj, https://github.com/XilunWu, https://github.com/wanchaol
2025-08-07 18:47:22 +00:00
21392c0e06 [inductor] disable flex decoding on Windows. (#160072)
As discussed with @jianan-gu and @Valentine233, disable flex decoding on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160072
Approved by: https://github.com/angelayi
2025-08-07 18:07:36 +00:00
ee1fb43450 Fix docker image creation (#158634)
Since switching from wheel 0.34.2 to wheel 0.45.1
python symlinks are no longer correctly created.

Migrate to packaging package for symlink creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158634
Approved by: https://github.com/malfet
2025-08-07 17:41:47 +00:00
0bd3af4fb8 Further fix failing tests in test/inductor/test_analysis.py (#160070)
This is a follow-up to #159800, as other tests are still failing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160070
Approved by: https://github.com/aorenste
2025-08-07 17:32:58 +00:00
8399cf88ce Use only safetensors APIs in HFStorageReader (#159681)
Get rid of the logic that manually reads the metadata from the header of the safetensors file, and use the functions provided by safe_open() to get the metadata instead. This is much cleaner and lets us rely on the APIs that safetensors provides rather than on our own custom methods.

Differential Revision: [D79460272](https://our.internmc.facebook.com/intern/diff/D79460272/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159681
Approved by: https://github.com/saumishr
ghstack dependencies: #159405, #159406
2025-08-07 17:23:03 +00:00
0b187b3114 DCP HF reader: use safe_open instead of reading the bytes (#159406)
Reading the bytes and converting them to tensors is much slower than using safe_open. For an 8B model across 8 ranks, loading took ~30s before this change and ~4s after.

Differential Revision: [D78994259](https://our.internmc.facebook.com/intern/diff/D78994259/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159406
Approved by: https://github.com/saumishr
ghstack dependencies: #159405
2025-08-07 17:23:03 +00:00
69cc606fda HF component update to not use fsspec components (#159405)
Update the HF components to stop inheriting from the fsspec components and instead use the filesystem writer/reader. There doesn't seem to be much need for fsspec, since users are using mounted storage. Using local storage allows for performance improvements because we can take advantage of the safe_open API provided by HF safetensors (30s vs. 4s to load an 8B model), a significant performance win over reading bytes and converting them to tensors, which is what we do now. We can also use the official methods provided by HF instead of reading the metadata as raw bytes and loading it ourselves.

Differential Revision: [D78993550](https://our.internmc.facebook.com/intern/diff/D78993550/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159405
Approved by: https://github.com/saumishr
2025-08-07 17:22:54 +00:00
57f738b635 [inductor] move all cpu scalars using pinned memory for graph partition (#155360) (#158983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158983
Approved by: https://github.com/eellison
ghstack dependencies: #158758
2025-08-07 17:07:26 +00:00
e167c7d0f3 [inductor] allocate non-blocking copy destinations in pinned memory (#155121) (#158758)
Fixes #155121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758
Approved by: https://github.com/EikanWang, https://github.com/eellison
2025-08-07 17:07:26 +00:00
b1a602762e [Profiler] Update README (#159816)
Summary: Updated README with code structure and explanation of core features within profiler

Test Plan:
N/A

Rollback Plan:

Differential Revision: D79604189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159816
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2025-08-07 16:44:41 +00:00
e1cf0d496e [inductor] unification for inductor debug. (#159998)
Unify the inductor debug build, following @desertfire's suggestion: https://github.com/pytorch/pytorch/pull/159938#pullrequestreview-3093803196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159998
Approved by: https://github.com/angelayi
2025-08-07 16:38:00 +00:00
06824f3c72 [inductor] fix test_dynamo_timed on Windows. (#159981)
Fixed `test_dynamo_timed`:
<img width="1030" height="389" alt="image" src="https://github.com/user-attachments/assets/02d84dd8-6a65-4f91-8d4c-48ba0a81fac1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159981
Approved by: https://github.com/angelayi
2025-08-07 16:37:52 +00:00
f3a4d742ec Revert "Add DeviceAllocator as the base device allocator (#138222)"
This reverts commit f7a66da5f9f6b8b75119b1ee8ce9ddc23e15570e.

Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))
2025-08-07 16:34:36 +00:00
74da2604c9 Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 15f1173e5d72d6d45faba4cecd135e0160f06c6f.

Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))
2025-08-07 16:34:36 +00:00
c4e64467b5 Revert "Add UT for torch.accelerator memory-related API (#155200)"
This reverts commit 4604f0482c2b4a3001b62e5bc5085149a9bb053c.

Reverted https://github.com/pytorch/pytorch/pull/155200 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))
2025-08-07 16:34:36 +00:00
90b78ee50f Move xla jobs to unstable workflow (#159272)
Disables the job on PRs completely, so that we don't litter people's CI signals and use machines unnecessarily.

If you want to run these xla tests, add the ciflow/unstable label to your PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159272
Approved by: https://github.com/atalman, https://github.com/malfet
2025-08-07 16:22:52 +00:00
e248719ac0 [DTensor] support _StridedShard in view op (#159656)
**Summary**
Some thoughts on view-op and `_StridedShard` interaction:
1. `_StridedShard` has no impact on sharding (i.e. how tensor is partitioned)
compared to `Shard`. It only changes how shards permute across the devices.
2. `view()` op on DTensor strictly forbids shard redistribution which means if
`view()` may cause shard permutation across devices, it should be rejected.
This is enforced in today's sharding prop for `view()`.
3. Since DTensor `view()` won't introduce any redistribution, it's certain that
`placements` won't change except the inner `dim` attribute of `Shard`
or `_StridedShard`.

Therefore, to support `_StridedShard` in `view()` op, the only change required
is to keep `_StridedShard` as `_StridedShard` in the output spec.

**Test**
`pytest test/distributed/tensor/test_view_ops.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159656
Approved by: https://github.com/wconstab
2025-08-07 15:59:25 +00:00
f60454cce8 S390X: update test dependencies (#158636)
numba currently doesn't build from source due to
https://github.com/numba/numba/pull/10073
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158636
Approved by: https://github.com/malfet
2025-08-07 15:58:30 +00:00
8ab5868a21 Actually run the einops tests in CI (#159776)
The test filter was wrong: it should not start with "test/".

Test Plan:
- wait for CI
- Tested locally with `python test/run_test.py --einops --verbose`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159776
Approved by: https://github.com/atalman, https://github.com/StrongerXi
2025-08-07 15:23:06 +00:00
d20c4c20e6 [CI] Update xpu ci use rolling driver for new features (#158340)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158340
Approved by: https://github.com/seemethere

Co-authored-by: xinan.lin <xinan.lin@intel.com>
2025-08-07 15:18:51 +00:00
83875cdb55 [nativert] Expose ModelRunner to public through pimpl type ModelRunnerHandle. (#159989)
Summary:
Today users outside of pytorch core cannot `#include <torch/nativert/ModelRunner.h>`.

It turns out that we should place a header inside `torch/csrc/api/include/`. Placing every single nativert header there would pollute the namespace a lot, which is not what we want in general. Therefore we just create a Handle type that holds a pointer, decoupling the actual type from the header definition.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79751098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159989
Approved by: https://github.com/dolpm
2025-08-07 14:23:21 +00:00
a53d14d5f8 Revert "unskipped mobilenet_v3 quantization and mobilenet_v2 quantization plus tests from https://github.com/pytorch/pytorch/issues/125438 (#157786)"
This reverts commit 3a2c3c8ed365eb4e4cf4620c25d70b2f70483762.

Reverted https://github.com/pytorch/pytorch/pull/157786 on behalf of https://github.com/albanD due to Breaks lint ([comment](https://github.com/pytorch/pytorch/pull/157786#issuecomment-3164126250))
2025-08-07 13:09:33 +00:00
8cb91e20bc Renaming HAS_XPU to HAS_XPU_AND_TRITON (#159908)
This PR follows up on the discussion in #159399 where @Akabbaj and @janeyx99 mentioned renaming HAS_XPU to HAS_XPU_AND_TRITON for consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159908
Approved by: https://github.com/janeyx99, https://github.com/guangyey
2025-08-07 11:24:44 +00:00
b0df7715e8 Remove benchmark dependencies from regular ROCm CI images (#160047)
Instead, use a new `pytorch-linux-jammy-rocm-n-py3-benchmarks` image for Docker benchmark job.  This addresses 2 issues:

* The current ROCm failures in trunk w.r.t librosa version https://github.com/pytorch/pytorch/actions/runs/16789466749/job/47549950994 that TorchBench pulls in.
* Reduce the size of the regular ROCm CI images by removing TorchBench models, which is needed only for benchmarking jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160047
Approved by: https://github.com/malfet, https://github.com/izaitsevfb
2025-08-07 09:26:58 +00:00
422bd6808b dataclass pytree fix (#159916)
Differential Revision: D79687243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159916
Approved by: https://github.com/XuehaiPan, https://github.com/angelayi
2025-08-07 08:22:41 +00:00
24f43d0da7 [inductor] [cpu] fix the dtype hardcoded to int64 in store_reduction (#157904)
## Fixes https://github.com/pytorch/pytorch/issues/157683

## mini repro
* Just copy the code from the issue to reproduce it.
```python
import torch

device = "cpu"

# Input tensors
v2_0 = torch.randn(16, 24, 59, dtype=torch.complex64, device=device)
v3_0 = torch.randn(16, 24, 59, dtype=torch.complex64, device=device)

def my_model(v2_0, v3_0):
    v6_0 = -v3_0
    v4_0 = v2_0 * v3_0
    v1_0 = v4_0.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
    v0_0 = v2_0.to(torch.int32)
    v5_0 = v0_0.amax(dim=0)

    return v6_0, v4_0, v1_0, v0_0, v5_0

v6_0, v4_0, v1_0, v0_0, v5_0 = my_model(v2_0, v3_0)
print("v6_0", v6_0.shape)
print("v4_0", v4_0.shape)

compiled_model = torch.compile(my_model, backend="inductor")

v6_0, v4_0, v1_0, v0_0, v5_0 = compiled_model(v2_0, v3_0)

print("v6_0", v6_0.shape)
print("v4_0", v4_0.shape)
print("v1_0", v1_0.shape)
print("v0_0", v0_0.shape)
print("v5_0", v5_0.shape)

```
error_stack
```
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: candidate: ‘template<class dst_t, class src_t> std::enable_if_t<(! is_same_v<dst_t, src_t>), at::vec::CPU_CAPABILITY::Vectorized<T> > at::vec::CPU_CAPABILITY::convert(const at::vec::CPU_CAPABILITY::Vectorized<T>&)’
   41 | convert(const Vectorized<src_t>& src) {
      | ^~~~~~~
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note:   template argument deduction/substitution failed:
/tmp/torchinductor_admin/6k/c6kr65o43rlmp2cmkpn5ezewhe5bla4w72hpcrg5biyelrs4skyw.main.cpp:37:99: error: wrong number of template arguments (4, should be 2)
   37 |                     auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
```
## summary
**The C++ kernel generated by the Inductor had the wrong data type for the output variable; it should be int32_t instead of int64_t. This incorrect data type led to an incompatible data type conversion, which caused the g++ compilation to fail.**
The original code that caused the problem.
```
def my_model(v2_0, v3_0):
    v6_0 = -v3_0
    v4_0 = v2_0 * v3_0
    v1_0 = v4_0.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)
    v0_0 = v2_0.to(torch.int32)
    # The original code that caused the problem.
    v5_0 = v0_0.amax(dim=0)
```

## proof procedure
The c++ kernel generated by inductor:
```c++
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void kernel(const int32_t* in_ptr0,
                       int32_t* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(1416L); x0+=static_cast<int64_t>(16L))
        {
            {
                int32_t tmp_acc0_arr[16];
                for (int i = 0; i < 16; i++)
                {
                    tmp_acc0_arr[i] = std::numeric_limits<int32_t>::min();
                }
                int32_t tmp_acc0 = std::numeric_limits<int32_t>::min();
                at::vec::Vectorized<int32_t> tmp_acc0_vec = at::vec::Vectorized<int32_t>(std::numeric_limits<int32_t>::min());
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1408L)))
                        {
                            auto tmp0 = at::vec::Vectorized<int32_t>::loadu(in_ptr0 + static_cast<int64_t>(x0 + 1416L*x1), static_cast<int64_t>(16));
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                        }
                        if(C10_UNLIKELY(x0 >= static_cast<int64_t>(1408L) && x0 < static_cast<int64_t>(1416L)))
                        {
                            for (int64_t x0_tail = static_cast<int64_t>(1408L);x0_tail < static_cast<int64_t>(1416L); x0_tail++)
                            {
                                auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail + 1416L*x1)];
                                tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)] = max_propagate_nan(tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)], tmp0);
                            }
                        }
                    }
                }
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(1408L)))
                {
                    // Impossible data type conversion, which causes the g++ compilation to fail.
                    auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
                    int32_t_tmp_acc0_vec.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(1408L) && x0 < static_cast<int64_t>(1416L)))
                {
                    for (int64_t x0_tail = static_cast<int64_t>(1408L);x0_tail < static_cast<int64_t>(1416L); x0_tail++)
                    {
                        out_ptr0[static_cast<int64_t>(x0_tail)] = tmp_acc0_arr[x0_tail - static_cast<int64_t>(1408L)];
                    }
                }
            }
        }
    }
}
```
The compiler complains:
```text
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note: candidate: ‘template<class dst_t, class src_t> std::enable_if_t<(! is_same_v<dst_t, src_t>), at::vec::CPU_CAPABILITY::Vectorized<T> > at::vec::CPU_CAPABILITY::convert(const at::vec::CPU_CAPABILITY::Vectorized<T>&)’
   41 | convert(const Vectorized<src_t>& src) {
      | ^~~~~~~
/home/admin/pytorch/pytorch/torch/include/ATen/cpu/vec/vec_convert.h:41:1: note:   template argument deduction/substitution failed:
/tmp/torchinductor_admin/6k/c6kr65o43rlmp2cmkpn5ezewhe5bla4w72hpcrg5biyelrs4skyw.main.cpp:37:99: error: wrong number of template arguments (4, should be 2)
   37 |                     auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
```
So the following line is the problem:
```c++
    // this line means that tmp_acc0_vec should be Vectorized<int64_t>, and it will convert it to Vectorized<int32_t>.
    auto int32_t_tmp_acc0_vec = at::vec::convert<int32_t,1,int64_t,2>(tmp_acc0_vec);
```
The issue is that `tmp_acc0_vec` is already of type `Vectorized<int32_t>`, but the template arguments declare the source as `Vectorized<int64_t>`, to be converted to `Vectorized<int32_t>`. These conflict: no conversion should be emitted at all, because `tmp_acc0_vec` is already `Vectorized<int32_t>`. The following lines hardcode the output variable type to int64, which causes the unnecessary and incorrect type conversion:
d89f30ad45/torch/_inductor/codegen/cpp.py (L2985-L2993)
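
The idea behind the fix can be sketched as a small code-generation helper: emit a conversion only when the accumulator type and the output type really differ. `store_expr` is a hypothetical name made up for this note and is not the function in `cpp.py`. The emitted string follows the two-parameter `convert<dst_t, src_t>` candidate quoted in the compiler note, and whether that overload fits a given pair of types is not checked here.

```python
def store_expr(acc_var: str, acc_ctype: str, out_ctype: str) -> str:
    """Return the C++ expression to store for a reduction result.

    Hypothetical helper: the accumulator's real dtype is passed in
    instead of being assumed to be int64_t.
    """
    if acc_ctype == out_ctype:
        # Same type on both sides: store the accumulator directly.
        return acc_var
    # Types differ: emit an explicit conversion from the real source type.
    return f"at::vec::convert<{out_ctype}, {acc_ctype}>({acc_var})"


if __name__ == "__main__":
    print(store_expr("tmp_acc0_vec", "int32_t", "int32_t"))
    print(store_expr("tmp_acc0_vec", "int64_t", "int32_t"))
```

With the int32 reduction from the repro, the first call returns the accumulator unchanged, so no `convert` call is generated.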

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157904
Approved by: https://github.com/jgong5
2025-08-07 08:03:05 +00:00
aa75e917bd [Export Schema] Remove deviceAllocationMap field (#159653)
Summary:
This field is not used today, and it's not useful either.

The device allocation is configured at model loading time, specified by user.
It shouldn't be part of the model definition.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79385513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159653
Approved by: https://github.com/zhxchen17
2025-08-07 07:31:42 +00:00
3f1636ebef [audio hash update] update the pinned audio hash (#160046)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160046
Approved by: https://github.com/pytorchbot
2025-08-07 04:16:35 +00:00
c859ba7114 Make onnx export SDPA match aten behavior (#159973)
This PR makes onnx sdpa export match the behavior of aten sdpa when boolean mask is used.
@justinchuby

```python
import onnxruntime as ort
import torch

class ScaledDotProductAttention(torch.nn.Module):
    def forward(self, query, key, value, attn_mask):
        return torch.nn.functional.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)

model = ScaledDotProductAttention()
attn_mask = torch.ones(2, 4, 8, 8).bool()  # boolean mask for attention
attn_mask[0, 0, 0, :] = False  # masking an entire row (padding token)
query = key = value = torch.randn(2, 4, 8, 16)
output = model(query, key, value, attn_mask)

torch.onnx.export(
    model,
    (query, key, value, attn_mask),
    "scaled_dot_product_attention.onnx",
    input_names=["query", "key", "value", "attn_mask"],
    output_names=["output"],
    dynamo=False,  # or True
)
ort_session = ort.InferenceSession("scaled_dot_product_attention.onnx")

np_inputs = {"query": query.numpy(), "key": key.numpy(), "value": value.numpy(), "attn_mask": attn_mask.numpy()}
onnx_outputs = ort_session.run(None, np_inputs)[0]

torch.testing.assert_close(output, torch.tensor(onnx_outputs), equal_nan=True)
```
Before this PR, the script above fails the assertion because the ORT model outputs NaNs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159973
Approved by: https://github.com/xadupre, https://github.com/titaiwangms
2025-08-07 04:06:07 +00:00
d4c1a08c89 Relax unclaimed successes in dtype op tests when running under TEST_WITH_DYNAMO/TEST_WITH_INDUCTOR (#159976)
This PR changes the behavior for compile wrapped op tests:
- supported_but_unclaimed_forward
- supported_but_unclaimed_backward

These typically manifest when the op doesn't support inputs of certain dtypes. But under torch.compile, Dynamo/AOTAutograd will trace the graph with FakeTensors, which @ezyang and @eellison tell me need to run decomps before op dispatch. The decomp may map this test to a different op, one that does support the dtype. I suspect all of our failures here are due to decomps, and so I propose to just disable this check for compile.

~~TODO: re-enable all the failed tests.~~ jk there were no failed tests outside of compiled autograd due to this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159976
Approved by: https://github.com/ezyang
2025-08-07 02:38:45 +00:00
81d72fb1f7 Move smoke binary builds to 3.12 (#159993)
And limit them just to stable CUDA version (as there weren't any recent instances when only one of those jobs failed to build)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159993
Approved by: https://github.com/ngimel
ghstack dependencies: #159986, #159990
2025-08-07 01:59:30 +00:00
d0226719a9 [BE][EZ] Delete remains of split-build logic (#159990)
Hopefully last piece of https://github.com/pytorch/pytorch/issues/138750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159990
Approved by: https://github.com/atalman
ghstack dependencies: #159986
2025-08-07 01:59:30 +00:00
38d65c6465 Add a USE_NIGHTLY option to setup.py (#159965)
If you run python setup.py develop with USE_NIGHTLY, instead of actually building PyTorch we will just go ahead and download the corresponding nightly version you specified and dump its binaries. This is intended to obsolete tools/nightly.py. There's some UX polish for detecting what the latest nightly is if you pass in a blank string. I only tested on OS X.

Coded with claude code.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159965
Approved by: https://github.com/malfet
2025-08-07 01:44:20 +00:00
2ba2f598f3 [Dynamo] Add torch.xpu.stream to trace rules (#159844)
# Motivation
Previously, I thought using `with stream:` was sufficient. However, many older scripts still use `torch.xpu.stream` as the context manager. To maintain backward compatibility, I had to include `torch.xpu.stream` in the trace rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159844
Approved by: https://github.com/jansel
2025-08-07 01:35:50 +00:00
1bb5e6c076 update expected results (#159867)
refresh due to https://github.com/pytorch/pytorch/pull/159696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159867
Approved by: https://github.com/masnesral
2025-08-07 01:18:36 +00:00
8b0be7b65a [Profiler] Fix unexpected C return events (#159574)
The fix in https://github.com/pytorch/pytorch/pull/155446 addressed the "stack empty" issue that's easily reproducible on CPython 3.12.0-4. While this issue can also appear in other versions, it's not as easy to reproduce there.

I recently found a new cause for this problem.

1df5d00145/Python/ceval.c (L5807-L5836)

In the CPython 3.10 implementation, PyTrace_C_CALL and PyTrace_C_RETURN/PyTrace_C_EXCEPTION are supposed to appear in pairs. However, when c_profilefunc is changed, unexpected PyTrace_C_RETURN/PyTrace_C_EXCEPTION events can occur.

Here is the code to reproduce this problem.

```
import threading
import time
import torch

from threading import Event, Lock

lock = Lock()
lock.acquire()

event1 = Event()
event2 = Event()
event3 = Event()

def run():
    event1.set()
    event2.wait()
    lock.acquire()
    event3.set()

threading.Thread(target=run).start()

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], with_stack=True):
    event1.wait()
    event2.set()
    time.sleep(1)

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], with_stack=True):
    lock.release()
    event3.wait()
```

<img width="1766" height="1250" alt="image" src="https://github.com/user-attachments/assets/6794eeca-7364-429e-91eb-62cdad116bd3" />

To fix this problem, we can record active_frames_ and remaining_start_frames_ for each thread, and when the PyTrace_C_RETURN/PyTrace_C_EXCEPTION event occurs, we can determine whether to record this event based on these two fields.
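
The pairing invariant described above can be illustrated with a toy model. This is only a sketch of the idea, not the PR's implementation: the class `CEventFilter` and its `open_calls` counter are hypothetical stand-ins for the per-thread bookkeeping (`active_frames_`, `remaining_start_frames_`) that the actual fix keeps in C++.

```python
from collections import defaultdict


class CEventFilter:
    """Toy model: record a C return only if the matching C call was seen."""

    def __init__(self):
        # thread id -> number of C calls seen that have not returned yet
        self.open_calls = defaultdict(int)

    def on_event(self, thread_id, kind):
        """Return True if the event should be recorded."""
        if kind == "c_call":
            self.open_calls[thread_id] += 1
            return True
        if kind in ("c_return", "c_exception"):
            if self.open_calls[thread_id] == 0:
                # The call began before this profiler was installed: skip it.
                return False
            self.open_calls[thread_id] -= 1
            return True
        raise ValueError(f"unknown event kind: {kind}")


f = CEventFilter()
events = [(1, "c_return"), (1, "c_call"), (1, "c_return"), (2, "c_exception")]
recorded = [f.on_event(tid, kind) for tid, kind in events]
```

In this model the first and last events are dropped because no matching call was observed on that thread, which is the situation the reproducer above triggers when the profile function changes while a thread is blocked inside a C call.
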

In reality, even without this fix, the final data appears to be right since the match process can handle this case (it would just result in an exception log being printed).

Do you think the fix is necessary?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159574
Approved by: https://github.com/sraikund16
2025-08-07 01:17:55 +00:00
5cedc5a0ff [BE][PYFMT] migrate PYFMT for torch/[p-z]*/ to ruff format (#144552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144552
Approved by: https://github.com/ezyang
2025-08-07 00:09:56 +00:00
fd606a3a91 [dynamo] update pytorch-labs -> meta-pytorch in graph break URLs (#159975)
Related PR: https://github.com/meta-pytorch/compile-graph-break-site/pull/30

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159975
Approved by: https://github.com/Lucaskabela
2025-08-06 23:57:31 +00:00
3daef4d128 [dynamo] Trace nn.Module __delattr__ (#159969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159969
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/StrongerXi
2025-08-06 23:43:19 +00:00
cb4b29b754 Revert "[pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#159874)"
This reverts commit 9fd5b5f73589cf08dca60910368cc0f05c7906c8.

Reverted https://github.com/pytorch/pytorch/pull/159874 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/159874#issuecomment-3161896978))
2025-08-06 23:21:29 +00:00
a6bc296207 [FlexAttention] Update the guard semantics for divisibility (#159884)
We don't add guards unless we know (and another guard has ensured this) that this is a safe optimization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159884
Approved by: https://github.com/Chillee
2025-08-06 23:12:44 +00:00
64dc30c213 [HOP, map] Rework of map autograd to the new interface (#153343)
This PR reworks the current autograd implementation of map to the new interface.

@pytorchbot label "topic: not user facing"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153343
Approved by: https://github.com/ydwu4
2025-08-06 23:02:42 +00:00
93da9952a7 gloo: fix building system gloo with CUDA/HIP (#146637)
Fix incorrect linking of Gloo's libraries when building with system Gloo. Previously, either Gloo's native library or Gloo's CUDA library was linked. However, Gloo has changed such that all users of Gloo must link the native library, and can optionally link the CUDA or HIP library for Gloo + CUDA/HIP support.
This had been updated when building/linking with vendored Gloo, but not when using system Gloo.

Fixes: #146239

Reported-by: Adam J Stewart <ajstewart426@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146637
Approved by: https://github.com/malfet
2025-08-06 22:56:31 +00:00
3a2c3c8ed3 unskipped mobilenet_v3 quantization and mobilenet_v2 quantization plus tests from https://github.com/pytorch/pytorch/issues/125438 (#157786)
These tests now pass on AArch64 in our downstream CI.

`test_quantization.py::TestNumericSuiteEager::test_mobilenet_v2 <- test/quantization/eager/test_numeric_suite_eager.py PASSED [2.4434s] [ 35%]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157786
Approved by: https://github.com/jerryzh168, https://github.com/malfet
2025-08-06 22:41:07 +00:00
9fd5b5f735 [pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#159874)
Summary: Write torch.compile worker logs to dedicated_log_rank{RANK} if we're running on mast.

Test Plan:
See: D79456310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159874
Approved by: https://github.com/c00w
2025-08-06 22:33:04 +00:00
2507ae63f2 Partitioner: Fix to align partition node order with original graph (#157892)
Fixes #157891

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157892
Approved by: https://github.com/ezyang
2025-08-06 22:12:47 +00:00
40c4d61f9a [Dynamo][Better Engineering] Typing torch/_dynamo/guards.py (#159315)
As part of the better engineering effort, we would like to improve our type support to improve the dev experience in Dynamo.

This PR adds strict typing support to `torch/_dynamo/guards.py`

Running
```
mypy torch/_dynamo/guards.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2030 | 3945 | 51.46% | 70 | 138 | 50.72% |
| This PR | 4055 | 4055 | 100.00% | 138 | 138 | 100.00% |
| Delta    | +2025 | +90 | +48.54% | +68 | 0 | +49.28% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159315
Approved by: https://github.com/williamwen42, https://github.com/Skylion007
2025-08-06 21:52:14 +00:00
a5725965ea Remove unnecessary "# noqa: set_linter" comments (#159467)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159467
Approved by: https://github.com/eellison
2025-08-06 21:31:52 +00:00
289f62ce8a [inductor][ez] fixup scaled_mm (#159948)
Summary:

This reverts the part of #159383 for scaled_mm where now, like before,
we pass through the normal input_nodes (not the triton_input_nodes)
to select_algorithm

- #159383 refactored how kwargs are retrieved
- it introduced this notion of KernelInputs that wrap input_nodes
- scaled_mm uses unsqueezed input nodes for triton to retrieve params
- the issue: it uses a squeezed (regular) bias for select_algorithm
  instead

This fixes that by passing the original input nodes rather
than the triton input nodes.

Test Plan:

```
buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_False (caffe2.test.inductor.test_fp8.TestFP8Lowering)'
buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_True (caffe2.test.inductor.test_fp8.TestFP8Lowering)'
```

This set of tests was failing, and is passing now

Side note: these tests were failing I believe because the unsqueezed
bias made the ATEN choice no longer eligible, and there is some minor
numerical discrepancy between ATEN and Triton for this. I'm not sure
the test should be written like that, as we're implicitly relying on
ATEN being the choice here.

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D79717654](https://our.internmc.facebook.com/intern/diff/D79717654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159948
Approved by: https://github.com/izaitsevfb, https://github.com/eellison
2025-08-06 21:25:48 +00:00
512b4730e3 [EZ] Remove useless cross_compile_arm64 (#159986)
As we haven't had any Intel Mac runners in CI for the last 2+ years
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159986
Approved by: https://github.com/atalman
2025-08-06 21:01:05 +00:00
d2368aa6f3 [CPUBLAS] add macros for brgemm APIs for versioning (#158629)
**Summary**
Add macros for brgemm, so that callers (e.g., Torchao's cpp kernels) know which APIs are available. It is useful when callers need to co-work with old versions of PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158629
Approved by: https://github.com/CaoE, https://github.com/Valentine233, https://github.com/ezyang
2025-08-06 20:54:05 +00:00
0afaeb7c4e Improve extract_test_fn (#158637)
The current implementation assumes test functions are resolved as test_module.TestClass.test_fn; however, this does not work for modules nested in directories, e.g. inductor.test_torchinductor.TestClass.test_fn
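
A minimal sketch of one way to handle nested modules, assuming the test id always ends in `TestClass.test_fn`. The helper name `split_test_id` is hypothetical and is not the function changed by this PR:

```python
def split_test_id(test_id):
    """Split 'pkg.module.TestClass.test_fn' into (module, class, function).

    Splitting from the right keeps any number of leading package
    components together in the module part.
    """
    module, cls, fn = test_id.rsplit(".", 2)
    return module, cls, fn


flat = split_test_id("test_module.TestClass.test_fn")
nested = split_test_id("inductor.test_torchinductor.TestClass.test_fn")
```

Splitting from the left on the first dot would instead treat `inductor` as the module and misplace the remaining parts.
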
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158637
Approved by: https://github.com/jbschlosser
2025-08-06 20:45:21 +00:00
50580b5053 Add minimal nn.functional.log_softmax support for NestedTensor (#159662)
This only works for the jagged layout and for the non-batch and non-jagged dimensions.

I did this mostly by copy-pasting from the existing softmax implementation, but it seems fairly straightforward and I think it should work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159662
Approved by: https://github.com/jbschlosser
2025-08-06 20:34:02 +00:00
b8ef60b6bc Enable XNNPACK aarch64 builds (#159762)
Summary:
This fixes the build of TorchScript's XNNPACK dependency for our aarch64 device.

Thanks to andrewjcg for proposing this fix.

Rollback Plan:

Reviewed By: andrewjcg

Differential Revision: D79497613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159762
Approved by: https://github.com/frankseide, https://github.com/malfet

Co-authored-by: Frank Seide <seide@meta.com>
2025-08-06 20:20:32 +00:00
0de2a45a48 [BE] Merge 3 CUDA build jobs into one (#159890)
Before this change there were build+test jobs:
 - sm89 build+tests
 - sm75 build+distributed_test
 - sm75 build+pr_time_benchmark test
This change compiles all 3 builds into one (for 2 architectures) and skips testing sm86 as it never found any new regressions that were not found at the same time on sm89
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159890
Approved by: https://github.com/clee2000, https://github.com/seemethere
2025-08-06 20:09:55 +00:00
12a54e4ac1 [Inductor UT][Fix XPU CI] Fix case failures introduced by community. (#159759)
Fixes #159631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159759
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-08-06 20:02:20 +00:00
d10e9e4781 [MPS] Remove all pre-MacOS14 logic (#159912)
Delete older enums, checks for MacOS-13.3+ for int64 support, etc

Fixes https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159912
Approved by: https://github.com/manuelcandales
2025-08-06 19:48:12 +00:00
c71950907d [inductor] add _get_inductor_debug_symbol_cflags for debug symbol control. (#159938)
We need to add Inductor debug symbol support for debugging crashes. When debug symbol generation is turned on:
On Windows, it should create a [module_name].pdb file, which helps when debugging with WinDbg.
On Linux, it should create debug sections in the binary file.

I added UT for it also.

It works well on Windows inductor debug.
<img width="1648" height="833" alt="image" src="https://github.com/user-attachments/assets/5282a7de-cef3-4a38-9cd4-a0e63482c8b6" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159938
Approved by: https://github.com/jansel, https://github.com/angelayi
2025-08-06 19:31:45 +00:00
6fa3592dc6 Dataloader benchmark script (#159432)
This script adds a simple dataloading benchmark tracking throughput and memory.

The output looks like this
```
System Information:
  PyTorch version: 2.9.0a0+gitf87d117
  PyTorch location: /home/divyanshkhanna/pytorch/torch/__init__.py
  Torchvision version: 0.24.0a0+f52c4f1
  Torchvision location: /home/divyanshkhanna/pytorch/vision/torchvision/__init__.py
  CUDA available: True
  CUDA device: NVIDIA PG509-210
  CPU count: 192
  Physical CPU cores: 96
  Total system memory: 1510.11 GB

Loading dataset from imagenet/val (1 copies)
Dataset size: 50000

--- Benchmarking DataLoader with worker_method=multiprocessing ---
Memory before DataLoader creation: 500.59 MB

Detailed memory information:
  USS (Unique Set Size): 499.00 MB
  PSS (Proportional Set Size): 500.74 MB
  RSS (Resident Set Size): 497.39 MB
Memory after DataLoader creation: 1127.61 MB
Memory increase: 627.02 MB
Starting training loop with 1 epochs (max 100 batches per epoch)
Epoch 1, Batch 10, Time: 0.2910s, Memory: 12044.50 MB
Epoch 1, Batch 20, Time: 0.2909s, Memory: 12185.71 MB
Epoch 1, Batch 30, Time: 0.2909s, Memory: 10654.93 MB
Epoch 1, Batch 40, Time: 0.2909s, Memory: 12378.26 MB
Epoch 1, Batch 50, Time: 0.2907s, Memory: 12402.28 MB
Epoch 1, Batch 60, Time: 0.2909s, Memory: 10559.35 MB
Epoch 1, Batch 70, Time: 0.2907s, Memory: 12644.69 MB
Epoch 1, Batch 80, Time: 0.2909s, Memory: 12654.65 MB
Epoch 1, Batch 90, Time: 0.2909s, Memory: 12727.20 MB
Epoch 1, Batch 100, Time: 0.2908s, Memory: 12722.09 MB

Results:
  Worker method: multiprocessing
  DataLoader init time: 0.1553 seconds
  Average batch time: 0.3408 seconds
  Samples per second: 375.53
  Peak memory usage: 12738.76 MB
  Memory increase: 12238.17 MB
```

> TODO: This script right now is CPU-only friendly and GPU friendly. But it might be worth upgrading it to test against a canonical DistributedDataParallel setup on say a 1x8 node. Or maybe we can keep that as a separate script inside `benchmarks`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159432
Approved by: https://github.com/ramanishsingh
2025-08-06 19:05:19 +00:00
ba37f589d4 Revert "[dynamo] Be consistent with storing func source for UserMethodVariable (#159696)"
This reverts commit ee62177c196d716fc3a2d641370bed8a673a45d3.

Reverted https://github.com/pytorch/pytorch/pull/159696 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/159696#issuecomment-3161196192))
2025-08-06 18:41:05 +00:00
44dd3684d2 [AOTI] Fix memory leak from all_reduce (#159818)
Summary: This PR solves two issues:

1. When lowering the all_reduce op, Inductor expects to convert it to the in-place version, all_reduce_, but it was calling ir._AllReduceKernel.create_inplace instead of ir._AllReduce_Kernel.create_inplace. This triggers a tricky bug in AOTI because it generates a cpp call to the functional version aoti_torch_cpu__c10d_functional_all_reduce, but the later corresponding wait operation will still wait on the input to aoti_torch_cpu__c10d_functional_all_reduce instead of the output from aoti_torch_cpu__c10d_functional_all_reduce. This leaves an unwaited tensor, leading to a memory leak.

2. Since AOTI now generates the in-place version aoti_torch_cpu__c10d_functional_all_reduce_, the tensor returned from aoti_torch_cpu__c10d_functional_all_reduce_ doesn't get used. It will be released when the program exits, so it's not a memory leak, but it unnecessarily holds that tensor, which causes a high memory watermark. This PR generates a tensor delete operation right after calling aoti_torch_cpu__c10d_functional_all_reduce_.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159818
Approved by: https://github.com/henryhu6, https://github.com/yushangdi
2025-08-06 18:11:14 +00:00
c669b0ab87 Fix execution frame cleanup logic (#158717)
Summary: This fixes a bug in the execution frame cleanup logic. Previously, whenever we hit the time interval to clear out the frames, we were removing any cached execution frames beyond the configured minimum number (frameEntry.used was unused). Instead, we only want to clear frames that were NOT USED during the last time interval. This diff refactors the executor to have the correct logic.
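
The intended policy can be sketched as follows. This is a simplified Python model under stated assumptions, not the executor's C++ code: `FrameEntry`, `cleanup`, and `min_frames` are hypothetical names, and the sketch assumes unused frames are kept only when needed to reach the configured minimum.

```python
from dataclasses import dataclass


@dataclass
class FrameEntry:
    name: str
    used: bool = False  # set when the frame is handed out during the interval


def cleanup(frames, min_frames):
    """Drop frames not used in the last interval, keeping at least min_frames."""
    kept = [f for f in frames if f.used]
    # Top up with unused frames only if we fell below the configured minimum.
    for f in frames:
        if len(kept) >= min_frames:
            break
        if not f.used:
            kept.append(f)
    # Reset the usage marks so the next interval starts fresh.
    for f in kept:
        f.used = False
    return kept


frames = [FrameEntry("a", True), FrameEntry("b"), FrameEntry("c", True), FrameEntry("d")]
kept_names = [f.name for f in cleanup(frames, min_frames=1)]
```

The buggy behavior described above corresponds to ignoring `used` entirely and trimming purely by count.
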

Test Plan:
```
buck2 test 'mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:model_runner_test -- ModelRunnerTest.Basic_InterpreterCuda_Multithread_Cleanup --run-disabled --print-passing-details
```

Rollback Plan:

Differential Revision: D78621408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158717
Approved by: https://github.com/dolpm
2025-08-06 18:04:24 +00:00
d7a855d67d [async-TP] Make scaled-mm + reduce-scatter preserve alignment of scales (#159957)
After https://github.com/pytorch/pytorch/pull/157905 started using cuBLAS for row-wise scaling on CUDA 12.9+, this broke some downstream tests for fp8 which were testing "odd" shapes. After checking in with the cuBLAS team this turned out to be due to the scale tensors' starting addresses not being aligned to 16 bytes. PyTorch storages are always aligned at 256 bytes, hence this came from a "slicing" of the scale tensor being done inside async-TP when chunking a matmul in order to overlap it with reduce-scatter.
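
The alignment arithmetic can be checked with a small sketch. The helper `start_is_aligned` is hypothetical, and the example assumes float32 scales (4 bytes per element) stored in a buffer whose base address is 256-byte aligned:

```python
def start_is_aligned(start_elem, itemsize, base_alignment=256, required=16):
    """Check whether a slice starting at start_elem begins on a
    `required`-byte boundary, assuming the storage base address is a
    multiple of base_alignment."""
    assert base_alignment % required == 0
    return (start_elem * itemsize) % required == 0


# Slices of a float32 scale tensor starting at elements 0, 3 and 4.
checks = [start_is_aligned(i, 4) for i in (0, 3, 4)]
```

A chunk that starts at element 3 begins 12 bytes into the storage, so it is not 16-byte aligned even though the storage itself is.
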

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159957
Approved by: https://github.com/vkuzo, https://github.com/danielvegamyhre
2025-08-06 17:42:26 +00:00
4c01991b38 [DCP][Prototype] Checkpoint replication via PGTransport (#157963) (#159801)
Summary:

### PR Context

Introduce simple replication logic via PGTransport. The goal is to showcase a working prototype of replication via PGTransport, in this impl we assume world_sizes are equal allowing us to create perfect bi-directional pairs for the purpose of choosing replica "partners".

Test Plan:
CI

Rollback Plan:

Differential Revision: D79590797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159801
Approved by: https://github.com/saumishr
2025-08-06 16:52:03 +00:00
a4b07fe8f6 [AOTI] Add more default options to compile_standalone (#158560)
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
2025-08-06 15:59:27 +00:00
d87161c3c8 [Easy] Fix wrong propagation of fallback_ops_dict in gen_aoti_c_shim (#159904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159904
Approved by: https://github.com/janeyx99
2025-08-06 15:09:18 +00:00
79eca4677b [precompile] Skip serializing unnecesssary objects for guards. (#158926)
Summary:
The following type of objects don't need to be serialized for precompile:
1. PyCapsule because we don't guard on C binding objects in meaningful ways.
2. Code objects, because we only do id matching on these, but id matches will always be dropped for precompile.
3. Nested function objects since we also ban CLOSURE_MATCH.

Test Plan:
buck run mode/opt test/dynamo:test_dynamo -- -k test_skipped_objects

Rollback Plan:

Differential Revision: D78816888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158926
Approved by: https://github.com/jamesjwu
2025-08-06 15:00:28 +00:00
2855688a1d Revert "Replace C array with std::array in formatSockAddr (#159812)"
This reverts commit e7feedf6a9bb346ad205796aa4084c8dcfb18072.

Reverted https://github.com/pytorch/pytorch/pull/159812 on behalf of https://github.com/malfet due to Looks like it broke distributed tests, see 2231c3ca3a/1 ([comment](https://github.com/pytorch/pytorch/pull/159812#issuecomment-3160513656))
2025-08-06 14:55:48 +00:00
2231c3ca3a [CI][CD] Fix install_nvshem function (#159907)
When one builds CD docker, all CUDA dependencies must be installed into `/usr/local/cuda/` folder

Test plan: Look at the binary build logs, for example [here](https://github.com/pytorch/pytorch/actions/runs/16768141521/job/47477380147?pr=159907):
```
2025-08-06T05:58:00.7347471Z -- NVSHMEM_HOME set to:  ''
2025-08-06T05:58:00.7348378Z -- NVSHMEM wheel installed at:  ''
2025-08-06T05:58:00.7392528Z -- NVSHMEM_HOST_LIB:  '/usr/local/cuda/lib64/libnvshmem_host.so'
2025-08-06T05:58:00.7393251Z -- NVSHMEM_DEVICE_LIB:  '/usr/local/cuda/lib64/libnvshmem_device.a'
2025-08-06T05:58:00.7393792Z -- NVSHMEM_INCLUDE_DIR:  '/usr/local/cuda/include'
2025-08-06T05:58:00.7394252Z -- NVSHMEM found, building with NVSHMEM support
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159907
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-08-06 14:44:37 +00:00
c03a734ba1 [OpenReg] Disable automatic inclusion of data files (#159845)
# Background

After I built torch_openreg, I noticed that the wheel package contained the stub.c file under the csrc directory, which was not used in the runtime.

# Motivation

This PR aims to remove the stub.c file and any unused file when running torch_openreg.

**Changes:**

- Setting **include_package_data** keyword to false in the setup function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159845
Approved by: https://github.com/albanD
2025-08-06 10:35:13 +00:00
98316e5896 [WOQ] Add CUDA kernel for _weight_int8pack_mm (#159325)
**Summary**
This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. https://github.com/pytorch/pytorch/issues/158849

**Motivation**
A fused GPU kernel for aten._weight_int8pack_mm would:
- Eliminate reliance on the .mul().sum() fallback in quantization.py
- Improve performance for quantized inference on CUDA
- Extend Inductor’s GPU quantization support across more workloads

**Implementation**
- Implement a Triton kernel for:
```
out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]

where:
x: [B, K] float32
w: [N, K] int8
scale: [N] float32
out: [B, N] float32
```
- Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py
- Route it conditionally in quantization.py where GPU currently falls back to .mul().sum()
- Add unit tests comparing results to the reference fallback path
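
The formula above can be written as a plain-Python reference, which is handy for checking shapes and the placement of the per-row scale. This is a sketch using nested lists, not the Triton kernel; the function name `weight_int8pack_mm_ref` is hypothetical:

```python
def weight_int8pack_mm_ref(x, w, scale):
    """out[b][n] = sum_k(x[b][k] * w[n][k]) * scale[n], using nested lists."""
    return [
        [sum(xk * wk for xk, wk in zip(x_row, w_row)) * s
         for w_row, s in zip(w, scale)]
        for x_row in x
    ]


x = [[1.0, 2.0], [0.5, -1.0]]   # [B=2, K=2] activations
w = [[3, -4], [1, 1], [0, 2]]   # [N=3, K=2] int8-range weights
scale = [0.5, 2.0, 1.0]         # [N=3] per-output-channel scales
out = weight_int8pack_mm_ref(x, w, scale)
```

Note that the scale is applied once per output column after the reduction over `k`, which is what lets a fused kernel accumulate on the unscaled weights.
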

Test Plan:
```
buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda
```
Log: P1882799769

```
buck2 test 'fbcode//mode/opt' caffe2/test:linalg
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/

Benchmark Results:
```
**[Shape B=256, K=1024, N=512]**
CPU and CUDA outputs match
Max abs diff: 2.59e-04, max rel diff: 0.75
CPU: 144.14 ms, CUDA: 303.67 µs
Speedup: ×474.6

**[Shape B=512, K=2048, N=1024]**
CPU and CUDA outputs match
Max abs diff: 5.49e-04, max rel diff: 0.15
CPU: 1173.27 ms, CUDA: 2.40 ms
Speedup: ×488.5
```
Rollback Plan:

Differential Revision: D79042656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159325
Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168
2025-08-06 10:28:08 +00:00
23cf241039 [aoti][mps] Initialize mps kernels first (#159753)
In some cases we have mps kernels which are reused across higher-order-op subgraphs and the toplevel code. However, currently we initialize the variable for the mps kernel the first time we use it, which runs into an issue if we run into the mps kernel within a subgraph since the kernel will only be initialized within the subgraph scope. For instance:
```
if ...
    auto mps_lib_0_func = ...
    mps_lib_0_func->run()

// since we already used mps_lib_0 once, we don't re-initialize it
mps_lib_0_func->run()  // error, mps_lib_0_func not initialized
```

So the solution we took here is to initialize all the kernels at the beginning:
```
const std::shared_ptr<at::native::mps::MetalKernelFunction> get_mps_lib_0() {
    static const auto func = mps_lib_0.getKernelFunction("generated_kernel");
    return func;
}
AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() {
    static const auto handle = AOTIMetalKernelFunctionHandle(get_mps_lib_0().get());
    return handle;
}
...
if ...
    get_mps_lib_0()->run()

get_mps_lib_0()->run()  // success
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159753
Approved by: https://github.com/malfet
ghstack dependencies: #159456, #159695
2025-08-06 07:54:29 +00:00
e7feedf6a9 Replace C array with std::array in formatSockAddr (#159812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159812
Approved by: https://github.com/Skylion007
2025-08-06 07:44:29 +00:00
dad2a05bec [DTensor] Set up DTensorContinuousTestBase (#159885)
Also migrate `test_common_rules.py` since it was a short file

`python test/distributed/tensor/test_common_rules.py`

Before:
Ran 10 tests in 91.516s
After:
Ran 10 tests in 5.604s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159885
Approved by: https://github.com/ezyang
2025-08-06 07:40:31 +00:00
0495cab545 Wire in pt2_triton_builds (#159897)
Summary:
This allows us to start seeing the failure rate on these models (and
potentially alert on it).

Test Plan:
```
FORCE_LOG_TRITON_BUILDS_TO_PROD=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 buck2 run @//mode/opt :compile 2>&1 | tee out
```
P1889607054

Waiting for the scuba table to generate, but manual logging shows it should show up at https://fburl.com/scuba/pt2_triton_builds_inc_archive/7852kt8h soon.

Rollback Plan:

Reviewed By: masnesral

Differential Revision: D79308333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159897
Approved by: https://github.com/masnesral
2025-08-06 07:39:51 +00:00
abfe403981 [AIDIR] Internal util function to insert MLHub debugging insight for dynamic shape (#159391)
Summary:
This feature is Meta internal only
Add a util function to put dynamic shape-related suggestions into MLHubDebugInsightService, which will then be surfaced to users in MLHub.

The rollout will be controlled by JK.

Test Plan:

MAST job aps-omnifmv3_dev_baseline_test-a34fdccf21

 {F1980593060}

* If you're not able to see the insight, please add yourself to this gk 'mlhub_debugging_insights_dev_visibility'
* The URL link should route to a new Job Inspector page that will provide details and straightforward instructions on how to configure the ds. The page is currently still in development, so here we use the general PT2 compile JI page.
* Test fails because of the export checks. I'll export after addressing all the comments from reviewers.

Rollback Plan:

Reviewed By: pianpwk

Differential Revision: D78526522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159391
Approved by: https://github.com/jingsh
2025-08-06 07:39:39 +00:00
1690c0c3a0 [Reland] Migrate ScalarType to headeronly (#159911)
The non ghstack version of #159416, to make sure we don't get reverted again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159911
Approved by: https://github.com/mikaylagawarecki
2025-08-06 07:36:37 +00:00
e9d27aa8fd [CUDA 13] CMake/Dependencies: no need to call find_package(CUB) (#159854)
The CUB library is part of CCCL in CUDA Toolkit 13. If CUDA is found, CUB is found as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159854
Approved by: https://github.com/eqy
2025-08-06 06:03:58 +00:00
2457e62c90 Revert "Set PYTHONHOME for inductor subprocesses using torch (#159382)"
This reverts commit fe8984a9f43bde10d1956abe7cb40710ed7ceed2.

Reverted https://github.com/pytorch/pytorch/pull/159382 on behalf of https://github.com/malfet due to Broke MacOS testing see d0fccbc99c/1 ([comment](https://github.com/pytorch/pytorch/pull/159382#issuecomment-3157455367))
2025-08-06 05:30:20 +00:00
d0fccbc99c [CI] Delete sm86 tests from pull (#159903)
And delete sm89+cuda12.4 builds from periodic (as sm86+legacy driver should be enough)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159903
Approved by: https://github.com/huydhn
2025-08-06 05:16:55 +00:00
3461988a4b [audio hash update] update the pinned audio hash (#159823)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159823
Approved by: https://github.com/pytorchbot
2025-08-06 05:02:35 +00:00
9764981116 Pass fw/bw compilers to aot_export_joint_with_descriptors (#159814)
Allow overriding nop compilers with real ones when using this flow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159814
Approved by: https://github.com/fmassa
2025-08-06 04:50:56 +00:00
704594eb23 [Dynamo] make HOPs hashable (#159910)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159910
Approved by: https://github.com/yf225
2025-08-06 04:02:17 +00:00
eqy
bfc27cf468 [Distributed] Fix @parametrize on unordered iterable in distributed test (#159793)
seems to fix https://github.com/pytorch/pytorch/issues/145807

sets aren't ordered so `@parametrize` can cause two processes to spawn with different settings
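
A minimal sketch of the general remedy, assuming the parameters are sortable. The helper `stable_params` is hypothetical and not necessarily what this PR changed. Python randomizes string hashes per process unless `PYTHONHASHSEED` is fixed, so iterating a set of strings can yield a different order in each spawned process:

```python
def stable_params(values):
    """Turn an unordered collection into a deterministic parameter list."""
    return sorted(values)


backends = {"nccl", "gloo", "ucc"}   # iteration order may differ per process
params = stable_params(backends)     # same order in every process
```

Passing a list or tuple (or a sorted copy, as here) to the parametrization decorator means every rank generates the same test variants in the same order.
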

originally debugged thanks to @k-artem, see https://github.com/pytorch/pytorch/issues/145807#issuecomment-2971009451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159793
Approved by: https://github.com/Skylion007, https://github.com/wconstab
2025-08-06 03:51:42 +00:00
311f74089a remove print (#159917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159917
Approved by: https://github.com/laithsakka
2025-08-06 03:48:23 +00:00
14c7358c64 Enable fr_trace to read local traces from multiple hosts. (#159490)
Summary: For training jobs particularly from GenAI, NCCL trace dumps are generated in the format of `<hostname>.pci3_rank_<rank>`. For multi-node training jobs, the hostname varies across traces. The current prefix matching logic can't handle this case.
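
One way to match such files is sketched below. It assumes file names of the form `<hostname>.pci3_rank_<rank>` or the bare `pci3_rank_<rank>`; the helper `find_trace_files` is hypothetical and is not the matching code in fr_trace:

```python
import re


def find_trace_files(filenames, prefix="pci3_rank_"):
    """Return (host, rank, filename) for names like '<host>.pci3_rank_<rank>'.

    The host part is optional, so bare 'pci3_rank_<rank>' names still match
    and are reported with host None.
    """
    pattern = re.compile(
        r"^(?:(?P<host>.+)\.)?" + re.escape(prefix) + r"(?P<rank>\d+)$"
    )
    matches = []
    for name in sorted(filenames):
        m = pattern.match(name)
        if m:
            matches.append((m.group("host"), int(m.group("rank")), name))
    return matches


files = ["host0.pci3_rank_0", "host1.pci3_rank_1", "pci3_rank_2", "notes.txt"]
found = find_trace_files(files)
```

Matching on a suffix pattern instead of a fixed file-name prefix is what allows the hostname to vary across nodes.
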

Test Plan:
Create a local folder `dumps` and several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1` inside it. Then run
```
buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps
```

Before this diff, fr_trace cannot locate any trace files, giving the following assertion error:
```
AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_
```

After this diff, fr_trace is able to locate the trace files, resulting in exceptions like
```
    dump = pickle.load(infile)
           ^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
```
(since the trace files are fake and empty).
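
A minimal sketch of hostname-tolerant matching, assuming file names follow the `<hostname>.pci3_rank_<rank>` format described in the summary. The function name `match_trace_file` and the regular expression are illustrative assumptions, not the code in `fr_trace`.

```python
import re


def match_trace_file(filename, prefix="pci3_rank_"):
    """Return (hostname, rank) if filename is '[<hostname>.]<prefix><rank>'.

    The hostname part is optional so that plain '<prefix><rank>' names,
    which pure prefix matching already handled, are still accepted.
    """
    pattern = re.compile(
        r"^(?:(?P<host>.+)\.)?" + re.escape(prefix) + r"(?P<rank>\d+)$"
    )
    m = pattern.match(filename)
    if m is None:
        return None
    return m.group("host"), int(m.group("rank"))
```

With this helper, `host0.pci3_rank_1` and `host1.pci3_rank_1` are both recognized as rank 1 dumps from different hosts.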

Rollback Plan:

Differential Revision: D79224727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490
Approved by: https://github.com/fduwjj
2025-08-06 03:15:34 +00:00
8ce81bcee1 [Torch Package] Make get names of OrderedImporters support fallback to importers (#155743)
Summary:
OrderedImporters is supposed to be an importer that tries each importer in self._importers in turn. However, the get_name API does not follow this behavior and only uses the get_name from the basic Importer class.
This change updates the OrderedImporters get_name API so that it tries the get_name API of each importer.
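
The fallback behavior described above can be sketched as follows. `SketchImporter` and `SketchOrderedImporter` are hypothetical stand-ins written for this illustration; they are not the `torch.package` classes, and the exception type is an assumption.

```python
class SketchImporter:
    """Hypothetical importer that can name a fixed set of objects."""

    def __init__(self, names):
        # Maps id(obj) -> (module_name, attribute_name).
        self._names = names

    def get_name(self, obj):
        try:
            return self._names[id(obj)]
        except KeyError:
            raise LookupError(f"cannot name {obj!r}") from None


class SketchOrderedImporter:
    """Tries each wrapped importer in order and returns the first success."""

    def __init__(self, *importers):
        self._importers = list(importers)

    def get_name(self, obj):
        last_error = None
        for importer in self._importers:
            try:
                return importer.get_name(obj)
            except LookupError as e:
                # Remember the failure and fall through to the next importer.
                last_error = e
        raise LookupError(f"no importer could name {obj!r}") from last_error


a, b = object(), object()
first = SketchImporter({id(a): ("mod_a", "a")})
second = SketchImporter({id(b): ("mod_b", "b")})
ordered = SketchOrderedImporter(first, second)
```

Here `ordered.get_name(b)` succeeds through the second importer even though the first one cannot name `b`.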

Differential Revision: D76463252

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155743
Approved by: https://github.com/jcwchen, https://github.com/jingsh
2025-08-06 02:26:10 +00:00
4604f0482c Add UT for torch.accelerator memory-related API (#155200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155200
Approved by: https://github.com/albanD
ghstack dependencies: #138222, #152932
2025-08-06 02:22:18 +00:00
15f1173e5d Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following API will be put under torch.accelerator
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-08-06 02:22:18 +00:00
e16c48ae97 [BE] Fix type hint in AOTIRunnerUtil (#159577)
Not sure why it was labelled as list in the first place. In test_aot_inductor.py, I scanned a few use cases and they are tuple as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159577
Approved by: https://github.com/Skylion007
2025-08-06 01:20:45 +00:00
f7a66da5f9 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with [RFC] [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace, which leads to [a lot of if-else conditional code](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code). We would like to introduce a generic API set under the torch.accelerator namespace to generalize these use cases.

<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>

```python
torch.xxx.empty_cache
```

</td>
<td>

```python
torch.accelerator.empty_cache
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_peak_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_peak_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_accumulated_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_accumulated_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_stats
```

</td>
<td>

```python
torch.accelerator.memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_allocated
```

</td>
<td>

```python
torch.accelerator.memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_allocated
```

</td>
<td>

```python
torch.accelerator.max_memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_reserved
```

</td>
<td>

```python
torch.accelerator.memory_reserved
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_reserved
```

</td>
<td>

```python
torch.accelerator.max_memory_reserved
```

</td>
</tr>

</table>
</div>

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.
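
The call path above can be illustrated with a small Python analogue. This is only a sketch of the dispatch pattern: the real `DeviceAllocator` is a C++ class, and every name below (`DeviceAllocatorSketch`, `FakeAllocator`, `register_device_allocator`, `get_device_allocator`) is hypothetical.

```python
from abc import ABC, abstractmethod


class DeviceAllocatorSketch(ABC):
    """Hypothetical Python analogue of a base device allocator."""

    @abstractmethod
    def empty_cache(self):
        ...

    @abstractmethod
    def memory_reserved(self):
        ...


class FakeAllocator(DeviceAllocatorSketch):
    """Toy backend that only tracks a reserved byte count."""

    def __init__(self, reserved_bytes):
        self._reserved = reserved_bytes

    def empty_cache(self):
        self._reserved = 0

    def memory_reserved(self):
        return self._reserved


_REGISTRY = {}


def register_device_allocator(device_type, allocator):
    _REGISTRY[device_type] = allocator


def get_device_allocator(device_type):
    return _REGISTRY[device_type]


def empty_cache(device_type):
    # Device-agnostic entry point: look up the backend, then delegate.
    get_device_allocator(device_type).empty_cache()


register_device_allocator("fake", FakeAllocator(4096))
empty_cache("fake")
```

The point of the pattern is that callers use one entry point and each backend only has to implement the base interface.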

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD, https://github.com/Camyll
2025-08-06 00:40:29 +00:00
3eb3da9b4b [dynamo][guards] Skip ID_MATCH guard on self.__class__.__closure__ (#159888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159888
Approved by: https://github.com/williamwen42
2025-08-06 00:36:43 +00:00
3ddfd46bd2 Cut a version of TORCH_ERROR_CODE_CHECK in headeronly from AOTI (#159604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159604
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-08-06 00:29:56 +00:00
6a82da392e [export] Fix generated schema for C++20/23 (#159871)
Summary: Fixing the issue from https://github.com/pytorch/pytorch/issues/159838

Test Plan:
buck run caffe2/:export_update_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/

Rollback Plan:

Differential Revision: D79647167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159871
Approved by: https://github.com/malfet
2025-08-06 00:23:05 +00:00
22bedc429f Extract some HOP utils to be importable (#159705)
Useful helper function for stage 1 export -> manual partitioner -> stage 2 compile users

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159705
Approved by: https://github.com/zou3519
ghstack dependencies: #159134
2025-08-05 23:59:47 +00:00
49abc0e3f8 [Take 2] Setup TorchBench in Docker (#159300)
Fix and reland https://github.com/pytorch/pytorch/pull/158613. I keep `checkout_install_torchbench` in the `.ci/pytorch/macos-test.sh` script because it's still used there, and there is no Docker.

### Testing

MacOS perf nightly run https://github.com/pytorch/pytorch/actions/runs/16580798470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159300
Approved by: https://github.com/ZainRizvi
2025-08-05 23:47:42 +00:00
1052604acd fix logging setup issue for Windows.. (#159887)
When we set up the logging config as described in the guide: https://docs.pytorch.org/docs/stable/logging.html
For example:
    TORCH_LOGS="+schedule,+inductor,+output_code"
On Linux, it shows as:
```cmd
declare -x SSH_TTY="/dev/pts/0"
declare -x TERM="xterm"
declare -x TORCH_LOGS="+schedule,+inductor,+output_code"
declare -x USER="xu"
```
On Windows, it shows as:
```cmd
TORCHINDUCTOR_WINDOWS_TESTS=1
TORCH_LOGS="+schedule,+inductor,+output_code"
UCRTVersion=10.0.22000.0
```
Linux shows the quotes by default, while Windows does not.
Besides that, Windows keeps the quotes as part of the value when processing the env var.

On Linux, we will get variable: "+schedule,+inductor,+output_code"
On Windows, we will get variable: '"+schedule,+inductor,+output_code"'

So we need to remove the outer quotes on Windows.
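
A minimal sketch of the quote stripping, assuming only one matching pair of outer quotes should be removed. The helper name `strip_outer_quotes` is hypothetical and not the function changed in this PR.

```python
def strip_outer_quotes(value):
    """Remove one pair of matching outer quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
        return value[1:-1]
    return value


# Value as seen on Windows, with the quotes kept as part of the string.
windows_value = '"+schedule,+inductor,+output_code"'
cleaned = strip_outer_quotes(windows_value)
```

An unquoted value, as seen on Linux, passes through unchanged.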

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159887
Approved by: https://github.com/angelayi
2025-08-05 23:44:38 +00:00
fe8984a9f4 Set PYTHONHOME for inductor subprocesses using torch (#159382)
Summary:
This is needed for subprocesses that are trying to call back into torch
functionality, i.e. anything that's also setting `PYTHONPATH`.  There are more
`sys.executable` subprocesses in torch/ but it seems like they're fine.

Test Plan: Local inference runs.

Reviewed By: aorenste

Differential Revision: D79124705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159382
Approved by: https://github.com/aorenste
2025-08-05 23:32:48 +00:00
74a754aae9 Add meta kernel for sdpa_math_for_mps (#159695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159695
Approved by: https://github.com/malfet
ghstack dependencies: #159456
2025-08-05 22:27:06 +00:00
b1ec088113 [mps] Turn on inductor dynamic shapes tests (#159456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159456
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-08-05 22:27:06 +00:00
fb35a9ea4a [export] Improve error messages (#159881)
Originally, if the PT2 errored when loading, we would fall back to the old loader to handle BC issues. However, this hides the error message when an up-to-date PT2 fails to load for some other reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159881
Approved by: https://github.com/yushangdi
2025-08-05 22:26:48 +00:00
8034b2a732 [inductor] Add TLParse artifact for logging runtime of collective and compute ops (#159730)
Summary:

- debug.py: Added log_runtime_estimates() function to dump runtime estimation data as structured tlparse artifacts in JSON format
- test_structured_trace.py: Added comprehensive test coverage with testing compute and collective ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159730
Approved by: https://github.com/yushangdi
ghstack dependencies: #159190
2025-08-05 22:06:32 +00:00
64cc6f06b1 [Inductor] Revert minimal changes to avoid internal test failures (#159809)
The diff/PR https://github.com/pytorch/pytorch/pull/159211 caused a bunch of test failures for graph compiler(T232684410). But I couldn't figure out a forward fix so far. So with this diff/PR, I'm proposing to revert the minimal changes to resolve the test failures.

I'll continue the debugging, and re-land the reverted changes once we find out a forward fix.

Differential Revision: [D79221721](https://our.internmc.facebook.com/intern/diff/D79221721/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159809
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2025-08-05 22:05:26 +00:00
410812763b Revert "[Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777)"
This reverts commit bbc0df1094b5a4dcd2cce83f8402127b07913231.

Reverted https://github.com/pytorch/pytorch/pull/159777 on behalf of https://github.com/izaitsevfb due to breaking inductor test on ROCm ([comment](https://github.com/pytorch/pytorch/pull/159777#issuecomment-3156770098))
2025-08-05 22:00:24 +00:00
bdb07a2bc5 [Cutlass] Allow offsets to be passed as arguments to kernel (#159761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159761
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #159760
2025-08-05 21:59:07 +00:00
8085edc8f9 [autograd] torch._C._set_view_replay_enabled state leaking into other tests (#159840)
This was causing view_fns to pop up in tests that ran after `TestAutograd.test_view_replay_enabled` where it isn't used as a context manager. It is unclear to me why we would want `_force_original_view_tracking` to mutate global state on __init__ rather than on __enter__, that could be an alternative fix.
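
The difference between mutating state in `__init__` and in `__enter__` can be shown with a toy flag. Everything below is a hypothetical illustration (`_flag`, `ToggleOnInit`, `ToggleOnEnter`); it does not reproduce the implementation of `_force_original_view_tracking`.

```python
_flag = False  # hypothetical stand-in for the global view-replay setting


def get_flag():
    return _flag


class ToggleOnInit:
    """Mutates global state in __init__, so a bare call leaks the change."""

    def __init__(self, mode):
        global _flag
        self.prev = _flag
        _flag = mode

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        global _flag
        _flag = self.prev
        return False


class ToggleOnEnter:
    """Mutates global state only while the with-block is active."""

    def __init__(self, mode):
        self.mode = mode
        self.prev = None

    def __enter__(self):
        global _flag
        self.prev = _flag
        _flag = self.mode
        return self

    def __exit__(self, *exc):
        global _flag
        _flag = self.prev
        return False


ToggleOnInit(True)      # constructed but never entered
leaked = get_flag()     # the change is still visible here

_flag = False           # reset for the second experiment
ToggleOnEnter(True)     # constructed but never entered
not_leaked = get_flag()

with ToggleOnEnter(True):
    inside = get_flag()
after = get_flag()
```

With the `__init__` variant, any call site that forgets the `with` statement silently changes the setting for everything that runs afterwards.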

FIXES https://github.com/pytorch/pytorch/issues/156306 https://github.com/pytorch/pytorch/issues/156289 https://github.com/pytorch/pytorch/issues/156265 https://github.com/pytorch/pytorch/issues/156209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159840
Approved by: https://github.com/albanD
2025-08-05 21:57:49 +00:00
882d50c5bf [C10] Add Scalar::isUnsigned() method (#159877)
It returns true if the Scalar holds an unsigned integral value

With the implications of `Tag::HAS_u` semantic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159877
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2025-08-05 21:43:21 +00:00
b52a4d0821 [ez][CI] Remove some unused docker images (#159171)
Removes unused docker images from the docker build workflow
Then removes unused definitions in build.sh

The only one I left is the vllm one because I'm pretty sure it's going to be used in the future

I assume everything not mentioned is old and we forgot to remove them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159171
Approved by: https://github.com/yangw-dev
2025-08-05 21:31:53 +00:00
a45a840926 [CI] Disable check-labels and check_mergeability (#159900)
See https://github.com/pytorch/pytorch/issues/159825
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159900
Approved by: https://github.com/clee2000
2025-08-05 21:16:12 +00:00
9b953bb3fb [BE] Update TensorPipe pin (#159834)
No functional changes, just:
- Update C++ standard to C++17
- Update `cmake` min version to 3.18
- Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10)
- Replace boost optional implementation with `std::optional` wrapper
- Make it compilable with gcc-14.x and newer by including `cstddef` in a few headers
- Avoid using deprecated enums for MacOS builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834
Approved by: https://github.com/Skylion007
2025-08-05 20:45:09 +00:00
eb25a95a6e Fix inductor memory estimation when a single buf has multiple mutations. Add runtime verification of mem tracking (#159569)
With fsdp, we sometimes have multiple, non-overlapping views of a single buffer which are all mutated. Previously we considered the original buffer as an allocation, and made the mutated buffer the deallocation. With multiple mutations of the same buffer, we need to consider the original buffer as deallocated only when all of its aliases die (and avoid double counting the input buffer size). See comment inline:

```
    When an operation mutates a buffer in-place, the scheduler creates a new buffer name
    to track the "before" and "after" states, even though they share the same memory.
    The mutated buffer represents a rename with zero allocation and deallocation cost.
    During dependency tracking, we transfer dependencies from the mutated name back to
    the original buffer, ensuring the original memory is only freed when all aliases
    are done.
    This handles cases where a buffer has multiple non-overlapping aliases - rather than
    trying to assign free costs to individual aliases, we forward all alias dependencies
    to the original buffer.
    Consider:
        buf0 = op0()
        buf1 = mutation_op_(buf0)
        del buf0
        ...
        op(buf1)
        del buf1
    The only memory events are the creation prior to op0, and the deletion following buf1.
```
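
The dependency forwarding described in the comment can be sketched as below. This is a simplified model with made-up inputs (`alloc_sizes`, `alias_of`, `last_use`) and a hypothetical function name; Inductor's scheduler data structures are different.

```python
def free_points(alloc_sizes, alias_of, last_use):
    """Compute when each real allocation can be freed.

    alloc_sizes: {buffer: bytes} for buffers that own memory.
    alias_of:    {mutated_name: name_it_was_derived_from}; zero-cost renames.
    last_use:    {name: index of the last step that reads the name}.
    Returns {buffer: last step at which the buffer or any alias is used}.
    """
    free_at = {buf: last_use.get(buf, -1) for buf in alloc_sizes}
    for alias, parent in alias_of.items():
        # Follow rename chains back to the buffer that owns the memory.
        root = parent
        while root in alias_of:
            root = alias_of[root]
        # Forward the alias's dependency to the original buffer.
        free_at[root] = max(free_at[root], last_use.get(alias, -1))
    return free_at


# buf0 owns 1024 bytes; buf1 and buf2 are mutated views of it.
result = free_points(
    alloc_sizes={"buf0": 1024},
    alias_of={"buf1": "buf0", "buf2": "buf0"},
    last_use={"buf0": 1, "buf1": 5, "buf2": 3},
)
```

In this model `buf0` is counted once and freed only after step 5, when the last alias dies, rather than once per alias.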

As @IvanKobzarev's logs in https://github.com/pytorch/pytorch/pull/158361/files#diff-e173a1d52aff49959c9f6d17ecc09946d8a616fc5909df884e62a15e1ebd1d41R1776-R1807 show, it can be a bit of a pain to pinpoint which part of our memory calculation is incorrect.

This pr also adds a runtime verifier `config.test_configs.track_memory_lifecycle` which tracks buffer allocation and deallocation, and errors if their lifetime does not match our expectations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159569
Approved by: https://github.com/IvanKobzarev
2025-08-05 19:58:11 +00:00
eqy
9884d0351e [CUDA] Decrease launch bounds of CTCLoss backward for blackwell (#159522)
Otherwise we see `CUDA error: too many resources requested for launch`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159522
Approved by: https://github.com/janeyx99
2025-08-05 19:26:25 +00:00
d7c83972d5 tools: Add mode to find python automatically (#159820)
Add support for automatically finding Python interpreters in manylinux
environments to our wheel building script. Scaffolding for sequential builds

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159820
Approved by: https://github.com/malfet
2025-08-05 19:19:22 +00:00
e06b110f73 [Testing] Add MPS to NATIVE_DEVICES (#153835)
This will eventually allow me to enable more opinfo tests against the MPS device. It was supposed to be a very simple change, but it actually required minor adjustments to lots of test files, namely:
- Introduce `all_mps_types_and` that is very similar to `all_types_and`, but skips `float64`
- Decorate lots of tests with `@dtypesIfMPS(*all_mps_types())`
- Skip `test_from_dlpack_noncontinguous` as it currently crashes (need to be fixed)
- Add lots of `expectedFailureIfMPS`
- Delete all `@onlyNativeDeviceTypesAnd("mps")`

<sarcasm> I love how well documented this variable is </sarcasm>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153835
Approved by: https://github.com/Skylion007
2025-08-05 18:57:35 +00:00
0ba09a6d34 fix link for tutorial of inductor on windows (#159853)
fix link issue from https://docs.pytorch.org/tutorials/prototype/inductor_windows.html to https://docs.pytorch.org/tutorials/unstable/inductor_windows.html due to structure change with pr https://github.com/pytorch/tutorials/pull/3489
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159853
Approved by: https://github.com/sekyondaMeta

Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
Co-authored-by: Zesheng Zong <zesheng.zong@outlook.com>
2025-08-05 18:37:47 +00:00
aeb5321b63 Allow controlling PG backend and options via init_device_mesh (#159371)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159371
Approved by: https://github.com/wconstab, https://github.com/fduwjj, https://github.com/wanchaol
2025-08-05 12:44:14 +00:00
625108ede2 [inductor] consolidate common GEMM triton param retrieval (#159383)
\# Why

- Make loop iteration simpler
- Have a common spot where to make modifications that affect
  all the GEMM Triton templates, avoiding missed spots

\# What

- pull out common logic of taking the BaseConfig objects
  and turning them into kwargs to feed into maybe_append_choice
  for Triton GEMM templates

Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383
Approved by: https://github.com/jansel
2025-08-05 11:42:25 +00:00
09e5a93fcb Improve graph output alias with subclass error message (#159619)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159619
Approved by: https://github.com/albanD
2025-08-05 06:47:31 +00:00
908c5cc4c0 Generalize torch._C._set_allocator_settings to be generic (#156175)
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #159629, #150312, #156165
2025-08-05 04:08:42 +00:00
c1145852a5 Deprecate overleap functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #159629, #150312
2025-08-05 04:08:42 +00:00
ae1a706444 Refactor CUDAAllocatorConfig to reuse AcceleratorAllocatorConfig (#150312)
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in the following PR and keep them only for BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #159629
2025-08-05 04:08:04 +00:00
56d19a5ced Fix AllocatorConfig potential SIO issue (#159629)
# Motivation
As @ScottTodd identified in this [comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3141524874), using STL containers like `std::string` and `std::unordered_set` at static init time can cause static initialization order issues. This PR is based on and modified from his original PR: https://github.com/pytorch/pytorch/pull/159607. I’m stacking this PR here to help facilitate the landing and validation process.

Co-authored-by: @ScottTodd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159629
Approved by: https://github.com/ScottTodd, https://github.com/albanD
2025-08-05 04:07:51 +00:00
b6c53383fe [Dynamo][Better Engineering] Type annotation for torch/_dynamo/output_graph.py (#159602)
As part of the better engineering effort, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to `torch/_dynamo/output_graph.py`

Running
```
mypy torch/_dynamo/output_graph.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2163 | 4792 | 45.14% | 121 | 268 | 45.15% |
| This PR | 4818 | 4818 | 100.00% | 268 | 268 | 100.00% |
| Delta    | +2655 | +26 | +54.84% | +147 | 0 | +54.85% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159602
Approved by: https://github.com/Skylion007
2025-08-05 03:50:54 +00:00
4fd5fabee9 skip XPU for dataloader CPU only unit test (#159811)
Fixes [#159802](https://github.com/pytorch/pytorch/issues/159802)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159811
Approved by: https://github.com/izaitsevfb
2025-08-05 03:44:01 +00:00
bbc0df1094 [Inductor][Triton] Support TMA before strict 3.4 cutoff (#159777)
Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs.

Test Plan:
Relying on CI. Should be a NFC.

Rollback Plan:

Reviewed By: davidberard98

Differential Revision: D79378792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159777
Approved by: https://github.com/davidberard98
2025-08-05 03:29:13 +00:00
33ec6e3e9a Remove pin on libuv from instructions (#159504)
This package doesn't exist at conda-forge and causes some confusion for users.
see https://anaconda.org/conda-forge/libuv/files?version=1.39.0

libuv is quite stable, so the newer versions should be fine. we build with them anyway at conda-forge.

see: https://github.com/conda-forge/libuv-feedstock/issues/80

Hopefully this can help future users.

Fixes https://github.com/conda-forge/libuv-feedstock/issues/80

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159504
Approved by: https://github.com/seemethere
2025-08-05 03:18:42 +00:00
efc4b460b3 Add cascade sum support for Inductor CPP backend (#156296)
Fixes #154703

Add cascade summation support for Inductor CPP backend to improve precision for large size summation.

Currently, Inductor CPP performs the sum reduction directly. As shown in #154703, when the size of the sum is large and the degree of parallelism is small, direct reduction causes an intolerable loss of precision:
```
extern "C"  void kernel(float* in_out_ptr0,
                       const float* in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    {
        {
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(3000000000L); x0+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(3000000000L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                        tmp_acc0_vec = tmp_acc0_vec + tmp0;
                    }
                }
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float, 1>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<int64_t>(0L)] = static_cast<float>(tmp_acc0);
        }
    }
    {
        {
            {
                auto tmp0 = out_ptr0[static_cast<int64_t>(0L)];
                auto tmp1 = static_cast<float>(3000000000.0);
                auto tmp2 = tmp0 / tmp1;
                in_out_ptr0[static_cast<int64_t>(0L)] = tmp2;
            }
        }
    }
}
```

After adding cascade sum support:

```
extern "C"  void kernel(float* in_out_ptr0,
                       const float* in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    {
        {
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            at::vec::Vectorized<float> masked_tmp_acc0_vec = at::vec::Vectorized<float>(0);
            CascadeSumHelper<float, 65536> scalar_cascade_helper0(static_cast<int64_t>(3000000000L));
            CascadeSumHelper<at::vec::Vectorized<float>, 65536> cascade_helper0(static_cast<int64_t>(187500000L));
            CascadeSumHelper<at::vec::Vectorized<float>, 65536> masked_cascade_helper0(static_cast<int64_t>(0L));
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(3000000000L); x0+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(3000000000L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                        tmp_acc0_vec = cascade_sum_combine(tmp0, &cascade_helper0);
                    }
                }
            }
            tmp_acc0 = cascade_sum_final(&scalar_cascade_helper0);
            tmp_acc0_vec = cascade_sum_final(&cascade_helper0);
            masked_tmp_acc0_vec = cascade_sum_final(&masked_cascade_helper0);
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float, 1>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec + masked_tmp_acc0_vec);
            out_ptr0[static_cast<int64_t>(0L)] = static_cast<float>(tmp_acc0);
        }
    }
    {
        {
            {
                auto tmp0 = out_ptr0[static_cast<int64_t>(0L)];
                auto tmp1 = static_cast<float>(3000000000.0);
                auto tmp2 = tmp0 / tmp1;
                in_out_ptr0[static_cast<int64_t>(0L)] = tmp2;
            }
        }
    }
}
```
This will inevitably reduce performance when cascade sum is turned on.
For the case shown in #154703: performance reduced by ~3%.
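
The precision effect can be reproduced in pure Python by emulating float32 rounding. The sketch below uses a two-level blocked sum, which is a simplification of cascade summation and not the `CascadeSumHelper` from the generated kernel; the helper names and the block size of 256 are assumptions chosen for illustration.

```python
import math
import struct


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    """Accumulate everything into a single float32 accumulator."""
    acc = 0.0
    for v in values:
        acc = f32(acc + v)
    return acc


def blocked_sum_f32(values, block=256):
    """Sum fixed-size blocks first, then add up the block partials.

    Each partial stays small, so the addends lose fewer low-order bits
    than they do against one large running total.
    """
    total = 0.0
    for start in range(0, len(values), block):
        partial = 0.0
        for v in values[start:start + block]:
            partial = f32(partial + v)
        total = f32(total + partial)
    return total


values = [f32(0.1)] * 100_000
exact = math.fsum(values)  # accurately rounded float64 reference
naive_error = abs(naive_sum_f32(values) - exact)
blocked_error = abs(blocked_sum_f32(values) - exact)
```

For this input the blocked variant ends up much closer to the reference than the single accumulator, at the cost of the extra bookkeeping that explains the small slowdown mentioned above.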

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156296
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-08-05 02:54:32 +00:00
1ca8388442 [BE][MPS] Remove unused size12 variable (#159832)
Fixes following compilation warning
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:433:8: warning: unused variable 'size12' [-Wunused-variable]
  auto size12 = input_sizes[1] * input_sizes[2];
       ^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159832
Approved by: https://github.com/dcci
2025-08-05 02:32:06 +00:00
b69497351d [nativert] force resize to zero. (#159683)
Summary:
this was quite a miserable bug. there are a few kernels that don't explicitly resize outputs to zero, which led to some weird UB.

Rollback Plan:

Differential Revision: D79476454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159683
Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier
2025-08-05 02:25:31 +00:00
482f069c41 [C10D] fix slow init due to repeated dns resolution failure (#159596)
It can be very slow to repeatedly hit DNS resolution failure, but
it's very helpful to have DNS names in logs by default. So we try to use DNS
but if we hit a transient failure we just disable it for the remainder of the
job, logging IP addresses instead.
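
The "try DNS once, then fall back for the rest of the job" idea can be sketched like this. The class `HostnameResolver` and its `lookup` parameter are hypothetical; the actual fix lives in the C++ c10d code and may differ in detail.

```python
import socket


class HostnameResolver:
    """Resolve IPs to names, but stop trying after the first failure."""

    def __init__(self, lookup=socket.gethostbyaddr):
        # `lookup` is injectable so the behavior can be tested offline.
        self._lookup = lookup
        self._dns_disabled = False

    def describe(self, ip):
        if self._dns_disabled:
            return ip
        try:
            hostname, _aliases, _addresses = self._lookup(ip)
            return hostname
        except OSError:
            # Resolution failed: avoid paying the timeout again and
            # log raw IP addresses for the remainder of the job.
            self._dns_disabled = True
            return ip
```

After one failed lookup, every later call returns the IP immediately without touching DNS.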

Fixes #159007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159596
Approved by: https://github.com/d4l3k
2025-08-05 02:15:26 +00:00
85d931f29e Use uppercase OR when checking for system XNNPACK (#159527)
This PR fixes `cmake/Dependencies.cmake` to work when compiling with `USE_SYSTEM_XNNPACK=ON` by changing a lowercase `or` to an uppercase `OR`.

---

For a personal project, I was building pytorch with a customized build of XNNPACK. When trying to do so I encountered the following error:

```
CMake Error at cmake/Dependencies.cmake:566 (if):
  if given arguments:

    "NOT" "XNNPACK_LIBRARY" "or" "NOT" "microkernels-prod_LIBRARY"

  Unknown arguments specified
Call Stack (most recent call first):
  CMakeLists.txt:868 (include)
```

Upon making the change in this PR (changing `or` to `OR`), the process continued as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159527
Approved by: https://github.com/janeyx99
2025-08-05 02:10:53 +00:00
8a2f53c523 Recursively sync fbgemm submodules before build (#159477)
ROCm inductor benchmark builds are failing at the fbgemm build stage: https://ossci-raw-job-status.s3.amazonaws.com/log/46800456622
```
2025-07-27T08:00:32.3443858Z /var/lib/jenkins/pytorch/fbgemm/src/RowWiseSparseAdagradFused.cc:389:18: error: no matching function for call to ‘asmjit::v1_17::x86::Vec::Vec(uint32_t)’
2025-07-27T08:00:32.3444080Z   389 |         x86::Xmm partial_sum_xmm(partial_sum_vreg.id());
```

It looks like asmjit fails to build. This seems to be due to the submodules of fbgemm not being updated after checking out a new commit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159477
Approved by: https://github.com/pruthvistony, https://github.com/eqy
2025-08-05 02:00:54 +00:00
b59b61a099 Add avg_pool3d backward pass for MPS (#159089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159089
Approved by: https://github.com/malfet
2025-08-05 01:55:38 +00:00
57ab39f7e4 Update torch-xpu-ops commit pin (#159621)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1f7a57](1f7a57f507) includes:

- Add Template Parameter to the function `gpu_kernel` for Controlling Broadcasting Vectorization
- Add optional NaN checks to XCCL
- Fix NllLossForwardReduce2DKernelFunctor accuracy
- Extend the existing communication logging to include the reduction operation for collective calls
- [Reland] Install xpu codegen header to torch/include
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159621
Approved by: https://github.com/EikanWang
2025-08-05 01:46:15 +00:00
182975e01a [Dynamo] Enable torch function dispatch on HOPs (#159708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159708
Approved by: https://github.com/zou3519, https://github.com/XilunWu
ghstack dependencies: #159707
2025-08-05 01:43:22 +00:00
9f8cfe7476 [Dynamo] Fix arg ordering in tf modes (#159707)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159707
Approved by: https://github.com/zou3519
2025-08-05 01:43:21 +00:00
e273ff028a Fix failing test (#159800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159800
Approved by: https://github.com/aorenste
2025-08-05 00:28:51 +00:00
5e0fc2c9a9 [AOTI] don't allow int32 indices if {non-inf, > int32_max} upper bound is provided (#159433)
**Motivation / Context**: (what I _think_ is happening here)

In "eager"/just-in-time PT2 usage, dynamo/inductor will guard on whether indices fit in int32 or not. So it's generally safe in Inductor code to rely on the example values for symbolic ints in order to determine whether indices fit in int32, because the indices will be guarded on anyway; and if the inputs ever increase to `>int32_max`, dynamo will cause a recompilation.

But with AOTI, those int32 guards aren't respected; so if the example input is `< int32_max` but can be `> int32_max` during future execution, then the future execution might fail / IMA.

**Solution space**

Export allows users to specify which dimensions are dynamic, and to provide **ranges of valid sizes**.

One solution idea is to always respect the upper bound of the dynamic shape range when doing AOTI; if the index's range includes values `>int32_max`, then don't use the hint and assume that this index doesn't fit in int32.

However, the problem with this is that many users may specify dynamism without specifying a range of values - the upper bound of the range will be set to the default of `inf`. Such use cases could potentially experience a perf regression if we implemented the idea above.

To prevent any such regressions, this implementation ignores the hints/example values for AOTI and relies solely on the specified range only when the upper bound of the range isn't inf. If users explicitly specify a range that extends past int32, we can be fairly sure that they actually do need values `>int32_max`.

If we continue to see correctness issues even with this implementation, we could consider more aggressively relying on the ranges.
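
The rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Inductor's actual code: `can_use_int32_index`, its parameters, and `INT32_MAX` are hypothetical names introduced here.

```python
import math

INT32_MAX = 2**31 - 1


def can_use_int32_index(hint: int, upper_bound: float, aot_mode: bool) -> bool:
    """Hypothetical helper: decide whether an index may be stored as int32.

    hint        -- the example value seen at compile time
    upper_bound -- upper bound of the user-specified dynamic range
                   (math.inf when no range was given)
    aot_mode    -- True for AOTI, where int32 guards are not enforced
    """
    if aot_mode and not math.isinf(upper_bound):
        # An explicit, finite range was provided: trust it over the hint.
        return upper_bound <= INT32_MAX
    # JIT mode (guards trigger recompilation) or no explicit range:
    # fall back to the example value.
    return hint <= INT32_MAX
```

With an unbounded range the hint still decides, which is what keeps existing users from seeing a perf regression; only an explicit bound past `INT32_MAX` forces 64-bit indexing.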

Differential Revision: [D79220301](https://our.internmc.facebook.com/intern/diff/D79220301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159433
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
2025-08-05 00:17:09 +00:00
bc4b04e058 DeviceCopy should have the same layout as input (#159615)
Summary: Fix https://github.com/pytorch/pytorch/issues/159612

- Fix the meta implementation of `nan_to_num` so that it preserves the stride of the input
- The DeviceCopy IR node should always preserve the input's layout, so we don't end up with a contiguous call during device copy

Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_d2h_copy
```

Rollback Plan:

Differential Revision: D79411407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159615
Approved by: https://github.com/eellison
2025-08-04 23:56:58 +00:00
6b414f56a4 Revert "[inductor] add lowering for repeat_interleave.Tensor with output size specified (#147160) (#158462)" (#159798)
This reverts commit 305a03727672de42870f956ddf4ad9fa424443e1.

Reason: causes device-side assertion failures when running with this repro (a minimized version of a failure seen in a real model)

```
import torch
def ri(inp, repeats, output_size):
    return torch.repeat_interleave(inp, repeats, output_size=output_size)
inp = torch.arange(0, 4, device="cuda").reshape(-1, 1)
x = torch.tensor([1, 2, 3, 4], device="cuda")
ri_c = torch.compile(ri)
print(ri(inp, x, 10))
print(ri_c(inp, x, 10))
```

which leads to errors like

```
/tmp/torchinductor_dberard/3h/c3hlb22fpptebupstsuhl6kexa6z3upgbnyxln7c24gfcr5747iu.py:30: unknown: block: [0,0,0], thread: [10,0,0] Assertion `index out of bounds: 0 <= tmp5 < 4` failed.
```

Differential Revision: [D79591561](https://our.internmc.facebook.com/intern/diff/D79591561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159798
Approved by: https://github.com/danzimm
2025-08-04 23:39:20 +00:00
fb8f32ef52 Revert "[mps] Turn on inductor dynamic shapes tests (#159456)"
This reverts commit 19f1f9960db7f29f2110a7f49f06a1a23c651ecf.

Reverted https://github.com/pytorch/pytorch/pull/159456 on behalf of https://github.com/davidberard98 due to Sorry - this causes a merge conflict with https://github.com/pytorch/pytorch/pull/159798, which I'm trying to land with co-dev to resolve a sev ([comment](https://github.com/pytorch/pytorch/pull/159456#issuecomment-3152751821))
2025-08-04 23:11:05 +00:00
7ba996bbaa [Cutlass] Fix wrapper code generation breakage (#159760)
Fixes issues introduced by https://github.com/pytorch/pytorch/pull/159355

The issue got past OSS CI because the H100 tag wasn't added. Not sure how to prevent these kinds of issues in the future; perhaps we should run H100 on Inductor PRs?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159760
Approved by: https://github.com/angelayi
2025-08-04 23:03:03 +00:00
ddbdcdc710 [cutlass backend][test] Expand FP8 tests to FP16 (#159538)
Differential Revision: [D79317343](https://our.internmc.facebook.com/intern/diff/D79317343/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159538
Approved by: https://github.com/mlazos
2025-08-04 23:01:55 +00:00
19f1f9960d [mps] Turn on inductor dynamic shapes tests (#159456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159456
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-08-04 22:44:31 +00:00
fd6655a0f5 Feature: Implement support for cudnn_batch_norm_out kernel to replace the autogen approach. (#123020)
Fixes #115611

The autogenerated kernel may cause a redundant copy, so this PR implements a dedicated kernel to improve efficiency.

Test Case:

```c++
#include <torch/torch.h>
#include <iostream>
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>

int main() {
    auto input = torch::rand({2, 3, 4, 4}, torch::device(torch::kCUDA));
    auto weight = torch::randn({3}, torch::device(torch::kCUDA));
    auto bias = torch::randn({3}, torch::device(torch::kCUDA));
    auto running_mean = torch::zeros({3}, torch::device(torch::kCUDA));
    auto running_var = torch::ones({3}, torch::device(torch::kCUDA));

    bool training = true;
    double exponential_average_factor = 0.1;
    double epsilon = 1e-5;

    auto output = torch::empty_like(input);
    auto save_mean = torch::empty({3}, torch::device(torch::kCUDA));
    auto save_var = torch::empty({3}, torch::device(torch::kCUDA));
    auto reserve = torch::empty({0}, torch::device(torch::kCUDA)); // empty place-holder

    at::native::cudnn_batch_norm_out(input, weight, bias, running_mean, running_var, training, exponential_average_factor, epsilon, output, save_mean, save_var, reserve);
    auto outputs = at::native::cudnn_batch_norm(input, weight, bias, running_mean, running_var, training, exponential_average_factor, epsilon);

    bool is_close_output = torch::allclose(output, std::get<0>(outputs));
    bool is_close_save_mean = torch::allclose(save_mean, std::get<1>(outputs));
    bool is_close_save_var = torch::allclose(save_var, std::get<2>(outputs));
    bool is_close_reserve = torch::allclose(reserve, std::get<3>(outputs));

    std::cout << "Is output close: " << is_close_output << std::endl;
    std::cout << "Is save_mean close: " << is_close_save_mean << std::endl;
    std::cout << "Is save_var close: " << is_close_save_var << std::endl;
    std::cout << "Is reserve close: " << is_close_reserve << std::endl;

    return 0;
}
```

Please CC @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123020
Approved by: https://github.com/andrewor14, https://github.com/eqy, https://github.com/albanD
2025-08-04 22:40:33 +00:00
a7f3bdf550 [Dynamo][Better Engineering] Type coverage for torch/_dynamo/utils.py (#159580)
As part of the better engineering effort, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to `torch/_dynamo/utils.py`

Running
```
mypy torch/_dynamo/utils.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2163 | 4792 | 45.14% | 121 | 268 | 45.15% |
| This PR | 4818 | 4818 | 100.00% | 268 | 268 | 100.00% |
| Delta    | +2655 | +26 | +54.84% | +147 | 0 | +54.85% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159580
Approved by: https://github.com/williamwen42
2025-08-04 21:51:53 +00:00
510e8b4ae0 [inductor] use writable temp file on windows (#159738)
Use `WritableTempFile` on Windows; for reference, see https://github.com/pytorch/pytorch/pull/159342
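
For context, the usual reason for such a helper is that a `NamedTemporaryFile` created with the default `delete=True` cannot be reopened by name on Windows while it is still open. The sketch below shows the general pattern only; `writable_temp_file` is a hypothetical stand-in and is not claimed to match the real `WritableTempFile` implementation.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def writable_temp_file(mode: str = "w", suffix: str = ""):
    """Hypothetical stand-in: a temp file that can be reopened by name.

    The file is created with delete=False so a second handle can open it
    while the first is still open, then removed manually on exit.
    """
    f = tempfile.NamedTemporaryFile(mode=mode, suffix=suffix, delete=False)
    try:
        yield f
    finally:
        f.close()
        os.unlink(f.name)
```

A caller writes through the yielded handle, flushes, and can then hand `f.name` to another reader (for example a compiler subprocess) before the `with` block ends.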

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159738
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-08-04 21:51:02 +00:00
83ba3f1101 Revert "[inductor] allocate non-blocking copy destinations in pinned memory (#155121) (#158758)"
This reverts commit 6085bf7565fec0d2ed26e8590001f09c05adbbe4.

Reverted https://github.com/pytorch/pytorch/pull/158758 on behalf of https://github.com/davidberard98 due to I need to revert #158462 (it causes device-side asserts), and this PR causes a merge conflict in the test file. Sorry about that! ([comment](https://github.com/pytorch/pytorch/pull/158758#issuecomment-3152490371))
2025-08-04 21:47:11 +00:00
1fad16aacb Revert "[inductor] move all cpu scalars using pinned memory for graph partition (#155360) (#158983)"
This reverts commit 444e2381d07a14cb501c00d11f9e63a3f1d2c86e.

Reverted https://github.com/pytorch/pytorch/pull/158983 on behalf of https://github.com/davidberard98 due to I need to revert #158462 (it causes device-side asserts), and this PR causes a merge conflict in the test file. Sorry about that! ([comment](https://github.com/pytorch/pytorch/pull/158758#issuecomment-3152490371))
2025-08-04 21:47:11 +00:00
444e2381d0 [inductor] move all cpu scalars using pinned memory for graph partition (#155360) (#158983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158983
Approved by: https://github.com/eellison
ghstack dependencies: #158758
2025-08-04 21:42:05 +00:00
6085bf7565 [inductor] allocate non-blocking copy destinations in pinned memory (#155121) (#158758)
Fixes #155121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158758
Approved by: https://github.com/EikanWang, https://github.com/eellison
2025-08-04 21:22:11 +00:00
8201dbf4bc check driver to be >=12.4 to use fabric handles (#159697)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159697
Approved by: https://github.com/malfet
2025-08-04 21:05:39 +00:00
26d045bb60 Linux py 3.14 wheel builds (#157559)
Related to https://github.com/pytorch/pytorch/issues/156856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157559
Approved by: https://github.com/malfet, https://github.com/albanD
2025-08-04 20:55:19 +00:00
356ac3103a Revert "Stop parsing command line arguments every time common_utils is imported. (#156703)"
This reverts commit 310f901a71e53688866b14bb2f2b4c8eef9979b3.

Reverted https://github.com/pytorch/pytorch/pull/156703 on behalf of https://github.com/izaitsevfb due to breaking tests internally with `assert common_utils.SEED is not None` ([comment](https://github.com/pytorch/pytorch/pull/156703#issuecomment-3152337518))
2025-08-04 20:37:39 +00:00
d4109a0f99 [MPS] Add max_unpool1d/2d/3d (#159789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159789
Approved by: https://github.com/malfet
2025-08-04 20:00:59 +00:00
7ea789ccfb Revert #156868: Bring back symint check for sharding propagation cache (#159671)
Fixes #159601

Unfortunately #156868 introduced a couple regressions (see #159590 and #159601). This reverts the commit while I am working on a permanent fix. This means the `in_compiled_autograd_initial_trace` global flag will be removed and the `_are_we_tracing()` will instead be replaced with the symint preprocessing step during sharding prop post init.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159671
Approved by: https://github.com/xmfan
2025-08-04 19:58:48 +00:00
7e8197e34d Revert "Migrate ScalarType to headeronly (#159416)"
This reverts commit 1371a98b0e727f8a8916dd473b6dd0cff78c0449.

Reverted https://github.com/pytorch/pytorch/pull/159416 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D79452481 ([comment](https://github.com/pytorch/pytorch/pull/159416#issuecomment-3152138508))
2025-08-04 19:55:09 +00:00
50eac811a6 [typing] Constrain OrderedSet generic to be Hashable (#159684)
Ran across this typing bug while creating an OrderedSet from a type I didn't realize wasn't hashable, which failed at runtime. With this constraint, typing would've failed pre-runtime.
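
A minimal sketch of the idea, assuming a toy dict-backed `OrderedSet` (not the actual PyTorch class): bounding the type variable by `Hashable` lets a type checker reject unhashable element types that would otherwise only fail at runtime.

```python
from typing import Dict, Generic, Hashable, Iterable, Iterator, TypeVar

# Binding the type variable to Hashable lets a type checker reject
# e.g. OrderedSet[list] before the code ever runs.
T = TypeVar("T", bound=Hashable)


class OrderedSet(Generic[T]):
    """Toy insertion-ordered set backed by a dict (illustration only)."""

    def __init__(self, items: Iterable[T] = ()) -> None:
        self._items: Dict[T, None] = dict.fromkeys(items)

    def add(self, item: T) -> None:
        self._items[item] = None

    def __contains__(self, item: object) -> bool:
        return item in self._items

    def __iter__(self) -> Iterator[T]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)
```

Without the bound, `OrderedSet([[1], [2]])` type-checks and only raises `TypeError` when the dict tries to hash the lists.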

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159684
Approved by: https://github.com/Skylion007
2025-08-04 18:08:01 +00:00
4e0f179d0b Update the signature and test of torch.hamming_window() (#152682)
Fixes #146590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152682
Approved by: https://github.com/albanD
2025-08-04 17:50:42 +00:00
36e59d9b12 [c10d][nvshmem] fix missing override compilation error for nvshmem symmetric code (#159557)
Summary:
Fix error when compiling nvshmem code section `NVSHMEMSymmetricMemory.cu` with BUCK

```
fbcode/caffe2/torch/csrc/distributed/c10d/symm_mem/NVSHMEMSymmetricMemory.cu:154:20: error: 'get_buffer' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  154 | virtual at::Tensor get_buffer(int
      |                    ^
fbcode/caffe2/torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp:56:20: note: overridden virtual function is here
   56 | virtual at::Tensor get_buffer(int rank, c10::IntArrayRef sizes, c10::ScalarType dtype, int64_t storage_offset) = 0;
```

Test Plan:
Build test + CI

Rollback Plan:

Differential Revision: D78813586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159557
Approved by: https://github.com/kwen2501
2025-08-04 17:46:30 +00:00
fc340d0ca3 [export] Allow comparing device w/o index with device w/ index (#159665)
In the case where we have expected device "cuda" and given device "cuda:0" I think we should succeed?
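
A rough sketch of such a comparison on plain device strings. `parse_device` and `devices_compatible` are hypothetical helpers, not export's actual code, and treating the check symmetrically is an assumption made here.

```python
from typing import Optional, Tuple


def parse_device(spec: str) -> Tuple[str, Optional[int]]:
    """Split a device string such as "cuda:0" into (type, index)."""
    dev_type, sep, index = spec.partition(":")
    return dev_type, int(index) if sep else None


def devices_compatible(expected: str, given: str) -> bool:
    """Hypothetical check: a device without an index matches any index."""
    exp_type, exp_index = parse_device(expected)
    giv_type, giv_index = parse_device(given)
    if exp_type != giv_type:
        return False
    # Only compare indices when both sides actually specify one.
    if exp_index is None or giv_index is None:
        return True
    return exp_index == giv_index
```
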
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159665
Approved by: https://github.com/yushangdi
2025-08-04 17:00:07 +00:00
53e47af0f7 [dynamo][guards] Read the attr name from GetAttrGuardAccessor (#159754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159754
Approved by: https://github.com/jansel
ghstack dependencies: #159752
2025-08-04 16:51:27 +00:00
66ad881fc7 [dynamo][guards][refactor] Simplify type extraction from GuardManager (#159752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159752
Approved by: https://github.com/jansel
2025-08-04 16:51:27 +00:00
1d3eef27ac [ROCm CI] Migrate to MI325 Capacity (#159649)
Migrate mi300s to gfx942.

Related to https://github.com/pytorch/pytorch/pull/159059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159649
Approved by: https://github.com/huydhn
2025-08-04 16:48:12 +00:00
dd95900cec [AOTI] normalize_path_separator file path for Windows. (#159726)
`normalize_path_separator` file path for Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159726
Approved by: https://github.com/angelayi, https://github.com/jansel
2025-08-04 15:57:19 +00:00
1cdd665526 fix test_verbose_logs_dynamic_shapes with MSVC (#159573)
The `typeid` operator produces different output with different compilers. There is a good example in [cppreference](https://www.en.cppreference.com/w/cpp/language/typeid.html).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159573
Approved by: https://github.com/angelayi, https://github.com/jansel
2025-08-04 15:56:53 +00:00
7cb2dcd2dd [c10d][nvshmem] modify is_nvshmem_available runtime check to work with static-linked library (#159558) (#159561)
Summary:

Currently this function relies on the assumption that we link `libnvshmem_device.a` statically and load `libnvshmem_host.so` at runtime. When `libnvshmem.a` (which combines the two) is linked statically, this check fails. Add a section that checks whether a symbol from the host API exists at runtime, to detect that nvshmem has been linked statically.
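
The general technique of probing the running process for an already-linked symbol can be sketched with `ctypes`. This is an illustration under assumptions, not the check added by this PR: `symbol_in_process` is a hypothetical helper, it relies on POSIX `dlopen(NULL)` semantics, and it only sees symbols exported in the dynamic symbol table.

```python
import ctypes


def symbol_in_process(name: str) -> bool:
    """Return True if `name` resolves among symbols already loaded in this process.

    ctypes.CDLL(None) wraps dlopen(NULL) on POSIX systems; this does not
    work on Windows.
    """
    process = ctypes.CDLL(None)
    # Attribute lookup performs a dlsym; ctypes raises AttributeError on a miss.
    return hasattr(process, name)
```

Passing the name of a host-API function would then report whether the library is present even though no separate shared object was loaded.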

Test Plan:
CI + sample run

Rollback Plan:

Differential Revision: D79177525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159561
Approved by: https://github.com/kwen2501
2025-08-04 15:40:29 +00:00
e5a81aa7ba Fix conversion of values in libtorch agnostic tests (#155115)
Because s390x uses a different (big-endian) byte order, copied data has to be placed in the last bytes so that an int32_t converted to int64_t keeps the same value. The same has to be done when it is converted back.

This change fixes test
TestLibtorchAgnosticCPU::test_my_ones_like_cpu
from
cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py on s390x.
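
The byte-placement issue can be reproduced with plain integers. The helpers below are hypothetical and only mimic the idea of storing a 4-byte value in an 8-byte slot; filling the unused bytes by sign extension is an assumption of this sketch.

```python
def widen_int32(value: int, byteorder: str) -> bytes:
    """Place a 4-byte int32 into an 8-byte slot so it reads back as the same int64.

    byteorder is "little" or "big".
    """
    raw = value.to_bytes(4, byteorder, signed=True)
    pad = (b"\xff" if value < 0 else b"\x00") * 4
    # Little-endian: low-order bytes come first, so the value fills the first 4 bytes.
    # Big-endian (e.g. s390x): low-order bytes come last, so it fills the last 4.
    return raw + pad if byteorder == "little" else pad + raw


def narrow_to_int32(slot: bytes, byteorder: str) -> int:
    """Read the int32 back out of the 8-byte slot, from the matching end."""
    raw = slot[:4] if byteorder == "little" else slot[4:]
    return int.from_bytes(raw, byteorder, signed=True)
```

Copying the four bytes to the front of the slot on a big-endian machine instead would make the int64 read as the original value shifted left by 32 bits.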

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155115
Approved by: https://github.com/huydhn
2025-08-04 13:40:22 +00:00
3e2aa4b0e3 Update pin to include Python 3.14 support (#159725)
Update Triton Pin to top of rel/3.4 branch : https://github.com/triton-lang/triton/tree/rel/3.4 . This is the same as release/3.4.x branch but also includes Python 3.14 support

This should unblock enablement of Python 3.14 support in this PR: https://github.com/pytorch/pytorch/pull/157559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159725
Approved by: https://github.com/davidberard98
2025-08-04 13:30:12 +00:00
6646461764 S390X: fix detection of magic number placeholder in inductor (#157784)
This change fixes multiple tests in
test/inductor/test_aot_inductor_arrayref.py
such as
test_cond_with_parameters_cpu_with_stack_allocation,
test_issue_140766_cpu_with_stack_allocation,
test_model_modified_weights_cpu_with_stack_allocation,
test_nested_tensor_from_jagged_cpu_with_stack_allocation.

Enable tests in test/inductor/test_aot_inductor_arrayref.py

This change is split off from https://github.com/pytorch/pytorch/pull/150116

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157784
Approved by: https://github.com/huydhn
2025-08-04 12:42:31 +00:00
f74da2a136 [xla hash update] update the pinned xla hash (#159758)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159758
Approved by: https://github.com/pytorchbot
2025-08-04 11:21:45 +00:00
d35b27dde5 [CUDA] Add some more missing @serialTest decorators (#159672)
Seems to fix #159663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159672
Approved by: https://github.com/Skylion007
2025-08-04 07:44:35 +00:00
a9dc1566d4 [MTIA Aten Backend] Migrate arange.start_out (#159540)
Differential Revision: [D79317519](https://our.internmc.facebook.com/intern/diff/D79317519/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159540
Approved by: https://github.com/malfet, https://github.com/nautsimon
2025-08-04 07:38:05 +00:00
33a1996714 Fix perf downgrad by reverting template use in use_mkldnn_matmul (#159024)
This PR fixes the performance regression introduced in #157520 by reverting the template use in `use_mkldnn_matmul`. Fixes https://github.com/pytorch/pytorch/issues/159031 and https://github.com/pytorch/pytorch/issues/159551.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159024
Approved by: https://github.com/mingfeima
2025-08-04 05:49:46 +00:00
ee62177c19 [dynamo] Be consistent with storing func source for UserMethodVariable (#159696)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159696
Approved by: https://github.com/jansel
ghstack dependencies: #159534
2025-08-04 05:12:44 +00:00
64cbaa876c [dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159534
Approved by: https://github.com/jansel
2025-08-04 05:12:44 +00:00
4516c59f5f [dynamo][source] Add special source for __code__ and __closure__ (#159722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159722
Approved by: https://github.com/jansel
2025-08-04 05:02:05 +00:00
8bc843a9ec [vllm hash update] update the pinned vllm hash (#159610)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159610
Approved by: https://github.com/pytorchbot
2025-08-04 04:06:09 +00:00
e39a62c70d Fix warnings in triton_helpers.py (#159719)
```
  /home/jansel/pytorch/torch/_inductor/runtime/triton_helpers.py:152: UserWarning: Logical operators 'and' and 'or' are deprecated for non-scalar tensors; please use '&' or '|' instead
    equal |= a_isnan and b_isnan
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159719
Approved by: https://github.com/Skylion007
2025-08-04 03:21:09 +00:00
978e3a9142 refresh expected results (#159727)
Just a regular update to account for recent <10% changes; CI is stable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159727
Approved by: https://github.com/anijain2305
2025-08-03 22:47:50 +00:00
e2a5c42e7e [BE][MPS] Build metal kernels of MacOS-14+ (#159733)
Which makes `#if __METAL_VERSION__ >= 310` guards for `bfloat` use support unnecessary.
Rename `kernels_bfloat.metallib` into `kernels_basic` and remove custom build/selection logic.

Part of https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159733
Approved by: https://github.com/dcci
ghstack dependencies: #159731, #159732
2025-08-03 20:53:58 +00:00
5116c49b52 [BE] Remove macos-13 guard from bench_mps_ops (#159732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159732
Approved by: https://github.com/dcci
ghstack dependencies: #159731
2025-08-03 20:53:58 +00:00
fecdebe385 [CI][MPS] Fix compile benchmark correctness (#159731)
By passing `fullgraph=True` attribute and increasing cache size limit to 2**16

Otherwise, the compiler might decide to fall back to eager to avoid recompilations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159731
Approved by: https://github.com/dcci
2025-08-03 20:53:50 +00:00
e136a9175b [BE] Fix dev warning in Dependencies.cmake (#159702)
Namely
```
CMake Warning (dev) in cmake/Dependencies.cmake:
  A logical block opening on the line

    /Users/nshulga/git/pytorch/pytorch/cmake/Dependencies.cmake:261 (if)

  closes on the line

    /Users/nshulga/git/pytorch/pytorch/cmake/Dependencies.cmake:263 (endif)

  with mis-matching arguments.
```

Introduced by https://github.com/pytorch/pytorch/pull/143846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159702
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-08-03 18:45:07 +00:00
9a680e14b7 [bucketing] Reduce CPU overhead for reduce_scatter_merge_fn_to_trace (#159723)
The previous implementation was creating `n_gpu * n_tensors` intermediate tensors, which added a lot of CPU overhead, especially given that inductor was generating a number of individual tensor copy kernels for `torch.cat`.

This PR changes the implementation so that only `n_tensors` are created, making the CPU overhead proportional to the number of tensors being bucketed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159723
Approved by: https://github.com/IvanKobzarev
2025-08-03 09:16:55 +00:00
805a102beb Revert "[dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534)"
This reverts commit 1616777cd2a3170ff76afa3e7860b0969420c445.

Reverted https://github.com/pytorch/pytorch/pull/159534 on behalf of https://github.com/malfet due to Broke some inductor test and lint among other things, see 9c18901bfd/1 ([comment](https://github.com/pytorch/pytorch/pull/159534#issuecomment-3146983186))
2025-08-03 04:58:32 +00:00
6e8d705a22 Revert "[dynamo] Be consistent with storing func source for UserMethodVariable (#159696)"
This reverts commit be71000ff5292293d1976f313218e2df4d5046d3.

Reverted https://github.com/pytorch/pytorch/pull/159696 on behalf of https://github.com/malfet due to Broke some inductor test and lint among other things, see 9c18901bfd/1 ([comment](https://github.com/pytorch/pytorch/pull/159534#issuecomment-3146983186))
2025-08-03 04:58:32 +00:00
9c18901bfd [MTIA Aten Backend] Migrate all.out (#159539)
Differential Revision: [D79317033](https://our.internmc.facebook.com/intern/diff/D79317033/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159539
Approved by: https://github.com/malfet
ghstack dependencies: #159098
2025-08-03 02:08:35 +00:00
a29ed5e1ac Add torch compile force disable caches alias (#158072)
A bunch of people keep thinking the current alias only disables the inductor cache because it has the name inductor in it. Let's globalize the name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072
Approved by: https://github.com/ezyang
2025-08-02 23:23:17 +00:00
d2792f51b2 [bucketing] Use max of input/output size for bucketing (#159717)
The output of a reduce_scatter is n_gpu times smaller than its input, while the output of an all_gather is n_gpu times larger than its input. This means that in the current heuristic for bucketing reduce_scatter, we would need to use a bucket size which is n_gpu times larger than the bucket for all_gather, making it gpu-dependent and less intuitive. This PR proposes to instead use the max of the input and output sizes, so that one can use the same bucket_size value for both passes.
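
The arithmetic behind this can be checked with a small sketch. The function names below are hypothetical and only model byte counts, not the bucketing pass itself.

```python
def bucketing_size(input_bytes: int, output_bytes: int) -> int:
    """Size used to decide when a bucket is full: the larger of input and output."""
    return max(input_bytes, output_bytes)


def all_gather_sizes(shard_bytes: int, n_gpu: int):
    # all_gather: each rank contributes a shard; the output is n_gpu times larger.
    return shard_bytes, shard_bytes * n_gpu


def reduce_scatter_sizes(full_bytes: int, n_gpu: int):
    # reduce_scatter: the input is the full tensor; the output is n_gpu times smaller.
    return full_bytes, full_bytes // n_gpu
```

For the same underlying full tensor, both collectives now report the same size, so a single `bucket_size` threshold behaves the same way for either pass regardless of `n_gpu`.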

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159717
Approved by: https://github.com/wconstab
2025-08-02 22:42:22 +00:00
be71000ff5 [dynamo] Be consistent with storing func source for UserMethodVariable (#159696)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159696
Approved by: https://github.com/jansel
ghstack dependencies: #159186, #159534
2025-08-02 21:40:38 +00:00
3f86076775 gc before warming up benchmarking (#159670)
#158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark is leaving around before the benchmark starts - so GC before warming up the model.
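
A minimal sketch of the pattern, using only the standard library. The `benchmark` helper is hypothetical and is not the benchmark harness touched by this PR.

```python
import gc
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 3, iters: int = 10) -> float:
    """Return the mean wall-clock seconds per call of fn."""
    # Collect garbage left over from setup before warm-up, so stale objects
    # do not hold memory while the measured code runs.
    gc.collect()
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters
```
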

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670
Approved by: https://github.com/oulgen
2025-08-02 19:37:24 +00:00
1616777cd2 [dynamo][guards] Make class members go through obj.__class__.__dict__ (#159534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159534
Approved by: https://github.com/jansel
ghstack dependencies: #159186
2025-08-02 18:04:35 +00:00
38895c0ac2 Update RuntimeError message in is_nonzero(input) method from bool to Boolean (#159712)
RuntimeError message updated in is_nonzero(input) method from bool to Boolean.

**Case 1:**
t = torch.tensor([])
torch.is_nonzero(t)

**Case 2:**
t = torch.tensor([1,2])
torch.is_nonzero(t)

**Existing Error message in documentation:**

for case 1: RuntimeError: bool value of Tensor with no values is ambiguous
for case 2: RuntimeError: bool value of Tensor with more than one value is ambiguous

**Proposed Error message in documentation:**

for case 1: RuntimeError: Boolean value of Tensor with no values is ambiguous
for case 2: RuntimeError: Boolean value of Tensor with more than one value is ambiguous

Fixes #159710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159712
Approved by: https://github.com/malfet
2025-08-02 17:23:45 +00:00
310f901a71 Stop parsing command line arguments every time common_utils is imported. (#156703)
Last PR in the series to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs:

https://github.com/pytorch/pytorch/pull/154612
https://github.com/pytorch/pytorch/pull/154628
https://github.com/pytorch/pytorch/pull/154715
https://github.com/pytorch/pytorch/pull/154716
https://github.com/pytorch/pytorch/pull/154725
https://github.com/pytorch/pytorch/pull/154728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156703
Approved by: https://github.com/clee2000
2025-08-02 16:38:54 +00:00
e11b1cd97e [ROCm] fix nightly wheel due to rocBLAS environment variable (#159570)
Fixes #159070

The TunableOp failure is due to missing rocBLAS files in our manywheels packaging. This bug has been present since June 7-8 time frame. It was caused by a typo in the rocBLAS environment variable that stores the list of files. It was introduced in this PR: https://github.com/pytorch/pytorch/pull/155388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159570
Approved by: https://github.com/malfet
2025-08-02 06:54:43 +00:00
b599d91738 Log autotune choices and benchmark result to scuba/chrome trace (#159496)
Summary:
Report the kernel choices and benchmark data to better understand how kernels are selected and the performance gap between the best kernel (likely a CUDA kernel) and Triton kernels.

**Example**

Event: mm_template_autotuning
Column: autotune_choices

```json
{
  "num_choices": 52,
  "num_triton_choices": 19,
  "best_kernel": "cutlass_f6c25cf2",
  "best_kernel_desc": "cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8",
  "best_time": 0.6283040046691895,
  "best_triton_pos": 26,
  "best_triton_time": 0.6832960247993469,
  "best_triton_kernel": "triton_mm_17",
  "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"
}
```
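
As a sketch of how such a record might be consumed, the snippet below computes the gap between the best overall kernel and the best Triton kernel. The numbers are copied from the example payload above (the long description fields are omitted), and `triton_gap` is a hypothetical helper.

```python
import json

# Values copied from the example payload; descriptions omitted for brevity.
payload = json.loads("""
{
  "num_choices": 52,
  "num_triton_choices": 19,
  "best_kernel": "cutlass_f6c25cf2",
  "best_time": 0.6283040046691895,
  "best_triton_pos": 26,
  "best_triton_time": 0.6832960247993469,
  "best_triton_kernel": "triton_mm_17"
}
""")


def triton_gap(stats: dict) -> float:
    """Relative slowdown of the best Triton kernel versus the overall best kernel."""
    return stats["best_triton_time"] / stats["best_time"] - 1.0


print(f"{payload['best_triton_kernel']} is {triton_gap(payload):.1%} slower "
      f"than {payload['best_kernel']}")
```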

Test Plan:
```
TORCHINDUCTOR_MAX_AUTOTUNE_REPORT_CHOICES_STATS=1 buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
```

Rollback Plan:

Differential Revision: D79235037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159496
Approved by: https://github.com/masnesral
2025-08-02 05:34:17 +00:00
fd6a6658c3 Enable _int_mm on Intel GPU (#157769)
# Motivation

This PR enables `_int_mm` on Intel GPU. `_int_mm` is used by int8 quantization in torchao.

# Model Test Result:
We ran meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8-dynamic-quantization. The model configuration is as follows:
Precision : torch.bfloat16
quantization configuration : Int8DynamicActivationInt8WeightConfig
dataset : wikitext

Result:
The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157769
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2025-08-02 05:16:01 +00:00
04973496a8 [audio hash update] update the pinned audio hash (#159611)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159611
Approved by: https://github.com/pytorchbot
2025-08-02 05:15:47 +00:00
1548b011ea Fix rand_like decomposition to preserve strides (#159294)
Summary: Like https://github.com/pytorch/pytorch/pull/158898, the rand_like variants are not preserving strides. Followed the pattern established in https://github.com/pytorch/pytorch/pull/158898.

Test Plan: New unit test (fails before this PR; but fixed after)

Differential Revision: [D79472604](https://our.internmc.facebook.com/intern/diff/D79472604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159294
Approved by: https://github.com/eellison
2025-08-02 03:54:41 +00:00
e57a92734d [export] Fix nn_module_stack of assert_tensor_metadata nodes (#159625)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159625
Approved by: https://github.com/yushangdi
2025-08-02 02:52:42 +00:00
79ff3b320b Back out "[ez] get rid of unused var" (#159677)
Summary: turns out i added this to reduce the frequency we'd call try_update_max_size_at_index when a new maximum is found before the replan is called. oops.

Test Plan:
backout

Rollback Plan:

Differential Revision: D79474114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159677
Approved by: https://github.com/georgiaphillips
2025-08-02 01:50:16 +00:00
426f249f20 Fix launch grid calculation (#159497)
Summary:

The launch grid calculation code uses a Python trick to achieve CeilDiv() through negative integer division with FloorDiv(). This is language-dependent behaviour that doesn't apply to all languages.

In the FXIR backend we undo this trick and replace the expression with a CeilDiv() operation, so the computation is correct regardless of the language used. The original computation is not changed directly, as doing so leads to a performance degradation.
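
The language dependence can be demonstrated directly. The sketch below uses hypothetical helper names; `trunc_div` emulates the truncate-toward-zero integer division of C or C++.

```python
def ceil_div_python(a: int, b: int) -> int:
    """Ceiling division via the negated-floor trick; relies on Python's floor semantics."""
    return -(-a // b)


def trunc_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as in C or C++."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def ceil_div_with_trunc(a: int, b: int) -> int:
    """The same trick evaluated with truncating division: no longer a ceiling."""
    return -trunc_div(-a, b)


def ceil_div_portable(a: int, b: int) -> int:
    """Explicit ceiling division for positive operands; valid under either semantics."""
    return (a + b - 1) // b
```

For 10 elements and a block size of 4, the Python trick and the explicit form both give 3 blocks, while the truncating variant gives 2 and would leave the last elements unprocessed.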

Test Plan:
CI

Rollback Plan:

Differential Revision: D79275534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159497
Approved by: https://github.com/blaine-rister
2025-08-02 01:12:58 +00:00
d33a484763 Use boxed_nop_preserve_node_meta for aot_export_joint_with_descriptors (#159545)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159545
Approved by: https://github.com/xmfan, https://github.com/wconstab
ghstack dependencies: #159336, #159337
2025-08-02 00:33:41 +00:00
a81ffbc5f5 improve shape checks for grouped_mm (#159666)
Check that contraction dimension matches between tensors if it's known, and do device-side checks for correct offsets
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159666
Approved by: https://github.com/danielvegamyhre, https://github.com/eqy
2025-08-02 00:12:25 +00:00
465fe4d9f7 Enable sample nightly PT2 benchmark on B200 (#158011)
Per the discussion with @nWEIdia, this resumes the work on https://github.com/pytorch/pytorch/pull/157870 to enable PT2 benchmark on B200

### Testing

https://github.com/pytorch/pytorch/actions/runs/16615101382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158011
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2025-08-01 23:47:44 +00:00
9477af1063 fix compilation on cuda < 12.3 (#159657)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159657
Approved by: https://github.com/kwen2501
2025-08-01 23:40:55 +00:00
dcc36e38bb [Graph Breaks] Remove unsupported Additional Info field (#159658)
A race condition while landing PR #158800 caused this field to be added after it had been deprecated, so remove it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159658
Approved by: https://github.com/williamwen42
2025-08-01 23:25:50 +00:00
efd78584a8 [EZ] Add linux-aarch64.yml workflow to the viable/strict blocking set (#159668)
Since it's required to be run on every PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159668
Approved by: https://github.com/malfet
2025-08-01 23:19:08 +00:00
135762ea20 Unpin helion (#159579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159579
Approved by: https://github.com/jansel
2025-08-01 23:08:06 +00:00
e2ee9cfaa2 [NativeRT] Turn on enableStaticCPUKernels by default (#159422)
Summary: As title.

Test Plan:
Needs manual testing on production models.

Rollback Plan:

Differential Revision: D78747742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159422
Approved by: https://github.com/dolpm
2025-08-01 22:27:07 +00:00
06d28de17a Update CK Kernel generation and update ck submodule (#157964)
Changes required to reduce the number of CK kernels generated. This change depends on https://github.com/ROCm/composable_kernel/pull/2480 being merged first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157964
Approved by: https://github.com/842974287
2025-08-01 22:24:27 +00:00
df9720b8b5 [MTIA Aten Backend] Migrate all foreach ops (#159098)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate all foreach operators to in-tree, including:
  - _foreach_abs
  - _foreach_abs_
  - _foreach_add.List
  - _foreach_add_.List
  - _foreach_add_.Scalar
  - _foreach_add_.Tensor
  - _foreach_addcmul.Scalar
  - _foreach_addcmul_.Scalar
  - _foreach_copy
  - _foreach_copy_
  - _foreach_mul.List
  - _foreach_mul_.List
  - _foreach_mul_.Scalar
  - _foreach_mul.Tensor
  - _foreach_mul_.Tensor
  - _foreach_norm.Scalar
  - _foreach_sqrt_

Differential Revision: [D78913847](https://our.internmc.facebook.com/intern/diff/D78913847/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159098
Approved by: https://github.com/malfet
2025-08-01 22:10:12 +00:00
85e74d5ace [inductor] Add logging for distributed collective ops for multi‑rank diagnostics (#159190)
This change introduces structured logging of the collective communication schedule, enabling downstream tools (e.g. TLParse) to ingest and analyze per‑rank collective‐order information for multi‑rank jobs.

- Iterates over scheduler.nodes, filters for _CollectiveKernel nodes
- Extracts each op’s python_kernel_name
- Emits a structured JSON payload under the inductor_collective_schedule artifact name
- Dumps the full schedule list to collective_schedule.json via the PyTorch trace‑structured artifact
- Added comprehensive unit tests for collective schedule tracing: Created test_collective_schedule_empty() and test_collective_schedule_real() tests to verify structured trace logging works correctly for both empty collective schedules and real collective operations (like all_reduce and wait_tensor from _c10d_functional ops).
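
The filtering and dumping steps listed above can be sketched roughly as follows. This is not the Inductor implementation: `FakeKernel`, `FakeCollectiveKernel` and `FakeSchedulerNode` are hypothetical stand-ins for the real scheduler and IR classes, the kernel names are illustrative strings, and the JSON is returned as a string rather than emitted through the structured trace API.

```python
import json
from dataclasses import dataclass


@dataclass
class FakeKernel:
    # Hypothetical stand-in for an Inductor IR node.
    python_kernel_name: str


class FakeCollectiveKernel(FakeKernel):
    # Hypothetical stand-in for _CollectiveKernel.
    pass


@dataclass
class FakeSchedulerNode:
    # Hypothetical stand-in for an entry of scheduler.nodes.
    node: FakeKernel


def collective_schedule(nodes):
    # Keep only collective kernels, preserving scheduler order.
    return [
        n.node.python_kernel_name
        for n in nodes
        if isinstance(n.node, FakeCollectiveKernel)
    ]


def dump_collective_schedule(nodes) -> str:
    # Serialise the ordered list so a downstream tool can compare ranks.
    return json.dumps(collective_schedule(nodes))


nodes = [
    FakeSchedulerNode(FakeKernel("aten.mm")),
    FakeSchedulerNode(FakeCollectiveKernel("all_reduce")),
    FakeSchedulerNode(FakeCollectiveKernel("wait_tensor")),
]
payload = dump_collective_schedule(nodes)
```

Because the list keeps scheduler order, comparing the payloads from two ranks should be enough to spot a mismatch in collective ordering.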

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159190
Approved by: https://github.com/yushangdi, https://github.com/xmfan
2025-08-01 21:51:42 +00:00
0450f05658 Output tensor meta data for FX graph node (#159311)
The FX graph segment in CompiledFxGraph does not include tensor metadata such as tensor shape, stride, data type, and device. The AI system co-design team requested that this information be included in the FX graph segment so they can use it to project performance on different hardware.
This DIFF modifies Graph::Node::format_node to include tensor metadata.
Before this DIFF, the triton kernel FX graph segment looks like the following:
```
# %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm]
# %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1]
# %sin : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {})
# %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {})
# %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 1111), kwargs = {})
# %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {})
# %cos : cuda:0"[num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%add,), kwargs = {})
# return %cos
```
After this DIFF:
```
# %mm : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=mm]
# %arg2_1 : Tensor "f32[4, 4][4, 1]cuda:0" = PlaceHolder[target=arg2_1]
# %sin : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%mm,), kwargs = {})
# %permute_1 : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%sin, [1, 0]), kwargs = {})
# %mul : Tensor "f32[4, 4][4, 1]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg2_1, 1111), kwargs = {})
# %add : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %mul), kwargs = {})
# %cos : Tensor "f32[4, 4][1, 4]cuda:0"[num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%add,), kwargs = {})
# return %cos
```
If format_node cannot be changed, I can copy the code to caffe2/torch/_inductor/utils.py.

Differential Revision: D77973076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159311
Approved by: https://github.com/angelayi
2025-08-01 21:40:29 +00:00
595a65f5c2 [dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/script_object.py (#159343)
Fixes part of #147913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159343
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-08-01 21:30:41 +00:00
8c6c2e40eb Edit a test case to detect potential bugs in all-gathering noncontiguous inputs in the Gloo backend (#159542)
As suggested in the pull request #158903 by @H-huang, this pull request edits a test case to detect potential bugs in all-gathering noncontiguous inputs in the Gloo backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159542
Approved by: https://github.com/d4l3k, https://github.com/H-Huang
2025-08-01 21:20:25 +00:00
32840d19f9 [cutlass backend] skip stream k if shape is dynamic (#159442)
Differential Revision: [D79229210](https://our.internmc.facebook.com/intern/diff/D79229210/)

Motivation: the workspace size is hard to determine and varies with shape. I observed that the workspace can sometimes increase even when the shape gets smaller, so it is hard to upper-bound.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159442
Approved by: https://github.com/ColinPeppler
2025-08-01 20:42:24 +00:00
2040f00112 [BE][Easy] respect os.environ in subprocess calls in tools/nightly.py (#159572)
Respect parent shell's envvars, such as `UV_INDEX_STRATEGY`, `http{,s}_proxy`, etc.
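
A minimal sketch of the general pattern, assuming the fix amounts to merging `os.environ` into the environment passed to each subprocess call. `run_with_parent_env` is a hypothetical helper, not a function from tools/nightly.py.

```python
import os
import subprocess
import sys


def run_with_parent_env(cmd, extra_env=None):
    # Start from the parent's environment so variables such as proxy settings
    # survive, then layer any call-specific overrides on top.
    env = {**os.environ, **(extra_env or {})}
    result = subprocess.run(
        cmd, env=env, check=True, capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # The child process sees the inherited PATH and the extra variable.
    child_code = "import os; print(os.environ.get('DEMO_EXTRA'))"
    print(run_with_parent_env([sys.executable, "-c", child_code], {"DEMO_EXTRA": "1"}))
```

Passing `env={"DEMO_EXTRA": "1"}` on its own would replace the whole environment, which is how variables like `http_proxy` end up being dropped.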

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159572
Approved by: https://github.com/Skylion007
2025-08-01 20:40:31 +00:00
c137f9da0b [Dynamo][Better Engineering] Add type coverage to dynamo/compiled_autograd.py (#159518)
As part of the better engineering effort, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to `torch/_dynamo/compiled_autograd.py`

Running
```
mypy torch/_dynamo/compiled_autograd.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  425 | 1553 | 27.37% | 17 | 62 | 27.42% |
| This PR | 1623 | 1623 | 100.00% | 62 | 62 | 100.00% |
| Delta    | +1198| +0 | +72.63% | +45 | 0 | +72.58% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159518
Approved by: https://github.com/xmfan
2025-08-01 20:24:58 +00:00
5e8b95605f [PP] Support OVERLAP_F_B computation type (#158978)
Some changes to validation code and visualizer to support a new computation type that will be used in DualPipeV (see https://github.com/pytorch/pytorch/pull/159591)

The IR looks like:

```
[0F0, 0F1, 0F2, 0F3, 0F4, 0F5, 0F6, 7F0, 7I0, 7W0, 7F1, 7I1, 7W1, 7F2, 7I2, 7W2, 7F3, (0F7;7B3)OVERLAP_F_B, (7F4;0B0)OVERLAP_F_B, (0F8;7B4)OVERLAP_F_B, (7F5;0B1)OVERLAP_F_B, (0F9;7B5)OVERLAP_F_B, (7F6;0B2)OVERLAP_F_B, 7B6, (7F7;0B3)OVERLAP_F_B, 7B7, (7F8;0B4)OVERLAP_F_B, 7B8, (7F9;0B5)OVERLAP_F_B, 7B9, 0I6, 0W6, 0I7, 0W7, 0I8, 0W8, 0I9, 0W9]
[1F0, 1F1, 1F2, 1F3, 1F4, 6F0, 1F5, 6F1, 6I0, 6W0, 6F2, 6I1, 6W1, 6F3, (1F6;6B2)OVERLAP_F_B, (6F4;1B0)OVERLAP_F_B, (1F7;6B3)OVERLAP_F_B, (6F5;1B1)OVERLAP_F_B, (1F8;6B4)OVERLAP_F_B, (6F6;1B2)OVERLAP_F_B, (1F9;6B5)OVERLAP_F_B, (6F7;1B3)OVERLAP_F_B, 6B6, (6F8;1B4)OVERLAP_F_B, 6B7, (6F9;1B5)OVERLAP_F_B, 6B8, 1B6, 6I9, 1I7, 6W9, 1I8, 1W7, 1I9, 1W8, 1W9]
[2F0, 2F1, 2F2, 5F0, 2F3, 5F1, 2F4, 5F2, 5I0, 5W0, 5F3, (2F5;5B1)OVERLAP_F_B, (5F4;2B0)OVERLAP_F_B, (2F6;5B2)OVERLAP_F_B, (5F5;2B1)OVERLAP_F_B, (2F7;5B3)OVERLAP_F_B, (5F6;2B2)OVERLAP_F_B, (2F8;5B4)OVERLAP_F_B, (5F7;2B3)OVERLAP_F_B, (2F9;5B5)OVERLAP_F_B, (5F8;2B4)OVERLAP_F_B, 5B6, (5F9;2B5)OVERLAP_F_B, 5B7, 2B6, 5B8, 2I7, 5I9, 2I8, 2W7, 2I9, 5W9, 2W8, 2W9]
[3F0, 4F0, 3F1, 4F1, 3F2, 4F2, 3F3, 4F3, 3F4, 4B0, (4F4;3B0)OVERLAP_F_B, (3F5;4B1)OVERLAP_F_B, (4F5;3B1)OVERLAP_F_B, (3F6;4B2)OVERLAP_F_B, (4F6;3B2)OVERLAP_F_B, (3F7;4B3)OVERLAP_F_B, (4F7;3B3)OVERLAP_F_B, (3F8;4B4)OVERLAP_F_B, (4F8;3B4)OVERLAP_F_B, (3F9;4B5)OVERLAP_F_B, (4F9;3B5)OVERLAP_F_B, 4B6, 3B6, 4B7, 3B7, 4I8, 3I8, 4I9, 3I9, 4W8, 3W8, 4W9, 3W9]
```

In this PR, the schedule execution will just treat the OVERLAP_F_B as two separate operations of F and B (so there is no actual overlap). The next step is to allow users to create a custom function to plug in what this operation does.
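
To illustrate the "two separate operations" behaviour, here is a small sketch that expands the textual form shown in the IR above. It works only on the strings and is an assumption about the notation, not the schedule runtime's code: `expand_overlap` and the regular expression are hypothetical.

```python
import re

# Matches entries such as "(0F7;7B3)OVERLAP_F_B" from the IR dump above.
_OVERLAP_RE = re.compile(r"^\((?P<fwd>[^;()]+);(?P<bwd>[^;()]+)\)OVERLAP_F_B$")


def expand_overlap(actions):
    # Replace each overlapped entry with its forward action followed by its
    # backward action; every other action passes through unchanged.
    expanded = []
    for action in actions:
        match = _OVERLAP_RE.match(action)
        if match:
            expanded.extend([match.group("fwd"), match.group("bwd")])
        else:
            expanded.append(action)
    return expanded


rank0_tail = ["7F3", "(0F7;7B3)OVERLAP_F_B", "(7F4;0B0)OVERLAP_F_B", "7B6"]
sequential = expand_overlap(rank0_tail)
```

A user-supplied function for the overlapped case, as described as the next step, would presumably replace the plain expansion in the matching branch.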

814629043a/torch/distributed/pipelining/schedules.py (L1205-L1216)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158978
Approved by: https://github.com/wconstab
2025-08-01 20:22:30 +00:00
8ea86a6e31 Actually test STD_TORCH_CHECK, add testfile to CMake (#159603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159603
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-08-01 19:53:41 +00:00
acad808545 Revert "[inductor] consolidate common GEMM triton param retrieval (#159383)"
This reverts commit e7cc42df58a86bee05944f6e80c535aa1d099443.

Reverted https://github.com/pytorch/pytorch/pull/159383 on behalf of https://github.com/jataylo due to sorry but rocm CI is broken due to this PR ([comment](https://github.com/pytorch/pytorch/pull/159383#issuecomment-3145604831))
2025-08-01 19:49:21 +00:00
c687446374 Revert "Fix rand_like decomposition to preserve strides (#159294)"
This reverts commit 2c46922ce4b33c39b1c48c302604805510a3f889.

Reverted https://github.com/pytorch/pytorch/pull/159294 on behalf of https://github.com/yangw-dev due to breaking internal test ([comment](https://github.com/pytorch/pytorch/pull/159294#issuecomment-3145541845))
2025-08-01 19:19:51 +00:00
dd22ba09b4 [C10D] Document barrier interaction with device_id (#159389)
Addresses #159262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159389
Approved by: https://github.com/malfet, https://github.com/H-Huang, https://github.com/kwen2501, https://github.com/fduwjj
2025-08-01 18:12:21 +00:00
c0e0126399 Remove unused input parameter in ExpandableSegment (#159356)
# Motivation
While refactoring the caching allocator, I noticed that the `ExpandableSegment` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.

# Additional Context
I noticed that `ExpandableSegment` is defined in a cpp file, so it should be safe to make this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159356
Approved by: https://github.com/ngimel, https://github.com/albanD
ghstack dependencies: #159159
2025-08-01 17:47:51 +00:00
e4b123b5e4 Revert direct updates (#159654)
reverts:
```

commit 5711a8f06948eeee56ed5f53f171fa519f78491c (tag: trunk/5711a8f06948eeee56ed5f53f171fa519f78491c, origin/main, main)
Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com>
Date:   Fri Aug 1 09:32:52 2025 -0700

    Update test_utils.py

commit b4b71d011ed07a41c2086ff0dec2988a63662877 (tag: trunk/b4b71d011ed07a41c2086ff0dec2988a63662877)
Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com>
Date:   Fri Aug 1 09:27:54 2025 -0700

    Update utils.py

commit 52376b9b6fbf9fe24f5d82038dc520f0c64b6f8d (tag: trunk/52376b9b6fbf9fe24f5d82038dc520f0c64b6f8d)
Author: Jovian Anthony Jaison <38627145+jovianjaison@users.noreply.github.com>
Date:   Fri Aug 1 09:26:05 2025 -0700
```

(commits pushed directly to main by mistake)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159654
Approved by: https://github.com/atalman
2025-08-01 16:54:51 +00:00
5711a8f069 Update test_utils.py 2025-08-01 09:32:52 -07:00
b4b71d011e Update utils.py 2025-08-01 09:27:54 -07:00
52376b9b6f Update convert_frame.py 2025-08-01 09:26:05 -07:00
1371a98b0e Migrate ScalarType to headeronly (#159416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159416
Approved by: https://github.com/albanD
ghstack dependencies: #159415, #159411
2025-08-01 16:07:01 +00:00
2330 changed files with 123172 additions and 93950 deletions

15
.bc-linter.yml Normal file
View File

@ -0,0 +1,15 @@
version: 1
paths:
include:
- "**/*.py"
exclude:
- ".*"
- ".*/**"
- "**/.*/**"
- "**/.*"
- "**/_*/**"
- "**/_*.py"
- "**/test/**"
- "**/benchmarks/**"
- "**/test_*.py"
- "**/*_test.py"

View File

@ -3,8 +3,20 @@ set -eux -o pipefail
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
if [[ "$GPU_ARCH_VERSION" == *"12.9"* ]]; then
# Set CUDA architecture lists to match x86 build_cuda.sh
if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;12.0"
elif [[ "$GPU_ARCH_VERSION" == *"13.0"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;11.0;12.0+PTX"
fi
# Compress the fatbin with -compress-mode=size for CUDA 13
if [[ "$DESIRED_CUDA" == *"13"* ]]; then
export TORCH_NVCC_FLAGS="-compress-mode=size"
# Bundle ptxas into the cu13 wheel, see https://github.com/pytorch/pytorch/issues/163801
export BUILD_BUNDLE_PTXAS=1
fi
SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
@ -18,7 +30,7 @@ cd /
# on the mounted pytorch repo
git config --global --add safe.directory /pytorch
pip install -r /pytorch/requirements.txt
pip install auditwheel==6.2.0
pip install auditwheel==6.2.0 wheel
if [ "$DESIRED_CUDA" = "cpu" ]; then
echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
@ -26,6 +38,16 @@ if [ "$DESIRED_CUDA" = "cpu" ]; then
else
echo "BASE_CUDA_VERSION is set to: $DESIRED_CUDA"
export USE_SYSTEM_NCCL=1
# Check if we should use NVIDIA libs from PyPI (similar to x86 build_cuda.sh logic)
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling CUDA libraries with wheel for aarch64."
else
echo "Using nvidia libs from pypi for aarch64."
echo "Updated PYTORCH_EXTRA_INSTALL_REQUIREMENTS for aarch64: $PYTORCH_EXTRA_INSTALL_REQUIREMENTS"
export USE_NVIDIA_PYPI_LIBS=1
fi
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
fi

View File

@ -69,61 +69,186 @@ def replace_tag(filename) -> None:
f.writelines(lines)
def patch_library_rpath(
folder: str,
lib_name: str,
use_nvidia_pypi_libs: bool = False,
desired_cuda: str = "",
) -> None:
"""Apply patchelf to set RPATH for a library in torch/lib"""
lib_path = f"{folder}/tmp/torch/lib/{lib_name}"
if use_nvidia_pypi_libs:
# For PyPI NVIDIA libraries, construct CUDA RPATH
cuda_rpaths = [
"$ORIGIN/../../nvidia/cudnn/lib",
"$ORIGIN/../../nvidia/nvshmem/lib",
"$ORIGIN/../../nvidia/nccl/lib",
"$ORIGIN/../../nvidia/cusparselt/lib",
]
if "130" in desired_cuda:
cuda_rpaths.append("$ORIGIN/../../nvidia/cu13/lib")
else:
cuda_rpaths.extend(
[
"$ORIGIN/../../nvidia/cublas/lib",
"$ORIGIN/../../nvidia/cuda_cupti/lib",
"$ORIGIN/../../nvidia/cuda_nvrtc/lib",
"$ORIGIN/../../nvidia/cuda_runtime/lib",
"$ORIGIN/../../nvidia/cufft/lib",
"$ORIGIN/../../nvidia/curand/lib",
"$ORIGIN/../../nvidia/cusolver/lib",
"$ORIGIN/../../nvidia/cusparse/lib",
"$ORIGIN/../../nvidia/nvtx/lib",
"$ORIGIN/../../nvidia/cufile/lib",
]
)
# Add $ORIGIN for local torch libs
rpath = ":".join(cuda_rpaths) + ":$ORIGIN"
else:
# For bundled libraries, just use $ORIGIN
rpath = "$ORIGIN"
if os.path.exists(lib_path):
os.system(
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '{rpath}' --force-rpath {lib_name}"
)
def copy_and_patch_library(
src_path: str,
folder: str,
use_nvidia_pypi_libs: bool = False,
desired_cuda: str = "",
) -> None:
"""Copy a library to torch/lib and patch its RPATH"""
if os.path.exists(src_path):
lib_name = os.path.basename(src_path)
shutil.copy2(src_path, f"{folder}/tmp/torch/lib/{lib_name}")
patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)
def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"""
Package the cuda wheel libraries
"""
folder = os.path.dirname(wheel_path)
wheelname = os.path.basename(wheel_path)
os.mkdir(f"{folder}/tmp")
os.system(f"unzip {wheel_path} -d {folder}/tmp")
libs_to_copy = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
"/usr/local/cuda/lib64/libcudart.so.12",
"/usr/local/cuda/lib64/libcufft.so.11",
"/usr/local/cuda/lib64/libcusparse.so.12",
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnccl.so.2",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
"/usr/local/cuda/lib64/libcudnn_ops.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
# Delete original wheel since it will be repackaged
os.system(f"rm {wheel_path}")
if "129" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.9",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
# Check if we should use PyPI NVIDIA libraries or bundle system libraries
use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"
if use_nvidia_pypi_libs:
print("Using nvidia libs from pypi - skipping CUDA library bundling")
# For PyPI approach, we don't bundle CUDA libraries - they come from PyPI packages
# We only need to bundle non-NVIDIA libraries
minimal_libs_to_copy = [
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
# Copy libraries to unzipped_folder/a/lib
for lib_path in libs_to_copy:
lib_name = os.path.basename(lib_path)
shutil.copy2(lib_path, f"{folder}/tmp/torch/lib/{lib_name}")
os.system(
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '$ORIGIN' --force-rpath {folder}/tmp/torch/lib/{lib_name}"
)
# Copy minimal libraries to unzipped_folder/torch/lib
for lib_path in minimal_libs_to_copy:
copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)
# Patch torch libraries used for searching libraries
torch_libs_to_patch = [
"libtorch.so",
"libtorch_cpu.so",
"libtorch_cuda.so",
"libtorch_cuda_linalg.so",
"libtorch_global_deps.so",
"libtorch_python.so",
"libtorch_nvshmem.so",
"libc10.so",
"libc10_cuda.so",
"libcaffe2_nvrtc.so",
"libshm.so",
]
for lib_name in torch_libs_to_patch:
patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)
else:
print("Bundling CUDA libraries with wheel")
# Original logic for bundling system CUDA libraries
# Common libraries for all CUDA versions
common_libs = [
# Non-NVIDIA system libraries
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
# Common CUDA libraries (same for all versions)
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnccl.so.2",
"/usr/local/cuda/lib64/libnvshmem_host.so.3",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
"/usr/local/cuda/lib64/libcudnn_ops.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
"/usr/local/cuda/lib64/libcusparse.so.12",
]
# CUDA version-specific libraries
if "13" in desired_cuda:
minor_version = desired_cuda[-1]
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.13",
"/usr/local/cuda/lib64/libcublas.so.13",
"/usr/local/cuda/lib64/libcublasLt.so.13",
"/usr/local/cuda/lib64/libcudart.so.13",
"/usr/local/cuda/lib64/libcufft.so.12",
"/usr/local/cuda/lib64/libcusolver.so.12",
"/usr/local/cuda/lib64/libnvJitLink.so.13",
"/usr/local/cuda/lib64/libnvrtc.so.13",
f"/usr/local/cuda/lib64/libnvrtc-builtins.so.13.{minor_version}",
]
elif "12" in desired_cuda:
# Get the last character for libnvrtc-builtins version (e.g., "129" -> "9")
minor_version = desired_cuda[-1]
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
"/usr/local/cuda/lib64/libcudart.so.12",
"/usr/local/cuda/lib64/libcufft.so.11",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
f"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.{minor_version}",
]
else:
raise ValueError(f"Unsupported CUDA version: {desired_cuda}.")
# Combine all libraries
libs_to_copy = common_libs + version_specific_libs
# Copy libraries to unzipped_folder/torch/lib
for lib_path in libs_to_copy:
copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)
# Make sure the wheel is tagged with manylinux_2_28
for f in os.scandir(f"{folder}/tmp/"):
@ -131,14 +256,8 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:
replace_tag(f"{f.path}/WHEEL")
break
os.mkdir(f"{folder}/cuda_wheel")
os.system(f"cd {folder}/tmp/; zip -r {folder}/cuda_wheel/{wheelname} *")
shutil.move(
f"{folder}/cuda_wheel/{wheelname}",
f"{folder}/{wheelname}",
copy_function=shutil.copy2,
)
os.system(f"rm -rf {folder}/tmp/ {folder}/cuda_wheel/")
os.system(f"wheel pack {folder}/tmp/ -d {folder}")
os.system(f"rm -rf {folder}/tmp/")
def complete_wheel(folder: str) -> str:
@ -161,14 +280,7 @@ def complete_wheel(folder: str) -> str:
f"/{folder}/dist/{repaired_wheel_name}",
)
else:
repaired_wheel_name = wheel_name.replace(
"linux_aarch64", "manylinux_2_28_aarch64"
)
print(f"Renaming {wheel_name} wheel to {repaired_wheel_name}")
os.rename(
f"/{folder}/dist/{wheel_name}",
f"/{folder}/dist/{repaired_wheel_name}",
)
repaired_wheel_name = list_dir(f"/{folder}/dist")[0]
print(f"Copying {repaired_wheel_name} to artifacts")
shutil.copy2(
@ -208,7 +320,17 @@ if __name__ == "__main__":
build_vars = "CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
# MAX_JOB=5 is not required for CPU backend (see commit 465d98b)
if enable_cuda:
build_vars = "MAX_JOBS=5 " + build_vars
build_vars += "MAX_JOBS=5 "
# Handle PyPI NVIDIA libraries vs bundled libraries
use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"
if use_nvidia_pypi_libs:
print("Configuring build for PyPI NVIDIA libraries")
# Configure for dynamic linking (matching x86 logic)
build_vars += "ATEN_STATIC_CUDA=0 USE_CUDA_STATIC_LINK=0 USE_CUPTI_SO=1 "
else:
print("Configuring build for bundled NVIDIA libraries")
# Keep existing static linking approach - already configured above
override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
desired_cuda = os.getenv("DESIRED_CUDA")

View File

@ -438,9 +438,7 @@ def build_torchvision(
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += (
f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"
)
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
@ -495,9 +493,7 @@ def build_torchdata(
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += (
f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"
)
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
@ -553,9 +549,7 @@ def build_torchtext(
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += (
f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"
)
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
@ -613,9 +607,7 @@ def build_torchaudio(
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += (
f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-')[0]}"
)
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"

View File

@ -120,8 +120,8 @@ If your new Docker image needs a library installed from a specific pinned commit
If you're introducing a new argument to the Docker build, make sure to add it in the Docker build step in `.ci/docker/build.sh`:
```bash
docker build \
....
--build-arg "NEW_ARG_1=${NEW_ARG_1}"
....
--build-arg "NEW_ARG_1=${NEW_ARG_1}"
```
3. **Update Dockerfile logic**:

View File

@ -64,6 +64,10 @@ FROM cuda as cuda12.9
RUN bash ./install_cuda.sh 12.9
ENV DESIRED_CUDA=12.9
FROM cuda as cuda13.0
RUN bash ./install_cuda.sh 13.0
ENV DESIRED_CUDA=13.0
FROM ${ROCM_IMAGE} as rocm
ENV PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
ADD ./common/install_mkl.sh install_mkl.sh
@ -76,10 +80,10 @@ ADD ./common/install_mnist.sh install_mnist.sh
RUN bash ./install_mnist.sh
FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.6 /usr/local/cuda-12.6 /usr/local/cuda-12.6
COPY --from=cuda12.8 /usr/local/cuda-12.8 /usr/local/cuda-12.8
COPY --from=cuda12.9 /usr/local/cuda-12.9 /usr/local/cuda-12.9
COPY --from=cuda13.0 /usr/local/cuda-13.0 /usr/local/cuda-13.0
# Final step
FROM ${BASE_TARGET} as final

View File

@ -76,10 +76,13 @@ elif [[ "$image" == *cuda*linter* ]]; then
elif [[ "$image" == *linter* ]]; then
# Use a separate Dockerfile for linter to keep a small image size
DOCKERFILE="linter/Dockerfile"
elif [[ "$image" == *riscv* ]]; then
# Use RISC-V specific Dockerfile
DOCKERFILE="ubuntu-cross-riscv/Dockerfile"
fi
_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb
_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
_UCX_COMMIT=7836b165abdbe468a2f607e7254011c07d788152
_UCC_COMMIT=430e241bf5d38cbc73fc7a6b89155397232e3f96
if [[ "$image" == *rocm* ]]; then
_UCX_COMMIT=cc312eaa4655c0cc5c2bcd796db938f90563bcf6
_UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d
@ -111,6 +114,16 @@ case "$tag" in
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc11)
CUDA_VERSION=13.0.0
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.10
@ -122,38 +135,6 @@ case "$tag" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9)
CUDA_VERSION=12.6.3
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.12
@ -164,39 +145,6 @@ case "$tag" in
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9)
CUDA_VERSION=12.8.1
ANACONDA_PYTHON_VERSION=3.10
@ -208,30 +156,18 @@ case "$tag" in
TRITON=yes
;;
pytorch-linux-jammy-py3-clang12-onnx)
ANACONDA_PYTHON_VERSION=3.9
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=12
VISION=yes
ONNX=yes
;;
pytorch-linux-jammy-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
pytorch-linux-jammy-py3.10-clang12)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=12
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.11-clang12)
ANACONDA_PYTHON_VERSION=3.11
CLANG_VERSION=12
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc9)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=9
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-rocm-n-py3 | pytorch-linux-noble-rocm-n-py3)
pytorch-linux-jammy-rocm-n-py3 | pytorch-linux-jammy-rocm-n-py3-benchmarks | pytorch-linux-noble-rocm-n-py3)
if [[ $tag =~ "jammy" ]]; then
ANACONDA_PYTHON_VERSION=3.10
else
@ -245,7 +181,9 @@ case "$tag" in
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
if [[ $tag =~ "benchmarks" ]]; then
INDUCTOR_BENCHMARKS=yes
fi
;;
pytorch-linux-noble-rocm-alpha-py3)
ANACONDA_PYTHON_VERSION=3.12
@ -257,27 +195,26 @@ case "$tag" in
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
PYTORCH_ROCM_ARCH="gfx90a;gfx942;gfx950"
;;
pytorch-linux-jammy-xpu-2025.0-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
VISION=yes
XPU_VERSION=2025.0
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-xpu-2025.1-py3)
ANACONDA_PYTHON_VERSION=3.9
pytorch-linux-jammy-xpu-n-1-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
XPU_VERSION=2025.1
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.9
pytorch-linux-jammy-xpu-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
XPU_VERSION=2025.2
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-py3-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
KATEX=yes
@ -285,8 +222,8 @@ case "$tag" in
DOCS=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-clang12)
ANACONDA_PYTHON_VERSION=3.10
CUDA_VERSION=12.8.1
CLANG_VERSION=12
VISION=yes
@ -297,8 +234,8 @@ case "$tag" in
CLANG_VERSION=18
VISION=yes
;;
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
pytorch-linux-jammy-py3.10-gcc11)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
KATEX=yes
@ -325,13 +262,10 @@ case "$tag" in
TRITON_CPU=yes
;;
pytorch-linux-jammy-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
# would be to upgrade mypy to 1.0.0 with Python 3.11
PYTHON_VERSION=3.9
PYTHON_VERSION=3.10
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter)
PYTHON_VERSION=3.9
pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-linter)
PYTHON_VERSION=3.10
CUDA_VERSION=12.8.1
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11)
@ -339,7 +273,6 @@ case "$tag" in
GCC_VERSION=11
ACL=yes
VISION=yes
CONDA_CMAKE=yes
OPENBLAS=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
@ -350,13 +283,15 @@ case "$tag" in
GCC_VERSION=11
ACL=yes
VISION=yes
CONDA_CMAKE=yes
OPENBLAS=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-noble-riscv64-py3.12-gcc14)
GCC_VERSION=14
;;
*)
# Catch-all for builds that are not hardcoded.
VISION=yes
@ -477,7 +412,14 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
fi
if [ -n "$GCC_VERSION" ]; then
if !(drun gcc --version 2>&1 | grep -q " $GCC_VERSION\\W"); then
if [[ "$image" == *riscv* ]]; then
# Check RISC-V cross-compilation toolchain version
if !(drun riscv64-linux-gnu-gcc-${GCC_VERSION} --version 2>&1 | grep -q " $GCC_VERSION\\W"); then
echo "RISC-V GCC_VERSION=$GCC_VERSION, but:"
drun riscv64-linux-gnu-gcc-${GCC_VERSION} --version
exit 1
fi
elif !(drun gcc --version 2>&1 | grep -q " $GCC_VERSION\\W"); then
echo "GCC_VERSION=$GCC_VERSION, but:"
drun gcc --version
exit 1
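
Note (illustrative, not part of the commit): the guard above accepts a toolchain only when the compiler banner contains the expected major version as its own token, which is what the `" $GCC_VERSION\\W"` pattern expresses. A minimal sketch of that check follows, assuming a hypothetical helper name `version_matches` and made-up banner strings; `\W` is a GNU grep extension, as in the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit).
# $1 = expected major version, $2 = text printed by `<compiler> --version`
version_matches() {
  # The leading space and the trailing non-word character keep "1" from
  # matching inside "11" or "14".
  printf '%s\n' "$2" | grep -q " $1\\W"
}
```

For the RISC-V image the banner would come from `riscv64-linux-gnu-gcc-${GCC_VERSION} --version` rather than `gcc --version`, which is the only difference between the two branches in the hunk.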

View File

@ -0,0 +1,2 @@
transformers==4.54.0
soxr==0.5.0

View File

@ -1 +0,0 @@
243e186efbf7fb93328dd6b34927a4e8c8f24395

View File

@ -0,0 +1 @@
v2.27.7-1

View File

@ -0,0 +1 @@
74a23feff57432129df84d8099e622773cf77925

View File

@ -1 +1 @@
ae324eeac8e102a2b40370e341460f3791353398
1b0418a9a454b2b93ab8d71f40e59d2297157fae

View File

@ -1 +1 @@
11ec6354315768a85da41032535e3b7b99c5f706
bbb06c0334a6772b92d24bde54956e675c8c6604

View File

@ -66,8 +66,9 @@ function do_cpython_build {
ln -s pip3 ${prefix}/bin/pip
fi
# install setuptools since it is required to use distutils on python 3.12 and later
${prefix}/bin/pip install wheel==0.45.1 setuptools==80.9.0
local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")
# packaging is needed to create symlink since wheel no longer provides needed information
${prefix}/bin/pip install packaging==25.0 wheel==0.45.1 setuptools==80.9.0
local abi_tag=$(${prefix}/bin/python -c "from packaging.tags import interpreter_name, interpreter_version; import sysconfig ; from sysconfig import get_config_var; print('{0}{1}-{0}{1}{2}'.format(interpreter_name(), interpreter_version(), 't' if sysconfig.get_config_var('Py_GIL_DISABLED') else ''))")
ln -sf ${prefix} /opt/python/${abi_tag}
}
@ -82,9 +83,9 @@ function build_cpython {
py_suffix=${py_ver::-1}
py_folder=$py_suffix
fi
# Only b3 is available now
# Update to rc2 due to https://github.com/python/cpython/commit/c72699086fe4
if [ "$py_suffix" == "3.14.0" ]; then
py_suffix="3.14.0b3"
py_suffix="3.14.0rc2"
fi
wget -q $PYTHON_DOWNLOAD_URL/$py_folder/Python-$py_suffix.tgz -O Python-$py_ver.tgz
do_cpython_build $py_ver Python-$py_suffix

View File

@ -10,7 +10,7 @@ else
arch_path='sbsa'
fi
NVSHMEM_VERSION=3.3.9
NVSHMEM_VERSION=3.3.24
function install_cuda {
version=$1
@ -62,14 +62,16 @@ function install_nvshmem {
mkdir -p "${tmpdir}" && cd "${tmpdir}"
# nvSHMEM license: https://docs.nvidia.com/nvshmem/api/sla.html
filename="libnvshmem_cuda${cuda_major_version}-linux-${arch_path}-${nvshmem_version}"
url="https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/builds/cuda${cuda_major_version}/txz/agnostic/${dl_arch}/${filename}.tar.gz"
# This pattern is a lie as it is not consistent across versions; for 3.3.9 it was cuda_ver-arch-nvshmem-ver
filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"
suffix=".tar.xz"
url="https://developer.download.nvidia.com/compute/nvshmem/redist/libnvshmem/linux-${arch_path}/${filename}${suffix}"
# download, unpack, install
wget -q "${url}"
tar xf "${filename}.tar.gz"
cp -a "libnvshmem/include/"* /usr/local/include/
cp -a "libnvshmem/lib/"* /usr/local/lib/
tar xf "${filename}${suffix}"
cp -a "${filename}/include/"* /usr/local/cuda/include/
cp -a "${filename}/lib/"* /usr/local/cuda/lib64/
# cleanup
cd ..
@ -126,74 +128,6 @@ function install_129 {
ldconfig
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
function prune_126 {
echo "Pruning CUDA 12.6"
#####################################################################################
# CUDA 12.6 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.6 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.6/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
}
function install_128 {
CUDNN_VERSION=9.8.0.87
echo "Installing CUDA 12.8.1 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
@ -212,18 +146,38 @@ function install_128 {
ldconfig
}
function install_130 {
CUDNN_VERSION=9.13.0.50
echo "Installing CUDA 13.0 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
# install CUDA 13.0 in the same container
install_cuda 13.0.0 cuda_13.0.0_580.65.06_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 13 $CUDNN_VERSION
install_nvshmem 13 $NVSHMEM_VERSION
CUDA_VERSION=13.0 bash install_nccl.sh
CUDA_VERSION=13.0 bash install_cusparselt.sh
ldconfig
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
12.4) install_124; prune_124
12.4) install_124;
;;
12.6|12.6.*) install_126; prune_126
12.6|12.6.*) install_126;
;;
12.8|12.8.*) install_128;
;;
12.9|12.9.*) install_129;
;;
13.0|13.0.*) install_130;
;;
*) echo "bad argument $1"; exit 1
;;
esac
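
Note (illustrative, not part of the commit): the `install_nvshmem` hunk above switches to a new archive naming scheme and a `.tar.xz` suffix. The sketch below rebuilds the download URL from the same three inputs; the function name `nvshmem_url` is an assumption, and whether a given version/architecture combination is actually published is not checked here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit).
# $1 = arch_path (x86_64 or sbsa), $2 = NVSHMEM version, $3 = CUDA major version
nvshmem_url() (
  # Same pattern as the updated hunk: arch first, then version, then CUDA major.
  filename="libnvshmem-linux-${1}-${2}_cuda${3}-archive"
  suffix=".tar.xz"
  printf '%s\n' "https://developer.download.nvidia.com/compute/nvshmem/redist/libnvshmem/linux-${1}/${filename}${suffix}"
)
```

Because the unpacked directory is now named after `${filename}`, the `cp -a` lines in the hunk copy from `${filename}/include` and `${filename}/lib` instead of a fixed `libnvshmem/` directory.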

View File

@ -5,7 +5,15 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-9]$ ]]; then
if [[ ${CUDA_VERSION:0:4} =~ "13" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.8.0.4_cuda13-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-9]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
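
Note (illustrative, not part of the commit): both the CUDA 13 branch and the 12.5 to 12.9 branch derive `arch_path` the same way, defaulting to `sbsa` and switching to `x86_64` for `amd64`/`x86_64` hosts. A minimal sketch of that mapping, assuming a hypothetical function name `cusparselt_arch_path`:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit).
# $1 = target architecture; falls back to `uname -m` when empty or unset,
# mirroring TARGETARCH=${TARGETARCH:-$(uname -m)} in the script.
cusparselt_arch_path() {
  case "${1:-$(uname -m)}" in
    amd64|x86_64) echo "x86_64" ;;
    *)            echo "sbsa" ;;
  esac
}
```

A `case` statement is used here only for compactness; it selects the same value as the two `[ ... ]` tests in the hunk.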

View File

@ -5,9 +5,7 @@ set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
function install_huggingface() {
local version
commit=$(get_pinned_commit huggingface)
pip_install "git+https://github.com/huggingface/transformers@${commit}"
pip_install -r huggingface-requirements.txt
}
function install_timm() {
@ -15,11 +13,34 @@ function install_timm() {
commit=$(get_pinned_commit timm)
pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"
# Clean up
conda_run pip uninstall -y torch torchvision triton
}
function install_torchbench() {
local commit
commit=$(get_pinned_commit torchbench)
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout "$commit"
python install.py --continue_on_fail
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd
chown -R jenkins torchbench
chown -R jenkins /opt/conda
}
# Pango is needed for weasyprint which is needed for doctr
conda_install pango
# Stable packages are ok here, just to satisfy TorchBench check
pip_install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
install_torchbench
install_huggingface
install_timm
# Clean up
conda_run pip uninstall -y torch torchvision torchaudio triton torchao

View File

@ -7,6 +7,8 @@ if [[ ${CUDA_VERSION:0:2} == "11" ]]; then
NCCL_VERSION=$(cat ci_commit_pins/nccl-cu11.txt)
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
NCCL_VERSION=$(cat ci_commit_pins/nccl-cu12.txt)
elif [[ ${CUDA_VERSION:0:2} == "13" ]]; then
NCCL_VERSION=$(cat ci_commit_pins/nccl-cu13.txt)
else
echo "Unexpected CUDA_VERSION ${CUDA_VERSION}"
exit 1
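
Note (illustrative, not part of the commit): the hunk above adds a third branch so that CUDA 13 reads its NCCL pin from `ci_commit_pins/nccl-cu13.txt`. The selection can be sketched as a single lookup on the leading two characters of the version; the function name `nccl_pin_file` is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit).
# $1 = CUDA version string such as 12.8.1 or 13.0
nccl_pin_file() {
  case "$1" in
    11*) echo "ci_commit_pins/nccl-cu11.txt" ;;
    12*) echo "ci_commit_pins/nccl-cu12.txt" ;;
    13*) echo "ci_commit_pins/nccl-cu13.txt" ;;
    *)
      # Same failure mode as the script: report and return non-zero.
      echo "Unexpected CUDA_VERSION $1" >&2
      return 1
      ;;
  esac
}
```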

View File

@ -19,8 +19,8 @@ pip_install \
transformers==4.36.2
pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnxscript==0.3.1
pip_install onnxruntime==1.22.1
pip_install onnxscript==0.4.0
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

View File

@ -57,7 +57,7 @@ if [ ! -f setup.py ]; then
cd python
fi
pip_install pybind11==2.13.6
pip_install pybind11==3.0.1
# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
as_jenkins sed -i -e 's/https:\/\/tritonlang.blob.core.windows.net\/llvm-builds/https:\/\/oaitriton.blob.core.windows.net\/public\/llvm-builds/g' setup.py
@ -103,5 +103,5 @@ fi
# It depends on torch and triton. We don't want to install
# triton and torch from production on Docker CI images
if [[ "$ANACONDA_PYTHON_VERSION" != 3.9* ]]; then
pip_install helion==0.0.10 --no-deps
pip_install helion --no-deps
fi

View File

@ -44,8 +44,12 @@ function install_ucc() {
./autogen.sh
# We only run distributed tests on Tesla M60 and A10G
NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"
if [[ -n "$CUDA_VERSION" && $CUDA_VERSION == 13* ]]; then
NVCC_GENCODE="-gencode=arch=compute_86,code=compute_86"
else
# We only run distributed tests on Tesla M60 and A10G
NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"
fi
if [[ -n "$ROCM_VERSION" ]]; then
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
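
Note (illustrative, not part of the commit): the hunk above drops the `sm_52` target when building UCC against CUDA 13, presumably because that toolkit no longer compiles for it, and keeps the previous two-target list otherwise. A minimal sketch of the selection, assuming a hypothetical function name `ucc_nvcc_gencode`:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit).
# $1 = CUDA version string; may be empty for non-CUDA builds
ucc_nvcc_gencode() {
  case "$1" in
    13*)
      echo "-gencode=arch=compute_86,code=compute_86"
      ;;
    *)
      # Distributed tests run on Tesla M60 and A10G, per the original comment.
      echo "-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"
      ;;
  esac
}
```

An empty argument falls through to the default list, which matches the `-n "$CUDA_VERSION"` guard in the script.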

View File

@ -34,18 +34,27 @@ function install_ubuntu() {
# The xpu-smi packages
apt-get install -y flex bison xpu-smi
# Compute and Media Runtimes
apt-get install -y \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
apt-get install -y intel-ocloc
if [[ "${XPU_DRIVER_TYPE,,}" == "lts" ]]; then
# Compute and Media Runtimes
apt-get install -y \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
else # rolling driver
apt-get install -y \
intel-opencl-icd libze-intel-gpu1 libze1 \
intel-media-va-driver-non-free libmfx-gen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo intel-ocloc
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev libze-dev
fi
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel Support Packages
apt-get install -y ${XPU_PACKAGES}
@ -56,10 +65,14 @@ function install_ubuntu() {
function install_rhel() {
. /etc/os-release
if [[ ! " 8.8 8.10 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then
echo "RHEL version ${VERSION_ID} not supported"
exit
if [[ "${ID}" == "rhel" ]]; then
if [[ ! " 8.8 8.9 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then
echo "RHEL version ${VERSION_ID} not supported"
exit
fi
elif [[ "${ID}" == "almalinux" ]]; then
# Workaround for almalinux8, which is used by quay.io/pypa/manylinux_2_28_x86_64
VERSION_ID="8.8"
fi
dnf install -y 'dnf-command(config-manager)'
@ -130,18 +143,18 @@ function install_sles() {
}
# Default use GPU driver LTS releases
XPU_DRIVER_VERSION="/lts/2350"
if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
# Use GPU driver rolling releases
XPU_DRIVER_VERSION=""
# Default use GPU driver rolling releases
XPU_DRIVER_VERSION=""
if [[ "${XPU_DRIVER_TYPE,,}" == "lts" ]]; then
# Use GPU driver LTS releases
XPU_DRIVER_VERSION="/lts/2350"
fi
# Default use Intel® oneAPI Deep Learning Essentials 2025.0
if [[ "$XPU_VERSION" == "2025.1" ]]; then
XPU_PACKAGES="intel-deep-learning-essentials-2025.1"
# Default use Intel® oneAPI Deep Learning Essentials 2025.1
if [[ "$XPU_VERSION" == "2025.2" ]]; then
XPU_PACKAGES="intel-deep-learning-essentials-2025.2"
else
XPU_PACKAGES="intel-deep-learning-essentials-2025.0"
XPU_PACKAGES="intel-deep-learning-essentials-2025.1"
fi
# The installation depends on the base OS
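
Note (illustrative, not part of the commit): the hunk above inverts the driver default, so rolling releases are used unless `XPU_DRIVER_TYPE` is `lts` in any letter case. A minimal sketch of that logic, assuming a hypothetical function name `xpu_driver_version` and using `tr` in place of bash's `${var,,}` so it also runs under plain `sh`:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit).
# $1 = value of XPU_DRIVER_TYPE (may be empty)
xpu_driver_version() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    lts) echo "/lts/2350" ;;  # LTS releases, only on explicit request
    *)   echo "" ;;           # rolling releases, the new default
  esac
}
```

The Dockerfile hunk elsewhere in this comparison sets `XPU_DRIVER_TYPE ROLLING`, which this sketch maps to the empty (rolling) value.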

View File

@ -0,0 +1,9 @@
#!/bin/bash
set -xe
# Script used in Linux x86 and aarch64 CD pipeline
# Workaround for exposing statically linked libstdc++ CXX11 ABI symbols.
# see: https://github.com/pytorch/pytorch/issues/133437
LIBNONSHARED=$(gcc -print-file-name=libstdc++_nonshared.a)
nm -g $LIBNONSHARED | grep " T " | grep recursive_directory_iterator | cut -c 20- > weaken-symbols.txt
objcopy --weaken-symbols weaken-symbols.txt $LIBNONSHARED $LIBNONSHARED

View File

@ -69,6 +69,19 @@ RUN bash ./install_cuda.sh 12.9
RUN bash ./install_magma.sh 12.9
RUN ln -sf /usr/local/cuda-12.9 /usr/local/cuda
FROM cuda as cuda13.0
RUN bash ./install_cuda.sh 13.0
RUN bash ./install_magma.sh 13.0
RUN ln -sf /usr/local/cuda-13.0 /usr/local/cuda
# Install libibverbs for libtorch and copy to CUDA directory
RUN apt-get update -y && \
apt-get install -y libibverbs-dev librdmacm-dev && \
cp /usr/lib/x86_64-linux-gnu/libmlx5.so* /usr/local/cuda/lib64/ && \
cp /usr/lib/x86_64-linux-gnu/librdmacm.so* /usr/local/cuda/lib64/ && \
cp /usr/lib/x86_64-linux-gnu/libibverbs.so* /usr/local/cuda/lib64/ && \
cp /usr/lib/x86_64-linux-gnu/libnl* /usr/local/cuda/lib64/
FROM cpu as rocm
ARG ROCM_VERSION
ARG PYTORCH_ROCM_ARCH

View File

@ -130,7 +130,8 @@ ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/op
RUN for cpython_version in "cp312-cp312" "cp313-cp313" "cp313-cp313t"; do \
/opt/python/${cpython_version}/bin/python -m pip install setuptools wheel; \
done;
ADD ./common/patch_libstdc.sh patch_libstdc.sh
RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh
# cmake-3.18.4 from pip; force in case cmake3 already exists
RUN yum install -y python3-pip && \
@ -175,6 +176,6 @@ ENV XPU_DRIVER_TYPE ROLLING
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
ADD ./common/install_xpu.sh install_xpu.sh
ENV XPU_VERSION 2025.1
ENV XPU_VERSION 2025.2
RUN bash ./install_xpu.sh && rm install_xpu.sh
RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd

View File

@ -71,3 +71,5 @@ RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH
ADD ./common/patch_libstdc.sh patch_libstdc.sh
RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh

View File

@ -95,3 +95,5 @@ COPY --from=nvpl /opt/nvpl/lib/ /usr/local/lib/
COPY --from=nvpl /opt/nvpl/include/ /usr/local/include/
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
ADD ./common/patch_libstdc.sh patch_libstdc.sh
RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh

View File

@ -67,6 +67,12 @@ case ${image} in
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="2_28"
;;
manylinux2_28-builder:cuda13*)
TARGET=cuda_final
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="2_28"
;;
manylinuxaarch64-builder:cuda*)
TARGET=cuda_final
GPU_IMAGE=amd64/almalinux:8

View File

@ -63,11 +63,12 @@ lark==0.12.0
#Pinned versions: 0.12.0
#test that import:
librosa>=0.6.2 ; python_version < "3.11"
librosa==0.10.2 ; python_version == "3.12"
librosa>=0.6.2 ; python_version < "3.11" and platform_machine != "s390x"
librosa==0.10.2 ; python_version == "3.12" and platform_machine != "s390x"
#Description: A python package for music and audio analysis
#Pinned versions: >=0.6.2
#test that import: test_spectral_ops.py
#librosa depends on numba; disable it for s390x while numba is disabled too
#mkl #this breaks linux-bionic-rocm4.5-py3.7
#Description: Intel oneAPI Math Kernel Library
@ -92,8 +93,9 @@ librosa==0.10.2 ; python_version == "3.12"
#Pinned versions:
#test that import:
mypy==1.16.0
mypy==1.16.0 ; platform_system != "Windows"
# Pin MyPy version because new errors are likely to appear with each release
# Skip on Windows as lots of type annotations are POSIX specific
#Description: linter
#Pinned versions: 1.16.0
#test that import: test_typing.py, test_type_hints.py
@ -110,14 +112,15 @@ ninja==1.11.1.3
#Pinned versions: 1.11.1.3
#test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
numba==0.49.0 ; python_version < "3.9"
numba==0.55.2 ; python_version == "3.9"
numba==0.55.2 ; python_version == "3.10"
numba==0.60.0 ; python_version == "3.12"
numba==0.49.0 ; python_version < "3.9" and platform_machine != "s390x"
numba==0.55.2 ; python_version == "3.9" and platform_machine != "s390x"
numba==0.55.2 ; python_version == "3.10" and platform_machine != "s390x"
numba==0.60.0 ; python_version == "3.12" and platform_machine != "s390x"
#Description: Just-In-Time Compiler for Numerical Functions
#Pinned versions: 0.54.1, 0.49.0, <=0.49.1
#test that import: test_numba_integration.py
#For numba issue see https://github.com/pytorch/pytorch/issues/51511
#Need release > 0.61.2 for s390x due to https://github.com/numba/numba/pull/10073
#numpy
#Description: Provides N-dimensional arrays and linear algebra
@ -261,11 +264,6 @@ scipy==1.14.1 ; python_version >= "3.12"
#Pinned versions:
#test that import:
tb-nightly==2.13.0a20230426
#Description: TensorBoard
#Pinned versions:
#test that import:
# needed by torchgen utils
typing-extensions>=4.10.0
#Description: type hints for python
@ -307,7 +305,7 @@ pytest-cpp==2.3.0
#Pinned versions: 2.3.0
#test that import:
z3-solver==4.15.1.0
z3-solver==4.15.1.0 ; platform_machine != "s390x"
#Description: The Z3 Theorem Prover Project
#Pinned versions:
#test that import:
@ -342,7 +340,7 @@ onnx==1.18.0
#Pinned versions:
#test that import:
onnxscript==0.3.1
onnxscript==0.4.0
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -361,7 +359,6 @@ pwlf==2.2.1
#Pinned versions: 2.2.1
#test that import: test_sac_estimator.py
# To build PyTorch itself
pyyaml
pyzstd
@ -383,7 +380,7 @@ dataclasses_json==0.6.7
cmake==4.0.0
#Description: required for building
tlparse==0.3.30
tlparse==0.4.0
#Description: required for log parsing
cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x"

View File

@ -1,7 +1,7 @@
sphinx==5.3.0
#Description: This is used to generate PyTorch docs
#Pinned versions: 5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@722b7e6f9ca512fcc526ad07d62b3d28c50bb6cd#egg=pytorch_sphinx_theme2
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@71e55749be14ceb56e7f8211a9fb649866b87ad4#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought that it is probably

View File

@ -1 +1 @@
3.4.0
3.5.0

View File

@ -1 +1 @@
3.4.0
3.5.0

View File

@ -0,0 +1,155 @@
# Cross-compilation Docker container for RISC-V architecture
ARG UBUNTU_VERSION
FROM --platform=linux/amd64 ubuntu:${UBUNTU_VERSION} as base
ARG UBUNTU_VERSION
ENV GCC_VERSION=14
ENV PYTHON_VERSION=3.12.3
ENV DEBIAN_FRONTEND=noninteractive
ENV CC=riscv64-linux-gnu-gcc-${GCC_VERSION}
ENV CXX=riscv64-linux-gnu-g++-${GCC_VERSION}
ENV QEMU_LD_PREFIX=/usr/riscv64-linux-gnu/
ENV SYSROOT=/opt/sysroot
# Install basic dependencies
RUN apt-get update && apt-get install -y \
ninja-build \
autoconf \
automake \
libtool \
patchelf \
ccache \
git \
wget \
python3-pip \
python3-venv \
python-is-python3 \
cmake \
sudo \
lsb-release \
gcc-${GCC_VERSION}-riscv64-linux-gnu \
g++-${GCC_VERSION}-riscv64-linux-gnu \
pkg-config \
&& rm -rf /var/lib/apt/lists/*
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
FROM base as python
ARG ZLIB_VERSION=1.3.1
ARG FFI_VERSION=3.4.6
ARG BZ2_VERSION=1.0.8
ARG XZ_VERSION=5.4.6
ARG OPENSSL_VERSION=3.2.1
# Set up sysroot directory for dependencies
ENV PKG_CONFIG_PATH=${SYSROOT}/lib/pkgconfig
ENV PKG_CONFIG_SYSROOT_DIR=${SYSROOT}
WORKDIR /opt
# Build zlib (for compression)
RUN echo "--- Building zlib ---" \
&& wget -c https://www.zlib.net/zlib-${ZLIB_VERSION}.tar.gz \
&& tar -xf zlib-${ZLIB_VERSION}.tar.gz --no-same-permissions --no-same-owner \
&& cd zlib-${ZLIB_VERSION}/ \
&& mkdir build && cd build \
&& ../configure --prefix=${SYSROOT} \
&& make -j$(nproc) && make install \
&& cd ../..
# Build libffi (for ctypes module)
RUN echo "--- Building libffi ---" \
&& wget -c https://github.com/libffi/libffi/releases/download/v${FFI_VERSION}/libffi-${FFI_VERSION}.tar.gz \
&& tar -xf libffi-${FFI_VERSION}.tar.gz --no-same-permissions --no-same-owner \
&& cd libffi-${FFI_VERSION}/ \
&& mkdir build && cd build \
&& ../configure --prefix=${SYSROOT} --host=riscv64-linux-gnu --build=x86_64-linux-gnu \
&& make -j$(nproc) && make install \
&& cd ../..
# Build bzip2 (for bz2 module)
RUN echo "--- Building bzip2 ---" \
&& wget -c https://sourceware.org/pub/bzip2/bzip2-${BZ2_VERSION}.tar.gz \
&& tar -xf bzip2-${BZ2_VERSION}.tar.gz --no-same-permissions --no-same-owner \
&& cd bzip2-${BZ2_VERSION}/ \
&& make CC=riscv64-linux-gnu-gcc-${GCC_VERSION} bzip2 bzip2recover libbz2.a \
&& make CC=riscv64-linux-gnu-gcc-${GCC_VERSION} -f Makefile-libbz2_so \
&& make install PREFIX=${SYSROOT} \
&& cp libbz2.so.${BZ2_VERSION} ${SYSROOT}/lib/ \
&& cd ${SYSROOT}/lib/ \
&& ln -sf libbz2.so.${BZ2_VERSION} libbz2.so.1.0 \
&& ln -sf libbz2.so.1.0 libbz2.so \
&& cd /opt/
# Build xz (for lzma module)
RUN echo "--- Building xz ---" \
&& wget -c https://github.com/tukaani-project/xz/releases/download/v${XZ_VERSION}/xz-${XZ_VERSION}.tar.gz \
&& tar -xf xz-${XZ_VERSION}.tar.gz --no-same-permissions --no-same-owner \
&& cd xz-${XZ_VERSION} \
&& mkdir build && cd build \
&& ../configure --prefix=${SYSROOT} --host=riscv64-linux-gnu --build=x86_64-linux-gnu \
&& make -j$(nproc) && make install \
&& cd ../..
# Build OpenSSL (for ssl module)
RUN echo "--- Building OpenSSL ---" \
&& wget -c https://www.openssl.org/source/openssl-${OPENSSL_VERSION}.tar.gz \
&& tar -xf openssl-${OPENSSL_VERSION}.tar.gz --no-same-permissions --no-same-owner \
&& cd openssl-${OPENSSL_VERSION}/ \
&& mkdir build && cd build \
&& ../Configure linux64-riscv64 --prefix=${SYSROOT} \
&& make -j$(nproc) && make install_sw \
&& cd ../..
# Build SQLite3 (for sqlite3 module)
RUN echo "--- Building SQLite3 ---" \
&& wget -c https://www.sqlite.org/2024/sqlite-autoconf-3450200.tar.gz \
&& tar -xf sqlite-autoconf-3450200.tar.gz --no-same-permissions --no-same-owner \
&& cd sqlite-autoconf-3450200 \
&& mkdir build && cd build \
&& ../configure --prefix=${SYSROOT} --host=riscv64-linux-gnu --build=x86_64-linux-gnu \
&& make -j$(nproc) && make install \
&& cd ../..
# Build and install RISC-V Python with all modules
RUN wget -c https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz \
&& tar -xf Python-${PYTHON_VERSION}.tgz --no-same-permissions --no-same-owner \
&& cd Python-${PYTHON_VERSION} \
&& mkdir build && cd build \
&& ../configure \
--host=riscv64-linux-gnu \
--build=x86_64-linux-gnu \
--prefix=${SYSROOT} \
--enable-shared \
--disable-ipv6 \
--with-build-python=/usr/bin/python3 \
--with-ensurepip=no \
ac_cv_file__dev_ptmx=yes \
ac_cv_file__dev_ptc=no \
&& make -j$(nproc) \
&& make install
FROM base as final
COPY --from=python /opt/sysroot /opt/sysroot
# Install crossenv and cmake
RUN pip install crossenv cmake==4.0.0 --break-system-packages \
&& /usr/bin/python3 -m crossenv ${SYSROOT}/bin/python3 /opt/riscv-cross-env
# Add pip-installed cmake binaries to PATH
ENV PATH="/usr/local/bin:${PATH}"
# Set up cross Python environment
SHELL ["/bin/bash", "-c"]
RUN source /opt/riscv-cross-env/bin/activate \
&& pip install setuptools pyyaml typing_extensions wheel
# Set default environment variables for PyTorch build
ENV Python_ROOT_DIR=${SYSROOT}
ENV OPENSSL_ROOT_DIR=${SYSROOT}
USER jenkins
CMD ["bash"]

View File

@ -96,10 +96,11 @@ ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/huggingface-requirements.txt huggingface-requirements.txt
COPY ci_commit_pins/timm.txt timm.txt
COPY ci_commit_pins/torchbench.txt torchbench.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface-requirements.txt torchbench.txt
# (optional) Install non-default Ninja version
ARG NINJA_VERSION

View File

@ -56,10 +56,10 @@ RUN rm install_openssl.sh
ARG INDUCTOR_BENCHMARKS
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/huggingface-requirements.txt huggingface-requirements.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface-requirements.txt
# Install XPU Dependencies
ARG XPU_VERSION

View File

@ -66,6 +66,7 @@ ENV NCCL_LIB_DIR="/usr/local/cuda/lib64/"
# (optional) Install UCC
ARG UCX_COMMIT
ARG UCC_COMMIT
ARG CUDA_VERSION
ENV UCX_COMMIT $UCX_COMMIT
ENV UCC_COMMIT $UCC_COMMIT
ENV UCX_HOME /usr
@ -96,10 +97,11 @@ RUN rm install_openssl.sh
ARG INDUCTOR_BENCHMARKS
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/huggingface-requirements.txt huggingface-requirements.txt
COPY ci_commit_pins/timm.txt timm.txt
COPY ci_commit_pins/torchbench.txt torchbench.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface-requirements.txt torchbench.txt
ARG TRITON
ARG TRITON_CPU
@ -180,7 +182,6 @@ COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
RUN if [ -n "${SKIP_LLVM_SRC_BUILD_INSTALL}" ]; then set -eu; rm -rf /opt/llvm; fi
# AWS specific CUDA build guidance
ENV TORCH_CUDA_ARCH_LIST Maxwell
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
ENV CUDA_PATH /usr/local/cuda

View File

@ -7,4 +7,4 @@ set -ex
SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
USE_CUSPARSELT=0 BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.9" ${SCRIPTPATH}/../manywheel/build.sh
USE_NVSHMEM=0 USE_CUSPARSELT=0 BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.10" ${SCRIPTPATH}/../manywheel/build.sh

31
.ci/lumen_cli/README.md Normal file
View File

@ -0,0 +1,31 @@
# 🔧 Lumen_cli
A Python CLI tool for building and testing PyTorch-based components, using a YAML configuration file for structured, repeatable workflows.
## Features
- **Build**
- external projects (e.g. vLLM)
## 📦 Installation
Run the following at the root of the PyTorch repo:
```bash
pip install -e .ci/lumen_cli
```
## Run the CLI tool
The CLI tool must be run from the root of the PyTorch repo. For example, to run the external vLLM build:
```bash
python -m cli.run build external vllm
```
This runs the build steps with the default behaviour for the vLLM project.
To see the help messages, run:
```bash
python3 -m cli.run --help
```
## Add customized external build logic
To add a new external build:
1. Create the build function in the `cli/lib` folder.
2. Register your target and the main build function at EXTERNAL_BUILD_TARGET_DISPATCH in `cli/build_cli/register_build.py`.
3. [Optional] Create your CI config file in `.github/ci_configs/${EXTERNAL_PACKAGE_NAME}.yaml`.
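
Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual registration code: `MyProjectBuildRunner`, the `myproject` target, and the `TARGETS` table are hypothetical names, and `BaseRunner` is redefined here as a stand-in only so the snippet runs on its own.

```python
import argparse
from abc import ABC, abstractmethod
from typing import Any


class BaseRunner(ABC):
    """Stand-in for the CLI's BaseRunner, redefined so this snippet is self-contained."""

    def __init__(self, args: Any) -> None:
        self.args = args

    @abstractmethod
    def run(self) -> None: ...


class MyProjectBuildRunner(BaseRunner):
    """Hypothetical runner for a hypothetical `myproject` target."""

    def run(self) -> None:
        print("building myproject")


# Hypothetical target table: target name -> runner class and help text.
TARGETS = {
    "myproject": {"runner": MyProjectBuildRunner, "help": "Build myproject."},
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cli.run")
    sub = parser.add_subparsers(dest="target", required=True)
    for name, spec in TARGETS.items():
        p = sub.add_parser(name, help=spec["help"])
        # Bind the runner class per target through a default argument.
        p.set_defaults(func=lambda args, cls=spec["runner"]: cls(args).run())
    return parser


args = build_parser().parse_args(["myproject"])
args.func(args)
```

Keeping the runner behind a small table like this means a new target probably only needs one new class and one new dictionary entry.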

View File

@ -0,0 +1,37 @@
import argparse
import logging
from cli.lib.common.cli_helper import register_targets, RichHelp, TargetSpec
from cli.lib.core.vllm.vllm_build import VllmBuildRunner
logger = logging.getLogger(__name__)
# Maps targets to their argparse configuration and runner
# Each entry is exposed as `python -m cli.run build external {target}` and dispatched to its build runner
_TARGETS: dict[str, TargetSpec] = {
"vllm": {
"runner": VllmBuildRunner,
"help": "Build vLLM using docker buildx.",
}
# add yours ...
}
def register_build_commands(subparsers: argparse._SubParsersAction) -> None:
build_parser = subparsers.add_parser(
"build",
help="Build related commands",
formatter_class=RichHelp,
)
build_subparsers = build_parser.add_subparsers(dest="build_command", required=True)
overview = "\n".join(
f" {name:12} {spec.get('help', '')}" for name, spec in _TARGETS.items()
)
external_parser = build_subparsers.add_parser(
"external",
help="Build external targets",
description="Build third-party targets.\n\nAvailable targets:\n" + overview,
formatter_class=RichHelp,
)
register_targets(external_parser, _TARGETS)

View File

@ -0,0 +1,71 @@
"""
Cli Argparser Utility helpers for CLI tasks.
"""
import argparse
from abc import ABC, abstractmethod
try:
from typing import Any, Callable, Required, TypedDict # Python 3.11+
except ImportError:
from typing import Any, Callable, TypedDict
from typing_extensions import Required # Fallback for Python <3.11
class BaseRunner(ABC):
def __init__(self, args: Any) -> None:
self.args = args
@abstractmethod
def run(self) -> None:
"""Run the main logic (required)."""
# Pretty help: keep newlines + show defaults
class RichHelp(
argparse.ArgumentDefaultsHelpFormatter, argparse.RawDescriptionHelpFormatter
):
pass
class TargetSpec(TypedDict, total=False):
"""CLI subcommand specification."""
runner: Required[type[BaseRunner]]
help: str
description: str
add_arguments: Callable[[argparse.ArgumentParser], None]
def register_targets(
parser: argparse.ArgumentParser,
target_specs: dict[str, TargetSpec],
common_args: Callable[[argparse.ArgumentParser], None] = lambda _: None,
) -> None:
"""Register target subcommands."""
targets = parser.add_subparsers(
dest="target",
required=True,
metavar="{" + ",".join(target_specs.keys()) + "}",
)
for name, spec in target_specs.items():
desc = spec.get("description") or spec["runner"].__doc__ or ""
p = targets.add_parser(
name,
help=spec.get("help", ""),
description=desc.strip(),
formatter_class=RichHelp,
)
p.set_defaults(
func=lambda args, cls=spec["runner"]: cls(args).run(),
_runner_class=spec["runner"],
)
if "add_arguments" in spec and callable(spec["add_arguments"]):
spec["add_arguments"](p)
if common_args:
common_args(p)
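
The `cls=spec["runner"]` default argument in `register_targets` matters because a Python closure looks up loop variables when it is called, not when it is defined. A minimal illustration of the difference, using hypothetical runner classes `A` and `B` that are not part of the CLI:

```python
class A:
    def run(self) -> str:
        return "A"


class B:
    def run(self) -> str:
        return "B"


specs = {"a": A, "b": B}

# Late binding: every lambda sees whatever `cls` holds after the loop ends.
late = {}
for name, cls in specs.items():
    late[name] = lambda: cls().run()

# Early binding: the default argument captures the class at definition time.
early = {}
for name, cls in specs.items():
    early[name] = lambda cls=cls: cls().run()

print(late["a"](), early["a"]())  # prints "B A"
```

Without the default argument, every registered subcommand would dispatch to the runner of the last target in the table.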

View File

@ -0,0 +1,42 @@
"""
Docker Utility helpers for CLI tasks.
"""
import logging
from typing import Optional
import docker
from docker.errors import APIError, NotFound
logger = logging.getLogger(__name__)
# lazy singleton so we don't reconnect every call
_docker_client: Optional[docker.DockerClient] = None
def _get_client() -> docker.DockerClient:
global _docker_client
if _docker_client is None:
_docker_client = docker.from_env()
return _docker_client
def local_image_exists(
image_name: str, client: Optional[docker.DockerClient] = None
) -> bool:
"""Return True if a local Docker image exists."""
if not image_name:
return False
client = client or _get_client()
try:
client.images.get(image_name)
return True
except (NotFound, APIError) as e:
logger.error(
"Error when checking Docker image '%s': %s",
image_name,
e.explanation if hasattr(e, "explanation") else str(e),
)
return False

View File

@ -0,0 +1,110 @@
"""
Environment Variables and Dataclasses Utility helpers for CLI tasks.
"""
import os
from dataclasses import field, fields, is_dataclass, MISSING
from pathlib import Path
from textwrap import indent
from typing import Optional, Union
from cli.lib.common.utils import str2bool
def get_env(name: str, default: str = "") -> str:
"""Get environment variable with default fallback."""
return os.environ.get(name) or default
def env_path_optional(
name: str,
default: Optional[Union[str, Path]] = None,
resolve: bool = True,
) -> Optional[Path]:
"""Get environment variable as optional Path."""
val = get_env(name) or default
if not val:
return None
path = Path(val)
return path.resolve() if resolve else path
def env_path(
name: str,
default: Optional[Union[str, Path]] = None,
resolve: bool = True,
) -> Path:
"""Get environment variable as Path, raise if missing."""
path = env_path_optional(name, default, resolve)
if not path:
raise ValueError(f"Missing path value for {name}")
return path
def env_bool(
name: str,
default: bool = False,
) -> bool:
val = get_env(name)
if not val:
return default
return str2bool(val)
def env_bool_field(
name: str,
default: bool = False,
):
return field(default_factory=lambda: env_bool(name, default))
def env_path_field(
name: str,
default: Union[str, Path] = "",
*,
resolve: bool = True,
) -> Path:
return field(default_factory=lambda: env_path(name, default, resolve=resolve))
def env_str_field(
name: str,
default: str = "",
) -> str:
return field(default_factory=lambda: get_env(name, default))
def generate_dataclass_help(cls) -> str:
"""Auto-generate help text for dataclass fields."""
if not is_dataclass(cls):
raise TypeError(f"{cls} is not a dataclass")
def get_value(f):
if f.default is not MISSING:
return f.default
if f.default_factory is not MISSING:
try:
return f.default_factory()
except Exception as e:
return f"<error: {e}>"
return "<required>"
lines = [f"{f.name:<22} = {repr(get_value(f))}" for f in fields(cls)]
return indent("\n".join(lines), " ")
def with_params_help(params_cls: type, title: str = "Parameter defaults"):
"""
Class decorator that appends a help table generated from another dataclass
(e.g., VllmParameters) to the decorated class's docstring.
"""
if not is_dataclass(params_cls):
raise TypeError(f"{params_cls} must be a dataclass")
def _decorator(cls: type) -> type:
block = generate_dataclass_help(params_cls)
cls.__doc__ = (cls.__doc__ or "") + f"\n\n{title}:\n{block}"
return cls
return _decorator
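
The `env_*_field` helpers above wrap `dataclasses.field(default_factory=...)`, so the environment is read when the dataclass is instantiated rather than when the class is defined. A small self-contained sketch of that behaviour, where `DemoParams` and the `DEMO_TAG` variable are hypothetical and the helper is a simplified copy:

```python
import os
from dataclasses import dataclass, field


def env_str_field(name: str, default: str = ""):
    # default_factory defers the environment lookup until instantiation time.
    return field(default_factory=lambda: os.environ.get(name) or default)


@dataclass
class DemoParams:  # hypothetical dataclass, for illustration only
    tag: str = env_str_field("DEMO_TAG", "latest")


os.environ.pop("DEMO_TAG", None)
print(DemoParams().tag)  # falls back to the default

os.environ["DEMO_TAG"] = "nightly"
print(DemoParams().tag)  # picks up a value set after the class was defined
```

This is why changing an environment variable between two instantiations yields two different parameter objects, which can be convenient in tests.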

View File

@ -0,0 +1,143 @@
from __future__ import annotations
import logging
import os
import textwrap
from pathlib import Path
from typing import TYPE_CHECKING
from cli.lib.common.utils import get_wheels
from jinja2 import Template
if TYPE_CHECKING:
from collections.abc import Iterable, Mapping
logger = logging.getLogger(__name__)
_TPL_CONTENT = Template(
textwrap.dedent("""\
## {{ title }}
```{{ lang }}
{{ content }}
```
""")
)
_TPL_LIST_ITEMS = Template(
textwrap.dedent("""\
## {{ title }}
{% for it in items %}
- {{ it.pkg }}: {{ it.relpath }}
{% else %}
_(no item found)_
{% endfor %}
""")
)
_TPL_TABLE = Template(
textwrap.dedent("""\
{%- if rows %}
| {{ cols | join(' | ') }} |
|{%- for _ in cols %} --- |{%- endfor %}
{%- for r in rows %}
| {%- for c in cols %} {{ r.get(c, "") }} |{%- endfor %}
{%- endfor %}
{%- else %}
_(no data)_
{%- endif %}
""")
)
def gh_summary_path() -> Path | None:
"""Return the Path to the GitHub step summary file, or None if not set."""
p = os.environ.get("GITHUB_STEP_SUMMARY")
return Path(p) if p else None
def write_gh_step_summary(md: str, *, append_content: bool = True) -> bool:
"""
Write Markdown content to the GitHub Step Summary file if GITHUB_STEP_SUMMARY is set.
append_content: if True (default), append to the end of the file; otherwise overwrite the whole file.
Returns:
True if written successfully (in GitHub Actions environment),
False if skipped (e.g., running locally where the variable is not set).
"""
sp = gh_summary_path()
if not sp:
logger.info("[gh-summary] GITHUB_STEP_SUMMARY not set, skipping write.")
return False
md_clean = textwrap.dedent(md).strip() + "\n"
mode = "a" if append_content else "w"
with sp.open(mode, encoding="utf-8") as f:
f.write(md_clean)
return True
def md_heading(text: str, level: int = 2) -> str:
"""Generate a Markdown heading string with the given level (1-6)."""
return f"{'#' * max(1, min(level, 6))} {text}\n"
def md_details(summary: str, content: str) -> str:
"""Generate a collapsible <details> block with a summary and inner content."""
return f"<details>\n<summary>{summary}</summary>\n\n{content}\n\n</details>\n"
def summarize_content_from_file(
output_dir: Path,
freeze_file: str,
title: str = "Content from file",
code_lang: str = "", # e.g. "text" or "ini"
) -> bool:
f = Path(output_dir) / freeze_file
if not f.exists():
return False
content = f.read_text(encoding="utf-8").strip()
md = render_content(content, title=title, lang=code_lang)
return write_gh_step_summary(md)
def summarize_wheels(path: Path, title: str = "Wheels", max_depth: int = 3):
items = get_wheels(path, max_depth=max_depth)
if not items:
return False
md = render_list(items, title=title)
return write_gh_step_summary(md)
def md_kv_table(rows: Iterable[Mapping[str, str | int | float]]) -> str:
"""
Render a list of dicts as a Markdown table using Jinja template.
"""
rows = list(rows)
# Keep first-seen key order so the column order is deterministic across runs
cols = list(dict.fromkeys(k for r in rows for k in r))
md = _TPL_TABLE.render(cols=cols, rows=rows).strip() + "\n"
return md
def render_list(
items: Iterable[Mapping[str, str]],  # each item provides "pkg" and "relpath"
*,
title: str = "List",
) -> str:
tpl = _TPL_LIST_ITEMS
md = tpl.render(title=title, items=items)
return md
def render_content(
content: str,
*,
title: str = "Content",
lang: str = "text",
) -> str:
tpl = _TPL_CONTENT
md = tpl.render(title=title, content=content, lang=lang)
return md
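
The append versus overwrite behaviour of `write_gh_step_summary` can be tried locally by pointing `GITHUB_STEP_SUMMARY` at a temporary file. The sketch below uses a simplified stand-in, `write_summary`, which is a hypothetical name and omits the logging and dedent steps of the real helper:

```python
import os
import tempfile
from pathlib import Path


def write_summary(md: str, *, append: bool = True) -> bool:
    """Simplified stand-in for write_gh_step_summary."""
    target = os.environ.get("GITHUB_STEP_SUMMARY")
    if not target:
        return False  # variable not set, e.g. when running locally
    with Path(target).open("a" if append else "w", encoding="utf-8") as f:
        f.write(md.strip() + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    summary = Path(tmp) / "summary.md"
    os.environ["GITHUB_STEP_SUMMARY"] = str(summary)  # simulate the CI variable
    write_summary("## Build")
    write_summary("- wheel built")
    print(summary.read_text(encoding="utf-8"))
    del os.environ["GITHUB_STEP_SUMMARY"]
```

Returning `False` instead of raising when the variable is unset lets the same code path run unchanged outside GitHub Actions.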

View File

@ -0,0 +1,69 @@
"""
Git Utility helpers for CLI tasks.
"""
import logging
from pathlib import Path
from cli.lib.common.path_helper import remove_dir
from git import GitCommandError, RemoteProgress, Repo
logger = logging.getLogger(__name__)
class PrintProgress(RemoteProgress):
"""Simple progress logger for git operations."""
def __init__(self, interval: int = 5):
super().__init__()
self._last_percent = -1
self._interval = interval
def update(self, op_code, cur, max=None, message=""):
msg = self._cur_line or message
if max and cur:
percent = int(cur / max * 100)
if percent != self._last_percent and percent % self._interval == 0:
self._last_percent = percent
logger.info("Progress: %d%% - %s", percent, msg)
elif msg:
logger.info(msg)
def clone_external_repo(target: str, repo: str, dst: str = "", update_submodules=False):
"""Clone repository with pinned commit and optional submodules."""
dst = dst or target
try:
logger.info("Cloning %s to %s", target, dst)
# Clone and fetch
remove_dir(dst)
r = Repo.clone_from(repo, dst, progress=PrintProgress())
r.git.fetch("--all", "--tags")
# Checkout pinned commit
commit = get_post_build_pinned_commit(target)
logger.info("Checking out pinned %s commit %s", target, commit)
r.git.checkout(commit)
# Update submodules if requested
if update_submodules and r.submodules:
logger.info("Updating %d submodule(s)", len(r.submodules))
for sm in r.submodules:
sm.update(init=True, recursive=True, progress=PrintProgress())
logger.info("Successfully cloned %s", target)
return r, commit
except GitCommandError as e:
logger.error("Git operation failed: %s", e)
raise
def get_post_build_pinned_commit(name: str, prefix=".github/ci_commit_pins") -> str:
path = Path(prefix) / f"{name}.txt"
if not path.exists():
raise FileNotFoundError(f"Pin file not found: {path}")
return path.read_text(encoding="utf-8").strip()

View File

@ -0,0 +1,14 @@
"""
Logger Utility helpers for CLI tasks.
"""
import logging
import sys
def setup_logging(level: int = logging.INFO):
logging.basicConfig(
level=level,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
stream=sys.stdout,
)

View File

@ -0,0 +1,62 @@
"""Path utility helpers for CLI tasks."""
import logging
import shutil
from pathlib import Path
from typing import Union
logger = logging.getLogger(__name__)
def get_path(path: Union[str, Path], resolve: bool = False) -> Path:
"""Convert to Path object, optionally resolving to absolute path."""
if not path:
raise ValueError("Path cannot be None or empty")
result = Path(path)
return result.resolve() if resolve else result
def ensure_dir_exists(path: Union[str, Path]) -> Path:
"""Create directory if it doesn't exist."""
path_obj = get_path(path)
path_obj.mkdir(parents=True, exist_ok=True)
return path_obj
def remove_dir(path: Union[str, Path, None]) -> None:
"""Remove directory if it exists."""
if not path:
return
path_obj = get_path(path)
if path_obj.exists():
shutil.rmtree(path_obj)
def force_create_dir(path: Union[str, Path]) -> Path:
"""Remove directory if exists, then create fresh empty directory."""
remove_dir(path)
return ensure_dir_exists(path)
def copy(src: Union[str, Path], dst: Union[str, Path]) -> None:
"""Copy file or directory from src to dst."""
src_path = get_path(src, resolve=True)
dst_path = get_path(dst, resolve=True)
if not src_path.exists():
raise FileNotFoundError(f"Source does not exist: {src_path}")
dst_path.parent.mkdir(parents=True, exist_ok=True)
if src_path.is_file():
shutil.copy2(src_path, dst_path)
elif src_path.is_dir():
shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
else:
raise ValueError(f"Unsupported path type: {src_path}")
def is_path_exist(path: Union[str, Path, None]) -> bool:
"""Check if path exists."""
return bool(path and get_path(path).exists())

View File

@ -0,0 +1,71 @@
import glob
import logging
import shlex
import shutil
import sys
from collections.abc import Iterable
from importlib.metadata import PackageNotFoundError, version # noqa: UP035
from typing import Optional, Union
from cli.lib.common.utils import run_command
logger = logging.getLogger(__name__)
def pip_install_packages(
packages: Iterable[str] = (),
env=None,
*,
requirements: Optional[str] = None,
constraints: Optional[str] = None,
prefer_uv: bool = False,
) -> None:
use_uv = prefer_uv and shutil.which("uv") is not None
base = (
[sys.executable, "-m", "uv", "pip", "install"]
if use_uv
else [sys.executable, "-m", "pip", "install"]
)
cmd = base[:]
if requirements:
cmd += ["-r", requirements]
if constraints:
cmd += ["-c", constraints]
cmd += list(packages)
logger.info("pip installing packages: %s", " ".join(map(shlex.quote, cmd)))
run_command(" ".join(map(shlex.quote, cmd)), env=env)
def pip_install_first_match(pattern: str, extras: Optional[str] = None, pref_uv=False):
wheel = first_matching_pkg(pattern)
target = f"{wheel}[{extras}]" if extras else wheel
logger.info("Installing %s...", target)
pip_install_packages([target], prefer_uv=pref_uv)
def run_python(args: Union[str, list[str]], env=None):
"""
Run the python in the current environment.
"""
if isinstance(args, str):
args = shlex.split(args)
cmd = [sys.executable] + args
run_command(" ".join(map(shlex.quote, cmd)), env=env)
def pkg_exists(name: str) -> bool:
try:
pkg_version = version(name)
logger.info("%s already exists with version: %s", name, pkg_version)
return True
except PackageNotFoundError:
logger.info("%s is not installed", name)
return False
def first_matching_pkg(pattern: str) -> str:
matches = sorted(glob.glob(pattern))
if not matches:
raise FileNotFoundError(f"No wheel matching: {pattern}")
return matches[0]
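
`pip_install_packages` and `run_python` join their argument lists with `shlex.quote` before handing a single string to `run_command`, which splits it again with `shlex.split`. A short check that this round trip preserves arguments containing spaces or brackets; the package and file names are made up for illustration:

```python
import shlex

# Hypothetical arguments, including values with spaces and extras brackets.
cmd = ["python", "-m", "pip", "install", "my pkg[extra]", "-c", "constraints file.txt"]

# Quote each argument, join into one command line, then split it back.
line = " ".join(map(shlex.quote, cmd))
print(line)
print(shlex.split(line) == cmd)
```

Joining without `shlex.quote` would split `my pkg[extra]` into two separate arguments on the way back.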

View File

@ -0,0 +1,139 @@
"""
General Utility helpers for CLI tasks.
"""
import logging
import os
import shlex
import subprocess
import sys
from contextlib import contextmanager
from pathlib import Path
from typing import Optional
logger = logging.getLogger(__name__)
def run_command(
cmd: str,
use_shell: bool = False,
log_cmd: bool = True,
cwd: Optional[str] = None,
env: Optional[dict] = None,
check: bool = True,
) -> int:
"""Run a command with optional shell execution."""
if use_shell:
args = cmd
log_prefix = "[shell]"
executable = "/bin/bash"
else:
args = shlex.split(cmd)
log_prefix = "[cmd]"
executable = None
if log_cmd:
display_cmd = cmd if use_shell else " ".join(args)
logger.info("%s %s", log_prefix, display_cmd)
run_env = {**os.environ, **(env or {})}
proc = subprocess.run(
args,
shell=use_shell,
executable=executable,
stdout=sys.stdout,
stderr=sys.stderr,
cwd=cwd,
env=run_env,
check=False,
)
if check and proc.returncode != 0:
logger.error(
"%s Command failed (exit %s): %s", log_prefix, proc.returncode, cmd
)
raise subprocess.CalledProcessError(
proc.returncode, args if not use_shell else cmd
)
return proc.returncode
def str2bool(value: Optional[str]) -> bool:
"""Convert environment variables to boolean values."""
if not value:
return False
if not isinstance(value, str):
raise ValueError(
f"Expected a string value for boolean conversion, got {type(value)}"
)
value = value.strip().lower()
true_value_set = {"1", "true", "t", "yes", "y", "on", "enable", "enabled", "found"}
false_value_set = {"0", "false", "f", "no", "n", "off", "disable"}
if value in true_value_set:
return True
if value in false_value_set:
return False
raise ValueError(f"Invalid string value for boolean conversion: {value}")
@contextmanager
def temp_environ(updates: dict[str, str]):
"""
Temporarily set environment variables and restore them after the block.
Args:
updates: Dict of environment variables to set.
"""
missing = object()
old: dict[str, str | object] = {k: os.environ.get(k, missing) for k in updates}
try:
os.environ.update(updates)
yield
finally:
for k, v in old.items():
if v is missing:
os.environ.pop(k, None)
else:
os.environ[k] = v # type: ignore[arg-type]
@contextmanager
def working_directory(path: str):
"""
Temporarily change the working directory inside a context.
"""
if not path:
# No-op context
yield
return
prev_cwd = os.getcwd()
try:
os.chdir(path)
yield
finally:
os.chdir(prev_cwd)
def get_wheels(
output_dir: Path,
max_depth: Optional[int] = None,
) -> list[dict[str, str]]:
"""Return a list of wheels found in the given output directory."""
root = Path(output_dir)
if not root.exists():
return []
items = []
for dirpath, _, filenames in os.walk(root):
depth = Path(dirpath).relative_to(root).parts
if max_depth is not None and len(depth) > max_depth:
continue
for fname in sorted(filenames):
if fname.endswith(".whl"):
pkg = fname.split("-")[0]
relpath = str((Path(dirpath) / fname).relative_to(root))
items.append({"pkg": pkg, "relpath": relpath})
return items
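
`temp_environ` above uses a sentinel object so it can tell a variable that was unset apart from one that was set to an empty string. A self-contained sketch of that restore logic, where the `DEMO_*` variable names are hypothetical:

```python
import os
from contextlib import contextmanager


@contextmanager
def temp_environ(updates):
    missing = object()  # sentinel: distinguishes "unset" from "set to empty string"
    old = {k: os.environ.get(k, missing) for k in updates}
    try:
        os.environ.update(updates)
        yield
    finally:
        for k, v in old.items():
            if v is missing:
                os.environ.pop(k, None)  # was unset before: remove it again
            else:
                os.environ[k] = v  # was set before: restore the old value


os.environ.pop("DEMO_UNSET", None)
os.environ["DEMO_KEEP"] = "orig"
with temp_environ({"DEMO_UNSET": "1", "DEMO_KEEP": "tmp"}):
    print(os.environ["DEMO_UNSET"], os.environ["DEMO_KEEP"])
print("DEMO_UNSET" in os.environ, os.environ["DEMO_KEEP"])
```

Because the restore runs in a `finally` block, the environment is put back even if the body raises.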

View File

@ -0,0 +1,292 @@
import logging
import os
import textwrap
from typing import Any
from cli.lib.common.gh_summary import write_gh_step_summary
from cli.lib.common.git_helper import clone_external_repo
from cli.lib.common.pip_helper import pip_install_packages
from cli.lib.common.utils import run_command, temp_environ, working_directory
from jinja2 import Template
logger = logging.getLogger(__name__)
_TPL_VLLM_INFO = Template(
textwrap.dedent("""\
## Vllm against Pytorch CI Test Summary
**Vllm Commit**: [{{ vllm_commit }}](https://github.com/vllm-project/vllm/commit/{{ vllm_commit }})
{%- if torch_sha %}
**Pytorch Commit**: [{{ torch_sha }}](https://github.com/pytorch/pytorch/commit/{{ torch_sha }})
{%- endif %}
""")
)
def sample_vllm_test_library():
"""
Simple sample to unblock vLLM CI development; it mimics
https://github.com/vllm-project/vllm/blob/main/.buildkite/test-pipeline.yaml
see run_test_plan for more details
"""
# TODO(elainewy): Read from yaml file to handle the env and tests for vllm
return {
"vllm_basic_correctness_test": {
"title": "Basic Correctness Test",
"id": "vllm_basic_correctness_test",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"steps": [
"pytest -v -s basic_correctness/test_cumem.py",
"pytest -v -s basic_correctness/test_basic_correctness.py",
"pytest -v -s basic_correctness/test_cpu_offload.py",
],
},
"vllm_basic_models_test": {
"title": "Basic models test",
"id": "vllm_basic_models_test",
"steps": [
"pytest -v -s models/test_transformers.py",
"pytest -v -s models/test_registry.py",
"pytest -v -s models/test_utils.py",
"pytest -v -s models/test_vision.py",
"pytest -v -s models/test_initialization.py",
],
},
"vllm_entrypoints_test": {
"title": "Entrypoints Test ",
"id": "vllm_entrypoints_test",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"steps": [
" ".join(
[
"pytest",
"-v",
"-s",
"entrypoints/llm",
"--ignore=entrypoints/llm/test_generate.py",
"--ignore=entrypoints/llm/test_collective_rpc.py",
]
),
"pytest -v -s entrypoints/llm/test_generate.py",
"pytest -v -s entrypoints/offline_mode",
],
},
"vllm_regression_test": {
"title": "Regression Test",
"id": "vllm_regression_test",
"package_install": ["modelscope"],
"steps": [
"pytest -v -s test_regression.py",
],
},
"vllm_lora_tp_test_distributed": {
"title": "LoRA TP Test (Distributed)",
"id": "vllm_lora_tp_test_distributed",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"num_gpus": 4,
"steps": [
"pytest -v -s -x lora/test_chatglm3_tp.py",
"pytest -v -s -x lora/test_llama_tp.py",
"pytest -v -s -x lora/test_llm_with_multi_loras.py",
],
},
"vllm_distributed_test_28_failure_test": {
"title": "Distributed Tests (2 GPUs) pytorch 2.8 release failure",
"id": "vllm_distributed_test_28_failure_test",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"num_gpus": 4,
"steps": [
"pytest -v -s distributed/test_sequence_parallel.py",
],
},
"vllm_lora_28_failure_test": {
"title": "LoRA pytorch 2.8 failure test",
"id": "vllm_lora_28_failure_test",
"steps": ["pytest -v lora/test_quant_model.py"],
},
"vllm_multi_model_processor_test": {
"title": "Multi-Modal Processor Test",
"id": "vllm_multi_model_processor_test",
"package_install": ["git+https://github.com/TIGER-AI-Lab/Mantis.git"],
"steps": [
"pytest -v -s models/multimodal/processing --ignore models/multimodal/processing/test_tensor_schema.py",
],
},
"vllm_multi_model_test_28_failure_test": {
"title": "Multi-Model Test (Failed 2.8 release)",
"id": "vllm_multi_model_test_28_failure_test",
"package_install": ["git+https://github.com/TIGER-AI-Lab/Mantis.git"],
"steps": [
"pytest -v -s models/multimodal/generation/test_voxtral.py",
"pytest -v -s models/multimodal/pooling",
],
},
"vllm_pytorch_compilation_unit_tests": {
"title": "PyTorch Compilation Unit Tests",
"id": "vllm_pytorch_compilation_unit_tests",
"steps": [
"pytest -v -s compile/test_pass_manager.py",
"pytest -v -s compile/test_fusion.py",
"pytest -v -s compile/test_fusion_attn.py",
"pytest -v -s compile/test_silu_mul_quant_fusion.py",
"pytest -v -s compile/test_sequence_parallelism.py",
"pytest -v -s compile/test_async_tp.py",
"pytest -v -s compile/test_fusion_all_reduce.py",
"pytest -v -s compile/test_decorator.py",
],
},
"vllm_languagde_model_test_extended_generation_28_failure_test": {
"title": "Language Models Test (Extended Generation) 2.8 release failure",
"id": "vllm_languagde_model_test_extended_generation_28_failure_test",
"package_install": [
"--no-build-isolation",
"git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8",
],
"steps": [
"pytest -v -s models/language/generation/test_mistral.py",
],
},
"vllm_distributed_test_2_gpu_28_failure_test": {
"title": "Distributed Tests (2 GPUs) pytorch 2.8 release failure",
"id": "vllm_distributed_test_2_gpu_28_failure_test",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"num_gpus": 4,
"steps": [
"pytest -v -s distributed/test_sequence_parallel.py",
],
},
# TODO(elainewy):need to add g6 with 4 gpus to run this test
"vllm_lora_test": {
"title": "LoRA Test %N",
"id": "lora_test",
"parallelism": 4,
"steps": [
"echo '[checking] list sharded lora tests:'",
" ".join(
[
"pytest -q --collect-only lora",
"--shard-id=$$BUILDKITE_PARALLEL_JOB",
"--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT",
"--ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py",
]
),
"echo '[checking] Done. list lora tests'",
" ".join(
[
"pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB",
"--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT",
"--ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py",
]
),
],
},
}
def check_parallelism(tests: Any, title: str, shard_id: int = 0, num_shards: int = 0):
"""
Check whether the test plan is sharded (parallel) and validate the shard arguments.
"""
parallelism = int(tests.get("parallelism", "0"))
is_parallel = parallelism and parallelism > 1
if not is_parallel:
return False
if shard_id > num_shards:
raise RuntimeError(
f"Test {title} expects {num_shards} shards, but invalid {shard_id} is provided"
)
if num_shards != parallelism:
raise RuntimeError(
f"Test {title} expects {parallelism} shards, but invalid {num_shards} is provided"
)
return True
def run_test_plan(
test_plan: str,
test_target: str,
tests_map: dict[str, Any],
shard_id: int = 0,
num_shards: int = 0,
):
"""
Run the list of test steps defined by the given test plan.
"""
logger.info("run %s tests.....", test_target)
if test_plan not in tests_map:
raise RuntimeError(
f"test {test_plan} not found, please add it to test plan pool"
)
tests = tests_map[test_plan]
pkgs = tests.get("package_install", [])
title = tests.get("title", "unknown test")
is_parallel = check_parallelism(tests, title, shard_id, num_shards)
if is_parallel:
title = title.replace("%N", f"{shard_id}/{num_shards}")
logger.info("Running tests: %s", title)
if pkgs:
logger.info("Installing packages: %s", pkgs)
pip_install_packages(packages=pkgs, prefer_uv=True)
with (
working_directory(tests.get("working_directory", "tests")),
temp_environ(tests.get("env_vars", {})),
):
failures = []
for step in tests["steps"]:
logger.info("Running step: %s", step)
if is_parallel:
step = replace_buildkite_placeholders(step, shard_id, num_shards)
logger.info("Running parallel step: %s", step)
code = run_command(cmd=step, check=False, use_shell=True)
if code != 0:
failures.append(step)
logger.info("Finish running step: %s", step)
if failures:
logger.error("Failed tests: %s", failures)
raise RuntimeError(f"{len(failures)} pytest runs failed: {failures}")
logger.info("Done. All tests passed")
def clone_vllm(dst: str = "vllm"):
_, commit = clone_external_repo(
target="vllm",
repo="https://github.com/vllm-project/vllm.git",
dst=dst,
update_submodules=True,
)
return commit
def replace_buildkite_placeholders(step: str, shard_id: int, num_shards: int) -> str:
mapping = {
"$$BUILDKITE_PARALLEL_JOB_COUNT": str(num_shards),
"$$BUILDKITE_PARALLEL_JOB": str(shard_id),
}
for k in sorted(mapping, key=len, reverse=True):
step = step.replace(k, mapping[k])
return step
def summarize_build_info(vllm_commit: str) -> bool:
torch_sha = os.getenv("GITHUB_SHA")
md = (
_TPL_VLLM_INFO.render(vllm_commit=vllm_commit, torch_sha=torch_sha).strip()
+ "\n"
)
return write_gh_step_summary(md)
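
`replace_buildkite_placeholders` above sorts its keys longest first because `$$BUILDKITE_PARALLEL_JOB` is a prefix of `$$BUILDKITE_PARALLEL_JOB_COUNT`. A small comparison of the two orderings; `replace_naive` and `replace_longest_first` are hypothetical helper names used only for this illustration:

```python
def replace_naive(step: str, mapping: dict) -> str:
    # Replaces keys in insertion order; a short key can clobber a longer one.
    for k, v in mapping.items():
        step = step.replace(k, v)
    return step


def replace_longest_first(step: str, mapping: dict) -> str:
    # Longest keys first, so "..._JOB_COUNT" is handled before its prefix "..._JOB".
    for k in sorted(mapping, key=len, reverse=True):
        step = step.replace(k, mapping[k])
    return step


mapping = {
    "$$BUILDKITE_PARALLEL_JOB": "1",
    "$$BUILDKITE_PARALLEL_JOB_COUNT": "4",
}
step = "--shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT"

print(replace_naive(step, mapping))          # --shard-id=1 --num-shards=1_COUNT
print(replace_longest_first(step, mapping))  # --shard-id=1 --num-shards=4
```

Sorting by length keeps the result independent of the order in which the mapping happens to be written.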

View File

@ -0,0 +1,285 @@
import logging
import os
import textwrap
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
from cli.lib.common.cli_helper import BaseRunner
from cli.lib.common.docker_helper import local_image_exists
from cli.lib.common.envs_helper import (
env_bool_field,
env_path_field,
env_str_field,
with_params_help,
)
from cli.lib.common.gh_summary import (
gh_summary_path,
summarize_content_from_file,
summarize_wheels,
)
from cli.lib.common.path_helper import (
copy,
ensure_dir_exists,
force_create_dir,
get_path,
is_path_exist,
)
from cli.lib.common.utils import run_command
from cli.lib.core.vllm.lib import clone_vllm, summarize_build_info
logger = logging.getLogger(__name__)
# Default path for docker build artifacts
_DEFAULT_RESULT_PATH = "./shared"
# Temp folder inside the vllm workspace; torch wheels are copied here so the docker build can access them
_VLLM_TEMP_FOLDER = "tmp"
@dataclass
class VllmBuildParameters:
"""
Parameters defining the vllm external input configurations.
Combine with VllmDockerBuildArgs to define the vllm build environment
"""
# USE_TORCH_WHEEL: when true, use local Torch wheels; requires TORCH_WHEELS_PATH.
# Otherwise, the docker build pulls torch nightly during the build
# TORCH_WHEELS_PATH: directory containing local torch wheels when use_torch_whl is True
use_torch_whl: bool = env_bool_field("USE_TORCH_WHEEL", True)
torch_whls_path: Path = env_path_field("TORCH_WHEELS_PATH", "./dist")
# USE_LOCAL_BASE_IMAGE: when true, use an existing local Docker base image; requires BASE_IMAGE
# Otherwise, pull dockerfile's default image remotely
# BASE_IMAGE: name:tag (only needed when use_local_base_image is True)
use_local_base_image: bool = env_bool_field("USE_LOCAL_BASE_IMAGE", True)
base_image: str = env_str_field("BASE_IMAGE")
# USE_LOCAL_DOCKERFILE: when true ("1"), use a local Dockerfile; requires DOCKERFILE_PATH.
# Otherwise, use vllm's default Dockerfile.nightly_torch for the build.
# DOCKERFILE_PATH: path to the Dockerfile used when use_local_dockerfile is True
use_local_dockerfile: bool = env_bool_field("USE_LOCAL_DOCKERFILE", True)
dockerfile_path: Path = env_path_field(
"DOCKERFILE_PATH", ".github/ci_configs/vllm/Dockerfile.tmp_vllm"
)
# OUTPUT_DIR: where docker buildx (local exporter) will write artifacts
output_dir: Path = env_path_field("OUTPUT_DIR", "external/vllm")
# --- Build args ----------------------------------------------------------
target_stage: str = env_str_field("TARGET_STAGE", "export-wheels")
tag_name: str = env_str_field("TAG", "vllm-wheels")
cuda_version: str = env_str_field("CUDA_VERSION", "12.8.1")
python_version: str = env_str_field("PYTHON_VERSION", "3.12")
max_jobs: str = env_str_field("MAX_JOBS", "64")
sccache_bucket: str = env_str_field("SCCACHE_BUCKET")
sccache_region: str = env_str_field("SCCACHE_REGION")
torch_cuda_arch_list: str = env_str_field("TORCH_CUDA_ARCH_LIST", "8.9")
def __post_init__(self):
checks = [
(
self.use_torch_whl, # flag
True, # trigger_value
"torch_whls_path", # resource
is_path_exist, # check_func
"TORCH_WHEELS_PATH is not provided, but USE_TORCH_WHEEL is set to 1",
),
(
self.use_local_base_image,
True,
"base_image",
local_image_exists,
f"BASE_IMAGE {self.base_image} was not found, but USE_LOCAL_BASE_IMAGE is set to 1",
),
(
self.use_local_dockerfile,
True,
"dockerfile_path",
is_path_exist,
"DOCKERFILE_PATH path was not found, but USE_LOCAL_DOCKERFILE is set to 1",
),
]
for flag, trigger_value, attr_name, check_func, error_msg in checks:
value = getattr(self, attr_name)
if flag == trigger_value:
if not value or not check_func(value):
raise ValueError(error_msg)
else:
logger.info("flag %s is not set", flag)
if not self.output_dir:
raise ValueError("missing required output_dir")
@with_params_help(VllmBuildParameters)
class VllmBuildRunner(BaseRunner):
"""
Build vLLM using docker buildx.
Environment variable options:
"USE_TORCH_WHEEL": "1: use local wheels; 0: pull nightly from pypi",
"TORCH_WHEELS_PATH": "Path to local wheels (when USE_TORCH_WHEEL=1)",
"USE_LOCAL_BASE_IMAGE": "1: use local base image; 0: default image",
"BASE_IMAGE": "name:tag to indicate base image the dockerfile depends on (when USE_LOCAL_BASE_IMAGE=1)",
"USE_LOCAL_DOCKERFILE": "1: use local Dockerfile; 0: vllm repo default Dockerfile.nightly_torch",
"DOCKERFILE_PATH": "Path to Dockerfile (when USE_LOCAL_DOCKERFILE=1)",
"OUTPUT_DIR": "e.g. './shared'",
"TORCH_CUDA_ARCH_LIST": "e.g. '8.0' or '8.0;9.0'",
"CUDA_VERSION": "e.g. '12.8.1'",
"PYTHON_VERSION": "e.g. '3.12'",
"MAX_JOBS": "e.g. '64'",
"SCCACHE_BUCKET": "e.g. 'my-bucket'",
"SCCACHE_REGION": "e.g. 'us-west-2'",
"""
def __init__(self, args=None):
self.work_directory = "vllm"
def run(self):
"""
main function to run vllm build
1. prepare vllm build environment
2. prepare the docker build command args
3. run docker build
"""
inputs = VllmBuildParameters()
logger.info("Running vllm build with inputs: %s", inputs)
vllm_commit = clone_vllm()
self.cp_dockerfile_if_exist(inputs)
# copy torch wheels from the root directory to the vllm workspace, if they exist
self.cp_torch_whls_if_exist(inputs)
# make sure the output dir that stores the build artifacts exists
ensure_dir_exists(Path(inputs.output_dir))
cmd = self._generate_docker_build_cmd(inputs)
logger.info("Running docker build: \n %s", cmd)
try:
run_command(cmd, cwd="vllm", env=os.environ.copy())
finally:
self.generate_vllm_build_summary(vllm_commit, inputs)
def generate_vllm_build_summary(
self, vllm_commit: str, inputs: VllmBuildParameters
):
if not gh_summary_path():
return logger.info("Skipping: GH Summary env var not detected")
logger.info("Generate GH Summary ...")
# summarize vllm build info
summarize_build_info(vllm_commit)
# summarize vllm build artifacts
vllm_artifact_dir = inputs.output_dir / "wheels"
summarize_content_from_file(
vllm_artifact_dir,
"build_summary.txt",
title="Vllm build env pip package summary",
)
summarize_wheels(
inputs.torch_whls_path, max_depth=3, title="Torch Wheels Artifacts"
)
summarize_wheels(vllm_artifact_dir, max_depth=3, title="Vllm Wheels Artifacts")
def cp_torch_whls_if_exist(self, inputs: VllmBuildParameters) -> str:
if not inputs.use_torch_whl:
return ""
tmp_dir = f"./{self.work_directory}/{_VLLM_TEMP_FOLDER}"
tmp_path = Path(tmp_dir)
force_create_dir(tmp_path)
copy(inputs.torch_whls_path, tmp_dir)
return tmp_dir
def cp_dockerfile_if_exist(self, inputs: VllmBuildParameters):
if not inputs.use_local_dockerfile:
logger.info("using vllm default Dockerfile.nightly_torch for build")
return
dockerfile_path = get_path(inputs.dockerfile_path, resolve=True)
vllm_torch_dockerfile = Path(
f"./{self.work_directory}/docker/Dockerfile.nightly_torch"
)
copy(dockerfile_path, vllm_torch_dockerfile)
def get_result_path(self, path):
"""
Get the absolute path of the result path
"""
if not path:
path = _DEFAULT_RESULT_PATH
abs_path = get_path(path, resolve=True)
return abs_path
def _get_torch_wheel_path_arg(self, torch_whl_dir: Optional[Path]) -> str:
if not torch_whl_dir:
return ""
return f"--build-arg TORCH_WHEELS_PATH={_VLLM_TEMP_FOLDER}"
def _get_base_image_args(self, inputs: VllmBuildParameters) -> tuple[str, str, str]:
"""
Returns:
- base_image_arg: docker buildx arg string for base image
- final_base_image_arg: docker buildx arg string for vllm-base stage
- pull_flag: --pull=false when the image exists locally, otherwise an empty string
"""
if not inputs.use_local_base_image:
return "", "", ""
base_image = inputs.base_image
# set both base image and final base image to the same local image
base_image_arg = f"--build-arg BUILD_BASE_IMAGE={base_image}"
final_base_image_arg = f"--build-arg FINAL_BASE_IMAGE={base_image}"
if local_image_exists(base_image):
pull_flag = "--pull=false"
return base_image_arg, final_base_image_arg, pull_flag
logger.info(
"[INFO] Local image not found: %s, will try to pull from remote", base_image
)
return base_image_arg, final_base_image_arg, ""
def _generate_docker_build_cmd(
self,
inputs: VllmBuildParameters,
) -> str:
base_image_arg, final_base_image_arg, pull_flag = self._get_base_image_args(
inputs
)
torch_arg = self._get_torch_wheel_path_arg(inputs.torch_whls_path)
return textwrap.dedent(
f"""
docker buildx build \
--output type=local,dest={inputs.output_dir} \
-f docker/Dockerfile.nightly_torch \
{pull_flag} \
{torch_arg} \
{base_image_arg} \
{final_base_image_arg} \
--build-arg max_jobs={inputs.max_jobs} \
--build-arg CUDA_VERSION={inputs.cuda_version} \
--build-arg PYTHON_VERSION={inputs.python_version} \
--build-arg USE_SCCACHE={int(bool(inputs.sccache_bucket and inputs.sccache_region))} \
--build-arg SCCACHE_BUCKET_NAME={inputs.sccache_bucket} \
--build-arg SCCACHE_REGION_NAME={inputs.sccache_region} \
--build-arg torch_cuda_arch_list='{inputs.torch_cuda_arch_list}' \
--target {inputs.target_stage} \
-t {inputs.tag_name} \
--progress=plain .
"""
).strip()
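
Note on the flag-gated checks in VllmBuildParameters.__post_init__ above: they follow a table-driven pattern in which a resource is validated only when its flag is enabled. A minimal, self-contained sketch of that pattern follows. ExampleParams, path_exists, and the field names are hypothetical and are not part of the CLI; path_exists only stands in for a helper such as is_path_exist.

```python
from dataclasses import dataclass
from pathlib import Path


def path_exists(value) -> bool:
    # Hypothetical stand-in for a path check: True only for a non-empty, existing path.
    return bool(value) and Path(value).exists()


@dataclass
class ExampleParams:
    use_wheels: bool = False
    wheels_path: str = ""

    def __post_init__(self):
        # Each row: (flag, attribute to validate, check function, error message).
        checks = [
            (
                self.use_wheels,
                "wheels_path",
                path_exists,
                "wheels_path must exist when use_wheels is enabled",
            ),
        ]
        for flag, attr_name, check_func, error_msg in checks:
            if not flag:
                continue  # the resource is only required when its flag is on
            value = getattr(self, attr_name)
            if not value or not check_func(value):
                raise ValueError(error_msg)


ExampleParams()  # flag off: nothing is validated
ExampleParams(use_wheels=True, wheels_path=".")  # flag on and the path exists
```

Keeping the rules in one list means a new flag/resource pair is added as a row, without touching the loop.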


@ -0,0 +1,269 @@
import logging
import os
import re
import subprocess
import sys
from collections.abc import Iterable
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Any
from cli.lib.common.cli_helper import BaseRunner
from cli.lib.common.envs_helper import env_path_field, env_str_field, get_env
from cli.lib.common.path_helper import copy, remove_dir
from cli.lib.common.pip_helper import (
pip_install_first_match,
pip_install_packages,
pkg_exists,
run_python,
)
from cli.lib.common.utils import run_command, working_directory
from cli.lib.core.vllm.lib import clone_vllm, run_test_plan, sample_vllm_test_library
logger = logging.getLogger(__name__)
@dataclass
class VllmTestParameters:
"""
Parameters defining the vllm external test input
!!!DO NOT ADD SECRETS IN THIS CLASS!!!
You can put an environment variable name in VllmTestParameters if it is not the same as the secret one;
fetch secrets directly from env variables at runtime.
"""
torch_whls_path: Path = env_path_field("WHEELS_PATH", "./dist")
vllm_whls_path: Path = env_path_field(
"VLLM_WHEELS_PATH", "./dist/external/vllm/wheels"
)
torch_cuda_arch_list: str = env_str_field("TORCH_CUDA_ARCH_LIST", "8.9")
def __post_init__(self):
if not self.torch_whls_path.exists():
raise ValueError("missing torch_whls_path")
if not self.vllm_whls_path.exists():
raise ValueError("missing vllm_whls_path")
class TestInpuType(Enum):
TEST_PLAN = "test_plan"
UNKNOWN = "unknown"
class VllmTestRunner(BaseRunner):
def __init__(self, args: Any):
self.work_directory = "vllm"
self.test_plan = ""
self.test_type = TestInpuType.UNKNOWN
self.shard_id = args.shard_id
self.num_shards = args.num_shards
if args.test_plan:
self.test_plan = args.test_plan
self.test_type = TestInpuType.TEST_PLAN
# Matches the structure in the artifacts.zip from the torch build
self.TORCH_WHL_PATH_REGEX = "torch*.whl"
self.TORCH_WHL_EXTRA = "opt-einsum"
self.TORCH_ADDITIONAL_WHLS_REGEX = [
"vision/torchvision*.whl",
"audio/torchaudio*.whl",
]
# Match the structure of the artifacts.zip from vllm external build
self.VLLM_TEST_WHLS_REGEX = [
"xformers/*.whl",
"vllm/vllm*.whl",
"flashinfer-python/flashinfer*.whl",
]
def prepare(self):
"""
Prepare the test environment for vllm: clone the vllm repo, install all wheels and test dependencies, and set env vars.
"""
params = VllmTestParameters()
logger.info("Display VllmTestParameters %s", params)
self._set_envs(params)
clone_vllm(dst=self.work_directory)
with working_directory(self.work_directory):
remove_dir(Path("vllm"))
self._install_wheels(params)
self._install_dependencies()
# verify the torches are not overridden by test dependencies
check_versions()
def run(self):
"""
main function to run vllm test
"""
self.prepare()
try:
with working_directory(self.work_directory):
if self.test_type == TestInpuType.TEST_PLAN:
if self.num_shards > 1:
run_test_plan(
self.test_plan,
"vllm",
sample_vllm_test_library(),
self.shard_id,
self.num_shards,
)
else:
run_test_plan(
self.test_plan, "vllm", sample_vllm_test_library()
)
else:
raise ValueError(f"Unknown test type {self.test_type}")
finally:
# double check the torches are not overridden by other packages
check_versions()
def _install_wheels(self, params: VllmTestParameters):
logger.info("Running vllm test with inputs: %s", params)
if not pkg_exists("torch"):
# install torch from local whls if it's not installed yet.
torch_p = f"{str(params.torch_whls_path)}/{self.TORCH_WHL_PATH_REGEX}"
pip_install_first_match(torch_p, self.TORCH_WHL_EXTRA)
torch_whls_path = [
f"{str(params.torch_whls_path)}/{whl_path}"
for whl_path in self.TORCH_ADDITIONAL_WHLS_REGEX
]
for torch_whl in torch_whls_path:
pip_install_first_match(torch_whl)
logger.info("Done. Installed torch and other torch-related wheels ")
logger.info("Installing vllm wheels")
vllm_whls_path = [
f"{str(params.vllm_whls_path)}/{whl_path}"
for whl_path in self.VLLM_TEST_WHLS_REGEX
]
for vllm_whl in vllm_whls_path:
pip_install_first_match(vllm_whl)
logger.info("Done. Installed vllm wheels")
def _install_test_dependencies(self):
"""
This method replaces torch dependencies with local torch wheel info in
requirements/test.in file from the vllm repo, then generates test.txt
at runtime
"""
logger.info("generate test.txt from requirements/test.in with local torch whls")
preprocess_test_in()
copy("requirements/test.txt", "snapshot_constraint.txt")
run_command(
f"{sys.executable} -m uv pip compile requirements/test.in "
"-o test.txt "
"--index-strategy unsafe-best-match "
"--constraint snapshot_constraint.txt "
"--torch-backend cu128"
)
pip_install_packages(requirements="test.txt", prefer_uv=True)
logger.info("Done. installed requirements for test dependencies")
def _install_dependencies(self):
pip_install_packages(packages=["-e", "tests/vllm_test_utils"], prefer_uv=True)
pip_install_packages(packages=["hf_transfer"], prefer_uv=True)
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
# using script from vllm repo to remove all torch packages from requirements txt
run_python("use_existing_torch.py")
# install common packages
for requirements in ["requirements/common.txt", "requirements/build.txt"]:
pip_install_packages(
requirements=requirements,
prefer_uv=True,
)
# install test packages
self._install_test_dependencies()
def _set_envs(self, inputs: VllmTestParameters):
os.environ["TORCH_CUDA_ARCH_LIST"] = inputs.torch_cuda_arch_list
if not validate_cuda(get_env("TORCH_CUDA_ARCH_LIST")):
logger.warning(
"Missing supported TORCH_CUDA_ARCH_LIST. "
"Currently support TORCH_CUDA_ARCH_LIST env var "
"with supported arch [8.0, 8.9, 9.0]"
)
os.environ["HF_TOKEN"] = os.getenv("VLLM_TEST_HUGGING_FACE_TOKEN", "")
if not get_env("HF_TOKEN"):
raise ValueError(
"missing required HF_TOKEN, please set VLLM_TEST_HUGGING_FACE_TOKEN env var"
)
if not get_env("TORCH_CUDA_ARCH_LIST"):
raise ValueError(
"missing required TORCH_CUDA_ARCH_LIST, please set TORCH_CUDA_ARCH_LIST env var"
)
def preprocess_test_in(
target_file: str = "requirements/test.in", additional_packages: Iterable[str] = ()
):
"""
This modifies target_file in place in the vllm work directory.
It removes torch and other unwanted packages from target_file and replaces them with local torch wheel
packages in the format "$WHEEL_PACKAGE_NAME @ file://<LOCAL_PATH>"
"""
additional_package_to_move = list(additional_packages or ())
pkgs_to_remove = [
"torch",
"torchvision",
"torchaudio",
"xformers",
"mamba_ssm",
] + additional_package_to_move
# Read current requirements
target_path = Path(target_file)
lines = target_path.read_text().splitlines()
pkgs_to_add = []
# Remove lines starting with the package names (==, @, >=) — case-insensitive
pattern = re.compile(rf"^({'|'.join(pkgs_to_remove)})\s*(==|@|>=)", re.IGNORECASE)
kept_lines = [line for line in lines if not pattern.match(line)]
# Get local installed torch/vision/audio from pip freeze
# This is hacky, but it works
pip_freeze = subprocess.check_output(["pip", "freeze"], text=True)
header_lines = [
line
for line in pip_freeze.splitlines()
if re.match(
r"^(torch|torchvision|torchaudio)\s*@\s*file://", line, re.IGNORECASE
)
]
# Write back: header_lines + blank + kept_lines
out_lines = header_lines + [""] + kept_lines
if pkgs_to_add:
out_lines += [""] + pkgs_to_add
out = "\n".join(out_lines) + "\n"
target_path.write_text(out)
logger.info("[INFO] Updated %s", target_file)
def validate_cuda(value: str) -> bool:
VALID_VALUES = {"8.0", "8.9", "9.0"}
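# TORCH_CUDA_ARCH_LIST entries may be separated by whitespace or ';' (e.g. '8.0;9.0')
return all(v in VALID_VALUES for v in re.split(r"[;\s]+", value.strip()) if v)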
def check_versions():
"""
check installed packages version
"""
logger.info("Double check installed packages")
patterns = ["torch", "xformers", "torchvision", "torchaudio", "vllm"]
for pkg in patterns:
pkg_exists(pkg)
logger.info("Done. checked installed packages")
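
Note on preprocess_test_in above: the core of the rewrite is a regex filter over requirement lines followed by prepending the locally installed wheel pins. The sketch below reproduces that step on in-memory lines only, without touching files or calling pip. filter_requirements, the sample requirement lines, and the wheel path are hypothetical and chosen for illustration.

```python
import re


def filter_requirements(lines, pkgs_to_remove, local_pins):
    """Drop pinned entries for pkgs_to_remove and prepend local wheel pins."""
    names = "|".join(re.escape(p) for p in pkgs_to_remove)
    # Match "<name>==", "<name> @" or "<name>>=" at the start of a line, case-insensitively.
    pattern = re.compile(rf"^({names})\s*(==|@|>=)", re.IGNORECASE)
    kept = [line for line in lines if not pattern.match(line)]
    # Same layout as the original: local pins, a blank line, then the kept lines.
    return list(local_pins) + [""] + kept


reqs = ["torch==2.8.0", "torchvision>=0.23", "numpy", "pytest==7.3.2"]
pins = ["torch @ file:///tmp/dist/torch-example.whl"]  # hypothetical local wheel
result = filter_requirements(reqs, ["torch", "torchvision"], pins)
print("\n".join(result))
```

Entries without a version operator, such as the bare "numpy" line, are kept because the pattern requires one of `==`, `@`, or `>=` after the package name.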

.ci/lumen_cli/cli/run.py

@ -0,0 +1,40 @@
# run.py
import argparse
import logging
from cli.build_cli.register_build import register_build_commands
from cli.lib.common.logger import setup_logging
from cli.test_cli.register_test import register_test_commands
logger = logging.getLogger(__name__)
def main():
# Define top-level parser
parser = argparse.ArgumentParser(description="Lumen CLI")
subparsers = parser.add_subparsers(dest="command", required=True)
parser.add_argument(
"--log-level", default="INFO", help="Log level (DEBUG, INFO, WARNING, ERROR)"
)
# registers second-level subcommands
register_build_commands(subparsers)
register_test_commands(subparsers)
# parse args after all options are registered
args = parser.parse_args()
# setup global logging
setup_logging(getattr(logging, args.log_level.upper(), logging.INFO))
logger.debug("Parsed args: %s", args)
if hasattr(args, "func"):
args.func(args)
else:
parser.print_help()
if __name__ == "__main__":
main()


@ -0,0 +1,62 @@
import argparse
import logging
from cli.lib.common.cli_helper import register_targets, RichHelp, TargetSpec
from cli.lib.core.vllm.vllm_test import VllmTestRunner
logger = logging.getLogger(__name__)
# Maps targets to their argparse configuration and runner
# Each entry adds a target to the path `python -m cli.run test external {target}`, served by its test runner
_TARGETS: dict[str, TargetSpec] = {
"vllm": {
"runner": VllmTestRunner,
"help": "test vLLM with pytorch main",
}
# add yours ...
}
def common_args(parser: argparse.ArgumentParser) -> None:
"""
Add common CLI arguments to the given parser.
"""
parser.add_argument(
"--shard-id",
type=int,
default=1,
help="a single shard id to run (integer), e.g. '1'",
)
parser.add_argument(
"--num-shards",
type=int,
default=1,
help="a number of shards to run, e.g. '4'",
)
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument(
"-tp",
"--test-plan",
type=str,
help="a pre-defined test plan to run, e.g. 'basic_correctness_test'",
)
def register_test_commands(subparsers: argparse._SubParsersAction) -> None:
build_parser = subparsers.add_parser(
"test",
help="test related commands",
formatter_class=RichHelp,
)
build_subparsers = build_parser.add_subparsers(dest="test_command", required=True)
overview = "\n".join(
f" {name:12} {spec.get('help', '')}" for name, spec in _TARGETS.items()
)
external_parser = build_subparsers.add_parser(
"external",
help="Test external targets",
description="Test third-party targets.\n\nAvailable targets:\n" + overview,
formatter_class=RichHelp,
)
register_targets(external_parser, _TARGETS, common_args=common_args)
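
Note on the command layout registered above: the path `test external vllm` is built from nested argparse subparsers, and the entry point dispatches through `args.func(args)`. The sketch below shows that mechanism using only the standard library. It does not use register_targets, whose implementation is not shown in this diff; EchoRunner and build_example_parser are hypothetical names.

```python
import argparse


class EchoRunner:
    """Hypothetical runner: stores the parsed namespace and reports it when run."""

    def __init__(self, args):
        self.args = args

    def run(self):
        a = self.args
        return f"plan={a.test_plan} shard={a.shard_id}/{a.num_shards}"


def build_example_parser():
    parser = argparse.ArgumentParser(prog="cli.run")
    commands = parser.add_subparsers(dest="command", required=True)
    test = commands.add_parser("test")
    test_commands = test.add_subparsers(dest="test_command", required=True)
    external = test_commands.add_parser("external")
    targets = external.add_subparsers(dest="target", required=True)
    vllm = targets.add_parser("vllm")
    vllm.add_argument("--shard-id", type=int, default=1)
    vllm.add_argument("--num-shards", type=int, default=1)
    vllm.add_argument("-tp", "--test-plan", type=str, required=True)
    # Dispatch hook: the caller invokes args.func(args) after parsing.
    vllm.set_defaults(func=lambda ns: EchoRunner(ns).run())
    return parser


args = build_example_parser().parse_args(
    ["test", "external", "vllm", "--test-plan", "basic_correctness_test"]
)
print(args.func(args))
```

Attaching the runner via set_defaults keeps the top-level entry point free of per-target branching.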


@ -0,0 +1,23 @@
[project]
name = "lumen-ci"
version = "0.1.0"
dependencies = [
"pyyaml==6.0.2",
"GitPython==3.1.45",
"docker==7.1.0",
"pytest==7.3.2",
"uv==0.8.6"
]
[tool.setuptools]
packages = ["cli"]
[tool.setuptools.package-dir]
cli = "cli"
[tool.ruff.lint]
# Enable preview mode for linting
preview = true
# Now you can select your preview rules, like RUF048
extend-select = ["RUF048"]


@ -0,0 +1,47 @@
# tests/test_cli.py
import io
import sys
import unittest
from contextlib import redirect_stderr, redirect_stdout
from unittest.mock import patch
from cli.run import main
class TestArgparseCLI(unittest.TestCase):
@patch("cli.build_cli.register_build.VllmBuildRunner.run", return_value=None)
@patch("cli.build_cli.register_build.VllmBuildRunner.__init__", return_value=None)
def test_cli_run_build_external(self, mock_init, mock_run):
from cli.run import main # import after patches if needed
test_args = ["cli.run", "build", "external", "vllm"]
with patch.object(sys, "argv", test_args):
# argparse may call sys.exit on error; capture to avoid test aborts
try:
main()
except SystemExit:
pass
mock_init.assert_called_once() # got constructed
mock_run.assert_called_once_with() # run() called
def test_build_help(self):
test_args = ["cli.run", "build", "--help"]
with patch.object(sys, "argv", test_args):
stdout = io.StringIO()
stderr = io.StringIO()
# --help always raises SystemExit(0)
with self.assertRaises(SystemExit) as cm:
with redirect_stdout(stdout), redirect_stderr(stderr):
main()
self.assertEqual(cm.exception.code, 0)
output = stdout.getvalue()
self.assertIn("usage", output)
self.assertIn("external", output)
if __name__ == "__main__":
unittest.main()


@ -0,0 +1,115 @@
import argparse
import io
import unittest
from contextlib import redirect_stderr
from unittest.mock import patch
from cli.lib.common.cli_helper import BaseRunner, register_targets, RichHelp, TargetSpec
# ---- Dummy runners for unittests----
class FooRunner(BaseRunner):
"""Foo description from docstring."""
def run(self) -> None: # replaced by mock
pass
class BarRunner(BaseRunner):
def run(self) -> None: # replaced by mock
pass
def add_foo_args(p: argparse.ArgumentParser) -> None:
p.add_argument("--x", type=int, required=True, help="x value")
def common_args(p: argparse.ArgumentParser) -> None:
p.add_argument("--verbose", action="store_true", help="verbose flag")
def build_parser(specs: dict[str, TargetSpec]) -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(prog="app", formatter_class=RichHelp)
register_targets(
parser=parser,
target_specs=specs,
common_args=common_args,
)
return parser
def get_subparser(
parser: argparse.ArgumentParser, name: str
) -> argparse.ArgumentParser:
subparsers_action = next(
a
for a in parser._subparsers._group_actions # type: ignore[attr-defined]
if isinstance(a, argparse._SubParsersAction)
)
return subparsers_action.choices[name]
class TestRegisterTargets(unittest.TestCase):
def test_metavar_lists_targets(self):
specs: dict[str, TargetSpec] = {
"foo": {"runner": FooRunner, "add_arguments": add_foo_args},
"bar": {"runner": BarRunner},
}
parser = build_parser(specs)
subparsers_action = next(
a
for a in parser._subparsers._group_actions # type: ignore[attr-defined]
if isinstance(a, argparse._SubParsersAction)
)
self.assertEqual(subparsers_action.metavar, "{foo,bar}")
def test_add_arguments_and_common_args_present(self):
specs: dict[str, TargetSpec] = {
"foo": {"runner": FooRunner, "add_arguments": add_foo_args},
}
parser = build_parser(specs)
foo = get_subparser(parser, "foo")
help_text = foo.format_help()
self.assertIn("--x", help_text)
self.assertIn("--verbose", help_text)
def test_runner_constructed_with_ns_and_run_called(self):
specs: dict[str, TargetSpec] = {
"foo": {"runner": FooRunner, "add_arguments": add_foo_args},
}
parser = build_parser(specs)
with (
patch.object(FooRunner, "__init__", return_value=None) as mock_init,
patch.object(FooRunner, "run", return_value=None) as mock_run,
):
ns = parser.parse_args(["foo", "--x", "3", "--verbose"])
ns.func(ns) # set by register_targets
# __init__ received the Namespace
self.assertEqual(mock_init.call_count, 1)
(called_ns,), _ = mock_init.call_args
self.assertIsInstance(called_ns, argparse.Namespace)
# run() called with no args
mock_run.assert_called_once_with()
def test_runner_docstring_used_as_description_when_missing(self):
specs: dict[str, TargetSpec] = {
"foo": {"runner": FooRunner, "add_arguments": add_foo_args},
}
parser = build_parser(specs)
foo = get_subparser(parser, "foo")
help_text = foo.format_help()
self.assertIn("Foo description from docstring.", help_text)
def test_missing_target_raises_systemexit_with_usage(self):
specs: dict[str, TargetSpec] = {"foo": {"runner": FooRunner}}
parser = build_parser(specs)
buf = io.StringIO()
with self.assertRaises(SystemExit), redirect_stderr(buf):
parser.parse_args([])
err = buf.getvalue()
self.assertIn("usage:", err)
if __name__ == "__main__":
unittest.main()


@ -0,0 +1,75 @@
import unittest
from unittest import mock
from unittest.mock import MagicMock
import docker.errors as derr
from cli.lib.common.docker_helper import _get_client, local_image_exists
class TestDockerImageHelpers(unittest.TestCase):
def setUp(self):
# Reset the singleton in the target module
patcher = mock.patch("cli.lib.common.docker_helper._docker_client", None)
self.addCleanup(patcher.stop)
patcher.start()
def test_local_image_exists_true(self):
# Mock a docker client whose images.get returns an object (no exception)
mock_client = MagicMock()
mock_client.images.get.return_value = object()
ok = local_image_exists("repo:tag", client=mock_client)
self.assertTrue(ok)
def test_local_image_exists_not_found_false(self):
mock_client = MagicMock()
# Raise docker.errors.NotFound
mock_client.images.get.side_effect = derr.NotFound("nope")
ok = local_image_exists("missing:latest", client=mock_client)
self.assertFalse(ok)
def test_local_image_exists_api_error_false(self):
mock_client = MagicMock()
mock_client.images.get.side_effect = derr.APIError("boom", None)
ok = local_image_exists("broken:tag", client=mock_client)
self.assertFalse(ok)
def test_local_image_exists_uses_lazy_singleton(self):
# Patch docker.from_env used by _get_client()
with mock.patch(
"cli.lib.common.docker_helper.docker.from_env"
) as mock_from_env:
mock_docker_client = MagicMock()
mock_from_env.return_value = mock_docker_client
# First call should create and cache the client
c1 = _get_client()
self.assertIs(c1, mock_docker_client)
mock_from_env.assert_called_once()
# Second call should reuse cached client (no extra from_env calls)
c2 = _get_client()
self.assertIs(c2, mock_docker_client)
mock_from_env.assert_called_once() # still once
def test_local_image_exists_without_client_param_calls_get_client_once(self):
# Ensure _get_client is called and cached; local_image_exists should reuse it
with mock.patch("cli.lib.common.docker_helper._get_client") as mock_get_client:
mock_client = MagicMock()
mock_get_client.return_value = mock_client
# 1st call
local_image_exists("repo:tag")
# 2nd call
local_image_exists("repo:tag2")
# local_image_exists should call _get_client each time;
# _get_client itself caches the client created by docker.from_env.
self.assertEqual(mock_get_client.call_count, 2)
self.assertEqual(mock_client.images.get.call_count, 2)
mock_client.images.get.assert_any_call("repo:tag")
mock_client.images.get.assert_any_call("repo:tag2")
if __name__ == "__main__":
unittest.main()


@ -0,0 +1,149 @@
import os
import unittest
from dataclasses import dataclass
from pathlib import Path
from unittest.mock import patch
import cli.lib.common.envs_helper as m
class TestEnvHelpers(unittest.TestCase):
def setUp(self):
# Keep a copy of the original environment to restore later
self._env_backup = dict(os.environ)
def tearDown(self):
# Restore environment to original state
os.environ.clear()
os.environ.update(self._env_backup)
# -------- get_env --------
def test_get_env_unset_returns_default(self):
with patch.dict(os.environ, {}, clear=True):
self.assertEqual(m.get_env("FOO", "default"), "default")
def test_get_env_empty_returns_default(self):
with patch.dict(os.environ, {"FOO": ""}, clear=True):
self.assertEqual(m.get_env("FOO", "default"), "default")
def test_get_env_set_returns_value(self):
with patch.dict(os.environ, {"FOO": "bar"}, clear=True):
self.assertEqual(m.get_env("FOO", "default"), "bar")
def test_get_env_not_exist_returns_default(self):
with patch.dict(os.environ, {"FOO": "bar"}, clear=True):
self.assertEqual(m.get_env("TEST_NOT_EXIST", "default"), "default")
def test_get_env_not_exist_without_default(self):
with patch.dict(os.environ, {"FOO": "bar"}, clear=True):
self.assertEqual(m.get_env("TEST_NOT_EXIST"), "")
# -------- env_bool --------
def test_env_bool_uses_default_when_unset(self):
with patch.dict(os.environ, {}, clear=True):
self.assertTrue(m.env_bool("FLAG", default=True))
self.assertFalse(m.env_bool("FLAG", default=False))
def test_env_bool_uses_str2bool_when_set(self):
# Patch str2bool used by env_bool so we don't depend on its exact behavior
def fake_str2bool(s: str) -> bool:
return s.lower() in {"1", "true", "yes", "on", "y"}
with (
patch.dict(os.environ, {"FLAG": "yEs"}, clear=True),
patch.object(m, "str2bool", fake_str2bool),
):
self.assertTrue(m.env_bool("FLAG", default=False))
# -------- env_path_optional / env_path --------
def test_env_path_optional_unset_returns_none_by_default(self):
with patch.dict(os.environ, {}, clear=True):
self.assertIsNone(m.env_path_optional("P"))
def test_env_path_optional_unset_returns_none_when_env_var_is_empty(self):
with patch.dict(os.environ, {"P": ""}, clear=True):
self.assertIsNone(m.env_path_optional("P"))
def test_env_path_optional_unset_returns_default_str(self):
# default as string; resolve=True by default -> absolute path
default_str = "x/y"
with patch.dict(os.environ, {}, clear=True):
p = m.env_path_optional("P", default=default_str)
self.assertIsInstance(p, Path)
self.assertIsNotNone(p)
if p:
self.assertTrue(p.is_absolute())
self.assertEqual(p.parts[-2:], ("x", "y"))
def test_env_path_optional_unset_returns_default_path_no_resolve(self):
d = Path("z")
with patch.dict(os.environ, {}, clear=True):
p = m.env_path_optional("P", default=d, resolve=False)
self.assertEqual(p, d)
def test_env_path_optional_respects_resolve_true(self):
with patch.dict(os.environ, {"P": "a/b"}, clear=True):
p = m.env_path_optional("P", resolve=True)
self.assertIsInstance(p, Path)
if p:
self.assertTrue(p.is_absolute())
def test_env_path_optional_respects_resolve_false(self):
with patch.dict(os.environ, {"P": "rel/dir"}, clear=True):
p = m.env_path_optional("P", resolve=False)
self.assertEqual(p, Path("rel/dir"))
if p:
self.assertFalse(p.is_absolute())
def test_env_path_raises_when_missing_and_default_none(self):
with patch.dict(os.environ, {}, clear=True):
with self.assertRaises(ValueError):
m.env_path("P", None, resolve=True)
def test_env_path_returns_path_when_present(self):
tmp = Path("./b").resolve()
with patch.dict(os.environ, {"P": str(tmp)}, clear=True):
p = m.env_path("P", None, resolve=True)
self.assertEqual(p, tmp)
# -------- dataclass field helpers --------
def test_dataclass_fields_read_env_at_instantiation(self):
@dataclass
class Cfg:
flag: bool = m.env_bool_field("FLAG", default=False)
out: Path = m.env_path_field("OUT", default="ab", resolve=True)
name: str = m.env_str_field("NAME", default="anon")
# First instantiation
with patch.dict(
os.environ, {"FLAG": "true", "OUT": "outdir", "NAME": "alice"}, clear=True
):
cfg1 = Cfg()
self.assertTrue(cfg1.flag)
self.assertIsInstance(cfg1.out, Path)
self.assertTrue(cfg1.out.is_absolute())
self.assertEqual(cfg1.name, "alice")
cfg1.name = "bob" # change instance value
self.assertEqual(cfg1.name, "bob") # change is reflected
# Change env; new instance should reflect new values
with patch.dict(os.environ, {"FLAG": "false", "NAME": ""}, clear=True):
cfg2 = Cfg()
self.assertFalse(cfg2.flag) # str2bool("false") -> False
self.assertIn("ab", str(cfg2.out))
self.assertIsInstance(cfg2.out, Path)
self.assertTrue(cfg2.out.is_absolute())
self.assertEqual(cfg2.name, "anon") # empty -> fallback to default
def test_dataclass_path_field_with_default_value(self):
@dataclass
class C2:
out: Path = m.env_path_field("OUT", default="some/dir", resolve=False)
with patch.dict(os.environ, {}, clear=True):
c = C2()
self.assertEqual(c.out, Path("some/dir"))
if __name__ == "__main__":
unittest.main()
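
Note on the dataclass field helpers exercised by the tests above: reading the environment at instantiation time, with an empty value falling back to the default, can be achieved with `dataclasses.field(default_factory=...)`. The sketch below is one way to get that behavior and is not the actual envs_helper implementation, which is not shown in this diff. The sketch_* helpers, ExampleConfig, the EXAMPLE_* variable names, and the set of truthy strings are all assumptions.

```python
import os
from dataclasses import dataclass, field


def sketch_env_str_field(name: str, default: str = ""):
    # default_factory runs on every instantiation, so the env var is read late.
    # An unset or empty variable falls back to the default.
    return field(default_factory=lambda: os.environ.get(name) or default)


def sketch_env_bool_field(name: str, default: bool = False):
    def read() -> bool:
        raw = os.environ.get(name, "")
        if not raw:
            return default
        # Assumed set of truthy spellings; the real helper may differ.
        return raw.strip().lower() in {"1", "true", "yes", "on", "y"}

    return field(default_factory=read)


@dataclass
class ExampleConfig:
    name: str = sketch_env_str_field("EXAMPLE_NAME", "anon")
    verbose: bool = sketch_env_bool_field("EXAMPLE_VERBOSE", False)


os.environ["EXAMPLE_NAME"] = "alice"
os.environ["EXAMPLE_VERBOSE"] = "true"
first = ExampleConfig()

os.environ["EXAMPLE_NAME"] = ""  # empty value: fall back to the default
os.environ["EXAMPLE_VERBOSE"] = "false"
second = ExampleConfig()

print(first.name, first.verbose, second.name, second.verbose)
```

A plain `default=os.environ.get(...)` would instead capture the value once, when the class body is evaluated, which is why the factory form matters here.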


@ -0,0 +1,122 @@
# test_path_utils.py
# Run: pytest -q
import os
import unittest
from pathlib import Path
from tempfile import TemporaryDirectory
from cli.lib.common.path_helper import (
copy,
ensure_dir_exists,
force_create_dir,
get_path,
is_path_exist,
remove_dir,
)
class TestPathHelper(unittest.TestCase):
def setUp(self):
self.tmpdir = TemporaryDirectory()
self.tmp_path = Path(self.tmpdir.name)
def tearDown(self):
self.tmpdir.cleanup()
# -------- get_path --------
def test_get_path_returns_path_for_str(self):
# Use relative path to avoid absolute-ness
rel_str = "sub/f.txt"
os.chdir(self.tmp_path)
p = get_path(rel_str, resolve=False)
self.assertIsInstance(p, Path)
self.assertFalse(p.is_absolute())
self.assertEqual(str(p), rel_str)
def test_get_path_resolves(self):
rel_str = "sub/f.txt"
p = get_path(str(self.tmp_path / rel_str), resolve=True)
self.assertTrue(p.is_absolute())
self.assertTrue(str(p).endswith(rel_str))
def test_get_path_with_path_input(self):
p_in = self.tmp_path / "sub/f.txt"
p_out = get_path(p_in, resolve=False)
self.assertEqual(str(p_out), str(p_in))
def test_get_path_with_none_raises(self):
with self.assertRaises(ValueError):
get_path(None) # type: ignore[arg-type]
def test_get_path_invalid_type_raises(self):
with self.assertRaises(TypeError):
get_path(123) # type: ignore[arg-type]
# -------- ensure_dir_exists / force_create_dir / remove_dir --------
def test_ensure_dir_exists_creates_and_is_idempotent(self):
d = self.tmp_path / "made"
ensure_dir_exists(d)
self.assertTrue(d.exists() and d.is_dir())
ensure_dir_exists(d)
def test_force_create_dir_clears_existing(self):
d = self.tmp_path / "fresh"
(d / "inner").mkdir(parents=True)
(d / "inner" / "f.txt").write_text("x")
force_create_dir(d)
self.assertTrue(d.exists())
self.assertEqual(list(d.iterdir()), [])
def test_remove_dir_none_is_noop(self):
remove_dir(None) # type: ignore[arg-type]
def test_remove_dir_nonexistent_is_noop(self):
ghost = self.tmp_path / "ghost"
remove_dir(ghost)
def test_remove_dir_accepts_str(self):
d = self.tmp_path / "to_rm"
d.mkdir()
remove_dir(str(d))
self.assertFalse(d.exists())
# -------- copy --------
def test_copy_file_to_file(self):
src = self.tmp_path / "src.txt"
dst = self.tmp_path / "out" / "dst.txt"
src.write_text("hello")
copy(src, dst)
self.assertEqual(dst.read_text(), "hello")
def test_copy_dir_to_new_dir(self):
src = self.tmp_path / "srcdir"
(src / "a").mkdir(parents=True)
(src / "a" / "f.txt").write_text("content")
dst = self.tmp_path / "destdir"
copy(src, dst)
self.assertEqual((dst / "a" / "f.txt").read_text(), "content")
def test_copy_dir_into_existing_dir_overwrite_true_merges(self):
src = self.tmp_path / "srcdir"
dst = self.tmp_path / "destdir"
(src / "x").mkdir(parents=True)
(src / "x" / "new.txt").write_text("new")
dst.mkdir()
(dst / "existing.txt").write_text("old")
copy(src, dst)
self.assertEqual((dst / "existing.txt").read_text(), "old")
self.assertEqual((dst / "x" / "new.txt").read_text(), "new")
def test_is_str_path_exist(self):
p = self.tmp_path / "x.txt"
p.write_text("1")
self.assertTrue(is_path_exist(str(p)))
self.assertTrue(is_path_exist(p))
self.assertFalse(is_path_exist(str(self.tmp_path / "missing")))
self.assertFalse(is_path_exist(self.tmp_path / "missing"))
self.assertFalse(is_path_exist(""))
if __name__ == "__main__":
unittest.main()
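
The get_path tests above describe a small contract: str and Path inputs come back as a Path, None raises ValueError, and any other type raises TypeError. A minimal sketch that would satisfy those assertions follows. It is not the actual cli.lib.common.path_helper code, and the default of resolve=False is an assumption, since the tests either pass resolve explicitly or only exercise the error paths.

```python
from pathlib import Path


def get_path(path, resolve=False):
    """Sketch only: normalize a str or Path into a Path (the resolve default is assumed)."""
    if path is None:
        raise ValueError("path must not be None")
    if not isinstance(path, (str, Path)):
        raise TypeError(f"expected str or Path, got {type(path).__name__}")
    result = Path(path)
    # resolve() makes the path absolute; otherwise the input is kept as given
    return result.resolve() if resolve else result
```

Checking None before the isinstance test is what lets the two error cases raise different exception types, as the tests expect.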


@ -0,0 +1,185 @@
# tests/test_run_test_plan.py
import importlib
from contextlib import nullcontext
from types import SimpleNamespace
from unittest.mock import MagicMock
import pytest
MOD = "cli.lib.core.vllm.lib"
# We import inside tests so the MOD override above applies everywhere
run_test_plan_import_path = f"{MOD}.run_test_plan"
def _get_cmd(c):
# Support both kwargs and positional args
return c.kwargs.get("cmd", c.args[0] if c.args else None)
def _get_check(c):
if "check" in c.kwargs:
return c.kwargs["check"]
# If positional, assume second arg is 'check' when present; default False
return c.args[1] if len(c.args) > 1 else False
@pytest.fixture
def patch_module(monkeypatch):
"""
Patch helpers ('pip_install_packages', 'temp_environ', 'working_directory',
'run_command', 'logger') inside the target module and expose them.
"""
module = importlib.import_module(MOD)
# Create fakes/mocks
pip_install_packages = MagicMock(name="pip_install_packages")
run_command = MagicMock(name="run_command", return_value=0)
# temp_environ / working_directory: record calls but act as context managers
temp_calls: list[dict] = []
workdir_calls: list[str] = []
def fake_working_directory(path: str):
workdir_calls.append(path)
return nullcontext()
def fake_temp_env(map: dict[str, str]):
temp_calls.append(map)
return nullcontext()
logger = SimpleNamespace(
info=MagicMock(name="logger.info"),
error=MagicMock(name="logger.error"),
)
# Apply patches (raise if attribute doesn't exist)
monkeypatch.setattr(
module, "pip_install_packages", pip_install_packages, raising=True
)
monkeypatch.setattr(module, "run_command", run_command, raising=True)
monkeypatch.setattr(
module, "working_directory", fake_working_directory, raising=True
)
monkeypatch.setattr(module, "temp_environ", fake_temp_env, raising=True)
monkeypatch.setattr(module, "logger", logger, raising=True)
return SimpleNamespace(
module=module,
run_test_plan=module.run_test_plan, # expose to avoid getattr("constant") (Ruff B009)
pip_install_packages=pip_install_packages,
run_command=run_command,
temp_calls=temp_calls,
workdir_calls=workdir_calls,
logger=logger,
)
def test_success_runs_all_steps_and_uses_env_and_workdir(monkeypatch, patch_module):
run_test_plan = patch_module.run_test_plan
tests_map = {
"basic": {
"title": "Basic suite",
"package_install": [],
"working_directory": "tests",
"env_vars": {"GLOBAL_FLAG": "1"},
"steps": [
"export A=x && pytest -q",
"export B=y && pytest -q tests/unit",
],
}
}
# Queue exit codes for run_command; the plan has two steps, so the third value is unused
patch_module.run_command.side_effect = [0, 0, 0]
run_test_plan("basic", "cpu", tests_map)
calls = patch_module.run_command.call_args_list
cmds = [_get_cmd(c) for c in calls]
checks = [_get_check(c) for c in calls]
assert cmds == [
"export A=x && pytest -q",
"export B=y && pytest -q tests/unit",
]
assert all(chk is False for chk in checks)
assert patch_module.workdir_calls == ["tests"]
assert patch_module.temp_calls == [{"GLOBAL_FLAG": "1"}]
def test_installs_packages_when_present(monkeypatch, patch_module):
run_test_plan = patch_module.module.run_test_plan
tests_map = {
"with_pkgs": {
"title": "Needs deps",
"package_install": ["timm==1.0.0", "flash-attn"],
"steps": ["pytest -q"],
}
}
patch_module.run_command.return_value = 0
run_test_plan("with_pkgs", "gpu", tests_map)
patch_module.pip_install_packages.assert_called_once_with(
packages=["timm==1.0.0", "flash-attn"],
prefer_uv=True,
)
def test_raises_on_missing_plan(patch_module):
run_test_plan = patch_module.module.run_test_plan
with pytest.raises(RuntimeError) as ei:
run_test_plan("nope", "cpu", tests_map={})
assert "test nope not found" in str(ei.value)
def test_aggregates_failures_and_raises(monkeypatch, patch_module):
run_test_plan = patch_module.module.run_test_plan
tests_map = {
"mix": {
"title": "Some pass some fail",
"steps": [
"pytest test_a.py", # 0 → pass
"pytest test_b.py", # 1 → fail
"pytest test_c.py", # 2 → fail
],
}
}
# Simulate pass, fail, fail
patch_module.run_command.side_effect = [0, 1, 2]
with pytest.raises(RuntimeError) as ei:
run_test_plan("mix", "cpu", tests_map)
msg = str(ei.value)
assert "2 pytest runs failed" in msg
# Ensure logger captured failed tests list
patch_module.logger.error.assert_called_once()
# And we attempted all three commands
assert patch_module.run_command.call_count == 3
def test_custom_working_directory_used(patch_module):
run_test_plan = patch_module.module.run_test_plan
tests_map = {
"customwd": {
"title": "Custom wd",
"working_directory": "examples/ci",
"steps": ["pytest -q"],
}
}
patch_module.run_command.return_value = 0
run_test_plan("customwd", "cpu", tests_map)
assert patch_module.workdir_calls == ["examples/ci"]
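
The tests above fix three observable behaviours of run_test_plan: an unknown plan name raises RuntimeError, every step is attempted even after an earlier one fails, and a single RuntimeError reporting the number of failed steps is raised at the end. A rough sketch of that aggregation pattern is given below. The names run_steps and runner are hypothetical, the run_command stand-in is only a guess at what the real helper does, and package installation, temp_environ and working_directory are left out.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def run_command(cmd, check=False):
    """Hypothetical stand-in for the patched helper: run a shell command, return its exit code."""
    return subprocess.run(cmd, shell=True, check=check).returncode


def run_steps(name, tests_map, runner=run_command):
    """Run every step of a plan, then fail once with the number of failed steps."""
    if name not in tests_map:
        raise RuntimeError(f"test {name} not found")
    failures = []
    for step in tests_map[name]["steps"]:
        # check=False returns the exit code instead of raising, so later steps still run
        code = runner(cmd=step, check=False)
        if code != 0:
            failures.append(step)
    if failures:
        logger.error("Failed steps: %s", failures)
        raise RuntimeError(f"{len(failures)} pytest runs failed")
```

Passing the runner in as a parameter is a design choice of this sketch; the tests themselves reach the same effect by monkeypatching run_command on the module.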


@ -0,0 +1,143 @@
import os
import tempfile
import unittest
from pathlib import Path
from cli.lib.common.utils import temp_environ, working_directory
class EnvIsolatedTestCase(unittest.TestCase):
"""Base class that snapshots os.environ and CWD for isolation."""
def setUp(self):
import os
import tempfile
self._env_backup = dict(os.environ)
# Snapshot/repair CWD if it's gone
try:
self._cwd_backup = os.getcwd()
except FileNotFoundError:
# If CWD no longer exists, switch to a safe place and record that
self._cwd_backup = tempfile.gettempdir()
os.chdir(self._cwd_backup)
# Create a temporary directory for the test to run in
self._temp_dir = tempfile.mkdtemp()
os.chdir(self._temp_dir)
def tearDown(self):
import os
import shutil
import tempfile
# Restore cwd first (before cleaning up temp dir)
try:
os.chdir(self._cwd_backup)
except OSError:
os.chdir(tempfile.gettempdir())
# Clean up temporary directory
try:
shutil.rmtree(self._temp_dir, ignore_errors=True)
except Exception:
pass # Ignore cleanup errors
# Restore env
to_del = set(os.environ.keys()) - set(self._env_backup.keys())
for k in to_del:
os.environ.pop(k, None)
for k, v in self._env_backup.items():
os.environ[k] = v
class TestTempEnviron(EnvIsolatedTestCase):
def test_sets_and_restores_new_var(self):
var = "TEST_TMP_ENV_NEW"
self.assertNotIn(var, os.environ)
with temp_environ({var: "123"}):
self.assertEqual(os.environ[var], "123")
self.assertNotIn(var, os.environ) # removed after exit
def test_overwrites_and_restores_existing_var(self):
var = "TEST_TMP_ENV_OVERWRITE"
os.environ[var] = "orig"
with temp_environ({var: "override"}):
self.assertEqual(os.environ[var], "override")
self.assertEqual(os.environ[var], "orig") # restored
def test_multiple_vars_and_missing_cleanup(self):
v1, v2 = "TEST_ENV_V1", "TEST_ENV_V2"
os.environ.pop(v1, None)
os.environ[v2] = "keep"
with temp_environ({v1: "a", v2: "b"}):
self.assertEqual(os.environ[v1], "a")
self.assertEqual(os.environ[v2], "b")
self.assertNotIn(v1, os.environ) # newly-added -> removed
self.assertEqual(os.environ[v2], "keep") # pre-existing -> restored
def test_restores_even_on_exception(self):
var = "TEST_TMP_ENV_EXCEPTION"
self.assertNotIn(var, os.environ)
with self.assertRaises(RuntimeError):
with temp_environ({var: "x"}):
self.assertEqual(os.environ[var], "x")
raise RuntimeError("boom")
self.assertNotIn(var, os.environ) # removed after exception
class TestWorkingDirectory(EnvIsolatedTestCase):
def test_changes_and_restores(self):
start = Path.cwd()
with tempfile.TemporaryDirectory() as td:
target = Path(td) / "wd"
target.mkdir()
with working_directory(str(target)):
self.assertEqual(Path.cwd().resolve(), target.resolve())
self.assertEqual(Path.cwd(), start)
def test_noop_when_empty_path(self):
start = Path.cwd()
with working_directory(""):
self.assertEqual(Path.cwd(), start)
self.assertEqual(Path.cwd(), start)
def test_restores_on_exception(self):
start = Path.cwd()
with tempfile.TemporaryDirectory() as td:
target = Path(td) / "wd_exc"
target.mkdir()
with self.assertRaises(ValueError):
with working_directory(str(target)):
# Normalize both sides to handle /var -> /private/var
self.assertEqual(Path.cwd().resolve(), target.resolve())
raise ValueError("boom")
self.assertEqual(Path.cwd().resolve(), start.resolve())
def test_raises_for_missing_dir(self):
start = Path.cwd()
with tempfile.TemporaryDirectory() as td:
missing = Path(td) / "does_not_exist"
with self.assertRaises(FileNotFoundError):
# os.chdir should raise before yielding
with working_directory(str(missing)):
pass
self.assertEqual(Path.cwd(), start)
if __name__ == "__main__":
unittest.main(verbosity=2)
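
The two test classes above pin down the behaviour of temp_environ and working_directory: new variables are removed on exit, existing ones are restored, an empty path is a no-op, and a missing directory raises before the body runs. The following sketch shows one way to write context managers with that behaviour. It is an illustration inferred from the tests, not the code in cli.lib.common.utils.

```python
import os
from contextlib import contextmanager


@contextmanager
def temp_environ(updates):
    """Temporarily set environment variables, restoring the previous state on exit."""
    missing = object()  # sentinel meaning "variable was not set before"
    previous = {key: os.environ.get(key, missing) for key in updates}
    os.environ.update(updates)
    try:
        yield
    finally:
        for key, old in previous.items():
            if old is missing:
                os.environ.pop(key, None)  # newly added -> remove
            else:
                os.environ[key] = old  # pre-existing -> restore


@contextmanager
def working_directory(path):
    """Temporarily chdir into path; an empty path leaves the CWD untouched."""
    if not path:
        yield
        return
    previous = os.getcwd()
    os.chdir(path)  # raises FileNotFoundError here, before yielding, if path is missing
    try:
        yield
    finally:
        os.chdir(previous)
```

The try/finally around each yield is what makes the restore happen even when the body raises, which is the case the *_on_exception tests cover.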


@ -0,0 +1,176 @@
import os
import tempfile
import unittest
from pathlib import Path
from unittest.mock import MagicMock, patch
import cli.lib.core.vllm.vllm_build as vllm_build
_VLLM_BUILD_MODULE = "cli.lib.core.vllm.vllm_build"
class TestVllmBuildParameters(unittest.TestCase):
@patch(f"{_VLLM_BUILD_MODULE}.local_image_exists", return_value=True)
@patch(f"{_VLLM_BUILD_MODULE}.is_path_exist", return_value=True)
@patch(
"cli.lib.common.envs_helper.env_path_optional",
side_effect=lambda name, default=None, resolve=True: {
"DOCKERFILE_PATH": Path("/abs/vllm/Dockerfile"),
"TORCH_WHEELS_PATH": Path("/abs/dist"),
"OUTPUT_DIR": Path("/abs/shared"),
}.get(name, Path(default) if default is not None else None),
)
@patch.dict(
os.environ,
{
"USE_TORCH_WHEEL": "1",
"USE_LOCAL_BASE_IMAGE": "1",
"USE_LOCAL_DOCKERFILE": "1",
"BASE_IMAGE": "my/image:tag",
"DOCKERFILE_PATH": "vllm/Dockerfile",
"TORCH_WHEELS_PATH": "dist",
"OUTPUT_DIR": "shared",
},
clear=True,
)
def test_params_success_normalizes_and_validates(
self, mock_env_path, mock_is_path, mock_local_img
):
params = vllm_build.VllmBuildParameters()
self.assertEqual(params.torch_whls_path, Path("/abs/dist"))
self.assertEqual(params.dockerfile_path, Path("/abs/vllm/Dockerfile"))
self.assertEqual(params.output_dir, Path("/abs/shared"))
self.assertEqual(params.base_image, "my/image:tag")
@patch(f"{_VLLM_BUILD_MODULE}.is_path_exist", return_value=False)
@patch.dict(
os.environ, {"USE_TORCH_WHEEL": "1", "TORCH_WHEELS_PATH": "dist"}, clear=True
)
def test_params_missing_torch_whls_raises(self, _is_path):
with tempfile.TemporaryDirectory() as td:
os.chdir(td)
with self.assertRaises(ValueError) as cm:
vllm_build.VllmBuildParameters(
use_local_base_image=False,
use_local_dockerfile=False,
)
err = cm.exception
self.assertIn("TORCH_WHEELS_PATH", str(err))
@patch(f"{_VLLM_BUILD_MODULE}.local_image_exists", return_value=False)
@patch.dict(
os.environ, {"USE_LOCAL_BASE_IMAGE": "1", "BASE_IMAGE": "img:tag"}, clear=True
)
def test_params_missing_local_base_image_raises(self, _local_img):
with tempfile.TemporaryDirectory() as td:
os.chdir(td)
with self.assertRaises(ValueError) as cm:
vllm_build.VllmBuildParameters(
use_torch_whl=False,
use_local_dockerfile=False,
)
err = cm.exception
self.assertIn("BASE_IMAGE", str(err))
@patch(f"{_VLLM_BUILD_MODULE}.is_path_exist", return_value=False)
@patch.dict(
os.environ,
{"USE_LOCAL_DOCKERFILE": "1", "DOCKERFILE_PATH": "Dockerfile"},
clear=True,
)
def test_params_missing_dockerfile_raises(self, _is_path):
with tempfile.TemporaryDirectory() as td:
os.chdir(td)
with self.assertRaises(ValueError) as cm:
vllm_build.VllmBuildParameters(
use_torch_whl=False,
use_local_base_image=False,
)
err = cm.exception
self.assertIn("DOCKERFILE_PATH", str(err))
@patch(f"{_VLLM_BUILD_MODULE}.is_path_exist", return_value=False)
@patch.dict(
os.environ,
{"OUTPUT_DIR": ""},
clear=True,
)
def test_params_missing_output_dir(self, _is_path):
with self.assertRaises(FileNotFoundError):
vllm_build.VllmBuildParameters()
class TestBuildCmdAndRun(unittest.TestCase):
@patch(f"{_VLLM_BUILD_MODULE}.local_image_exists", return_value=True)
def test_generate_docker_build_cmd_includes_bits(self, _exists):
runner = vllm_build.VllmBuildRunner()
inputs = MagicMock()
inputs.output_dir = Path("/abs/out")
inputs.use_local_base_image = True
inputs.base_image = "img:tag"
inputs.torch_whls_path = Path("./vllm/tmp")
inputs.max_jobs = 64
inputs.cuda_version = "12.8.1"
inputs.python_version = "3.12"
inputs.sccache_bucket = "my-bucket"
inputs.sccache_region = "us-west-2"
inputs.torch_cuda_arch_list = "8.0;9.0"
inputs.target_stage = "export-wheels"
inputs.tag_name = "vllm-wheels"
cmd = runner._generate_docker_build_cmd(inputs)
squashed = " ".join(cmd.split())
self.assertIn("--output type=local,dest=/abs/out", squashed)
self.assertIn("-f docker/Dockerfile.nightly_torch", squashed)
self.assertIn("--pull=false", squashed)
self.assertIn("--build-arg TORCH_WHEELS_PATH=tmp", squashed)
self.assertIn("--build-arg BUILD_BASE_IMAGE=img:tag", squashed)
self.assertIn("--build-arg FINAL_BASE_IMAGE=img:tag", squashed)
self.assertIn("--build-arg max_jobs=64", squashed)
self.assertIn("--build-arg CUDA_VERSION=12.8.1", squashed)
self.assertIn("--build-arg PYTHON_VERSION=3.12", squashed)
self.assertIn("--build-arg USE_SCCACHE=1", squashed)
self.assertIn("--build-arg SCCACHE_BUCKET_NAME=my-bucket", squashed)
self.assertIn("--build-arg SCCACHE_REGION_NAME=us-west-2", squashed)
self.assertIn("--build-arg torch_cuda_arch_list='8.0;9.0'", squashed)
self.assertIn("--target export-wheels", squashed)
self.assertIn("-t vllm-wheels", squashed)
@patch(f"{_VLLM_BUILD_MODULE}.run_command")
@patch(f"{_VLLM_BUILD_MODULE}.ensure_dir_exists")
@patch(f"{_VLLM_BUILD_MODULE}.clone_vllm")
@patch.object(
vllm_build.VllmBuildRunner,
"_generate_docker_build_cmd",
return_value="docker buildx ...",
)
@patch.dict(
os.environ,
{
"USE_TORCH_WHEEL": "0",
"USE_LOCAL_BASE_IMAGE": "0",
"USE_LOCAL_DOCKERFILE": "0",
"OUTPUT_DIR": "shared",
},
clear=True,
)
def test_run_calls_clone_prepare_and_build(
self, mock_gen, mock_clone, mock_ensure, mock_run
):
params = MagicMock()
params.output_dir = Path("shared")
params.use_local_dockerfile = False
params.use_torch_whl = False
with patch(f"{_VLLM_BUILD_MODULE}.VllmBuildParameters", return_value=params):
runner = vllm_build.VllmBuildRunner()
runner.run()
mock_clone.assert_called_once()
mock_ensure.assert_called_once_with(Path("shared"))
mock_gen.assert_called_once_with(params)
mock_run.assert_called_once()
_, kwargs = mock_run.call_args
assert kwargs.get("cwd") == "vllm"


@ -16,6 +16,7 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
magma/build_magma.sh
.PHONY: all
all: magma-cuda130
all: magma-cuda129
all: magma-cuda128
all: magma-cuda126
@ -25,6 +26,12 @@ clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-cuda130
magma-cuda130: DESIRED_CUDA := 13.0
magma-cuda130: CUDA_ARCH_LIST := -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
magma-cuda130:
$(DOCKER_RUN)
.PHONY: magma-cuda129
magma-cuda129: DESIRED_CUDA := 12.9
magma-cuda129: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120


@ -28,6 +28,7 @@ pushd ${PACKAGE_DIR}/magma-${MAGMA_VERSION}
patch < ${PACKAGE_FILES}/CMake.patch
patch < ${PACKAGE_FILES}/cmakelists.patch
patch -p0 < ${PACKAGE_FILES}/thread_queue.patch
patch -p1 < ${PACKAGE_FILES}/cuda13.patch
patch -p1 < ${PACKAGE_FILES}/getrf_shfl.patch
patch -p1 < ${PACKAGE_FILES}/getrf_nbparam.patch
# The build.sh script expects to be executed from the sources root folder
@ -37,6 +38,7 @@ popd
# Package recipe, license and tarball
# Folder and package name are backward compatible for the build workflow
cp ${PACKAGE_FILES}/build.sh ${PACKAGE_RECIPE}/build.sh
cp ${PACKAGE_FILES}/cuda13.patch ${PACKAGE_RECIPE}/cuda13.patch
cp ${PACKAGE_FILES}/thread_queue.patch ${PACKAGE_RECIPE}/thread_queue.patch
cp ${PACKAGE_FILES}/cmakelists.patch ${PACKAGE_RECIPE}/cmakelists.patch
cp ${PACKAGE_FILES}/getrf_shfl.patch ${PACKAGE_RECIPE}/getrf_shfl.patch


@ -0,0 +1,26 @@
diff --git a/interface_cuda/interface.cpp b/interface_cuda/interface.cpp
index 73fed1b20..e77519bfe 100644
--- a/interface_cuda/interface.cpp
+++ b/interface_cuda/interface.cpp
@@ -438,14 +438,20 @@ magma_print_environment()
cudaDeviceProp prop;
err = cudaGetDeviceProperties( &prop, dev );
check_error( err );
+ #ifdef MAGMA_HAVE_CUDA
+#if CUDA_VERSION < 13000
printf( "%% device %d: %s, %.1f MHz clock, %.1f MiB memory, capability %d.%d\n",
dev,
prop.name,
prop.clockRate / 1000.,
+#else
+ printf( "%% device %d: %s, ??? MHz clock, %.1f MiB memory, capability %d.%d\n",
+ dev,
+ prop.name,
+#endif
prop.totalGlobalMem / (1024.*1024.),
prop.major,
prop.minor );
- #ifdef MAGMA_HAVE_CUDA
int arch = prop.major*100 + prop.minor*10;
if ( arch < MAGMA_CUDA_ARCH_MIN ) {
printf("\n"


@ -5,10 +5,6 @@ set -ex
SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
case "${GPU_ARCH_TYPE:-BLANK}" in
BLANK)
# Legacy behavior for CircleCI
bash "${SCRIPTPATH}/build_cuda.sh"
;;
cuda)
bash "${SCRIPTPATH}/build_cuda.sh"
;;


@ -138,28 +138,11 @@ fi
echo "Calling setup.py bdist at $(date)"
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
echo "Calling setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 \
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
CMAKE_FRESH=1 python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
else
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
fi
echo "Finished setup.py bdist at $(date)"
# Build libtorch packages
@ -272,10 +255,6 @@ ls /tmp/$WHEELHOUSE_DIR
mkdir -p "/$WHEELHOUSE_DIR"
mv /tmp/$WHEELHOUSE_DIR/torch*linux*.whl /$WHEELHOUSE_DIR/
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
mv /tmp/$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/ || true
fi
if [[ -n "$BUILD_PYTHONLESS" ]]; then
mkdir -p /$LIBTORCH_HOUSE_DIR
mv /tmp/$LIBTORCH_HOUSE_DIR/*.zip /$LIBTORCH_HOUSE_DIR
@ -452,16 +431,8 @@ if [[ -z "$BUILD_PYTHONLESS" ]]; then
pushd $PYTORCH_ROOT/test
# Install the wheel for this Python version
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pip uninstall -y "$TORCH_NO_PYTHON_PACKAGE_NAME" || true
fi
pip uninstall -y "$TORCH_PACKAGE_NAME"
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pip install "$TORCH_NO_PYTHON_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v
fi
pip install "$TORCH_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v
# Print info on the libraries installed in this wheel


@ -66,6 +66,9 @@ case ${CUDA_VERSION} in
TORCH_CUDA_ARCH_LIST="7.5;8.0;9.0;10.0;12.0+PTX"
fi
;;
13.0)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX"
;;
12.6)
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6;9.0"
;;
@ -110,13 +113,18 @@ DEPS_SONAME=(
)
# CUDA_VERSION 12.6, 12.8, 12.9
if [[ $CUDA_VERSION == 12* ]]; then
# CUDA_VERSION 12.*, 13.*
if [[ $CUDA_VERSION == 12* || $CUDA_VERSION == 13* ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
# Compress the fatbin with -compress-mode=size for CUDA 13
if [[ $CUDA_VERSION == 13* ]]; then
export TORCH_NVCC_FLAGS="$TORCH_NVCC_FLAGS -compress-mode=size"
fi
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcudnn_adv.so.9"
"/usr/local/cuda/lib64/libcudnn_cnn.so.9"
@ -126,15 +134,11 @@ if [[ $CUDA_VERSION == 12* ]]; then
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9"
"/usr/local/cuda/lib64/libcudnn.so.9"
"/usr/local/cuda/lib64/libcublas.so.12"
"/usr/local/cuda/lib64/libcublasLt.so.12"
"/usr/local/cuda/lib64/libcusparseLt.so.0"
"/usr/local/cuda/lib64/libcudart.so.12"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/lib64/libnvrtc-builtins.so"
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12"
"/usr/local/cuda/lib64/libnvshmem_host.so.3"
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so"
)
DEPS_SONAME+=(
@ -146,41 +150,83 @@ if [[ $CUDA_VERSION == 12* ]]; then
"libcudnn_engines_precompiled.so.9"
"libcudnn_heuristic.so.9"
"libcudnn.so.9"
"libcublas.so.12"
"libcublasLt.so.12"
"libcusparseLt.so.0"
"libcudart.so.12"
"libnvrtc.so.12"
"libnvrtc-builtins.so"
"libnvshmem_host.so.3"
"libcufile.so.0"
"libcufile_rdma.so.1"
"libcupti.so.12"
"libnvperf_host.so"
)
# Add libnvToolsExt only if CUDA version is not 12.9
if [[ $CUDA_VERSION != 12.9* ]]; then
DEPS_LIST+=("/usr/local/cuda/lib64/libnvToolsExt.so.1")
DEPS_SONAME+=("libnvToolsExt.so.1")
if [[ $CUDA_VERSION == 13* ]]; then
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcublas.so.13"
"/usr/local/cuda/lib64/libcublasLt.so.13"
"/usr/local/cuda/lib64/libcudart.so.13"
"/usr/local/cuda/lib64/libnvrtc.so.13"
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.13"
"/usr/local/cuda/lib64/libibverbs.so.1"
"/usr/local/cuda/lib64/librdmacm.so.1"
"/usr/local/cuda/lib64/libmlx5.so.1"
"/usr/local/cuda/lib64/libnl-3.so.200"
"/usr/local/cuda/lib64/libnl-route-3.so.200")
DEPS_SONAME+=(
"libcublas.so.13"
"libcublasLt.so.13"
"libcudart.so.13"
"libnvrtc.so.13"
"libcupti.so.13"
"libibverbs.so.1"
"librdmacm.so.1"
"libmlx5.so.1"
"libnl-3.so.200"
"libnl-route-3.so.200")
export USE_CUPTI_SO=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUFILE=0
else
DEPS_LIST+=(
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libcublas.so.12"
"/usr/local/cuda/lib64/libcublasLt.so.12"
"/usr/local/cuda/lib64/libcudart.so.12"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12")
DEPS_SONAME+=(
"libnvToolsExt.so.1"
"libcublas.so.12"
"libcublasLt.so.12"
"libcudart.so.12"
"libnvrtc.so.12"
"libcupti.so.12")
fi
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cudnn/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../nvidia/cusparselt/lib'
'$ORIGIN/../../cusparselt/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvshmem/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
'$ORIGIN/../../nvidia/cufile/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/cusparselt/lib'
)
if [[ $CUDA_VERSION == 13* ]]; then
CUDA_RPATHS+=('$ORIGIN/../../nvidia/cu13/lib')
else
CUDA_RPATHS+=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../cusparselt/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
'$ORIGIN/../../nvidia/cufile/lib'
)
fi
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
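
The last three lines of this hunk collapse the CUDA_RPATHS array into one colon-separated string (setting IFS to ":" makes "${CUDA_RPATHS[*]}" join on colons) and then append the $ORIGIN entries. The same string construction, written in Python for illustration, looks like the sketch below. The join_rpaths helper is hypothetical and the three entries are just a sample taken from the lists above.

```python
def join_rpaths(entries, suffix):
    """Join RPATH entries with ':' and append a fixed suffix, mirroring the shell lines."""
    return ":".join(entries) + suffix


# Sample entries only; the script builds a longer list depending on CUDA_VERSION
cuda_rpaths = [
    "$ORIGIN/../../nvidia/cudnn/lib",
    "$ORIGIN/../../nvidia/nccl/lib",
    "$ORIGIN/../../nvidia/cu13/lib",
]

c_so_rpath = join_rpaths(cuda_rpaths, ":$ORIGIN:$ORIGIN/lib")
lib_so_rpath = join_rpaths(cuda_rpaths, ":$ORIGIN")
print(c_so_rpath)
print(lib_so_rpath)
```

In the shell script the entries are single-quoted so that $ORIGIN reaches the linker literally instead of being expanded by bash; the Python strings carry it through unchanged for the same reason.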


@ -194,7 +194,7 @@ ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library
ROCBLAS_LIB_DST=lib/rocblas/library
ROCBLAS_ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)
ROCBLAS_OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)
ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $OTHER_FILES)
ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $ROCBLAS_OTHER_FILES)
# hipblaslt library files
HIPBLASLT_LIB_SRC=$ROCM_HOME/lib/hipblaslt/library


@ -25,6 +25,7 @@ source /opt/intel/oneapi/mpi/latest/env/vars.sh
export USE_STATIC_MKL=1
export USE_ONEMKL=1
export USE_XCCL=1
export USE_MPI=0
WHEELHOUSE_DIR="wheelhousexpu"
LIBTORCH_HOUSE_DIR="libtorch_housexpu"


@ -50,9 +50,6 @@ if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export ATEN_THREADING=NATIVE
fi
# Enable LLVM dependency for TensorExpr testing
export USE_LLVM=/opt/llvm
export LLVM_DIR=/opt/llvm/lib/cmake/llvm
if ! which conda; then
# In ROCm CIs, we are doing cross compilation on build machines with
@ -95,6 +92,27 @@ if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
export ACL_ROOT_DIR=/ComputeLibrary
fi
if [[ "$BUILD_ENVIRONMENT" == *riscv64* ]]; then
if [[ -f /opt/riscv-cross-env/bin/activate ]]; then
# shellcheck disable=SC1091
source /opt/riscv-cross-env/bin/activate
else
echo "Activation file not found"
exit 1
fi
export CMAKE_CROSSCOMPILING=TRUE
export CMAKE_SYSTEM_NAME=Linux
export CMAKE_SYSTEM_PROCESSOR=riscv64
export USE_CUDA=0
export USE_MKLDNN=0
export SLEEF_TARGET_EXEC_USE_QEMU=ON
sudo chown -R jenkins /var/lib/jenkins/workspace /opt
fi
if [[ "$BUILD_ENVIRONMENT" == *libtorch* ]]; then
POSSIBLE_JAVA_HOMES=()
POSSIBLE_JAVA_HOMES+=(/usr/local)
@ -155,6 +173,7 @@ if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
source /opt/intel/oneapi/mpi/latest/env/vars.sh
# Enable XCCL build
export USE_XCCL=1
export USE_MPI=0
# XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA
export USE_KINETO=0
export TORCH_XPU_ARCH_LIST=pvc
@ -176,8 +195,16 @@ fi
# We only build FlashAttention files for CUDA compute capability 8.0+; they require
# large amounts of memory to build and can OOM, so the job count is capped
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]]; then
export BUILD_CUSTOM_STEP="ninja -C build flash_attention -j 2"
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && echo "${TORCH_CUDA_ARCH_LIST}" | tr ' ' '\n' | sed 's/$/>= 8.0/' | bc | grep -q 1; then
J=2 # default to 2 jobs
case "$RUNNER" in
linux.12xlarge.memory|linux.24xlarge.memory)
J=24
;;
esac
echo "Building FlashAttention with job limit $J"
export BUILD_CUSTOM_STEP="ninja -C build flash_attention -j ${J}"
fi
if [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then
@ -192,7 +219,6 @@ if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then
export USE_ASAN=1
export REL_WITH_DEB_INFO=1
export UBSAN_FLAGS="-fno-sanitize-recover=all"
unset USE_LLVM
fi
if [[ "${BUILD_ENVIRONMENT}" == *no-ops* ]]; then
@ -213,7 +239,7 @@ fi
# Do not change workspace permissions for ROCm and s390x CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /var/lib/jenkins/workspace ]]; then
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *riscv64* && -d /var/lib/jenkins/workspace ]]; then
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
cleanup_workspace() {
@ -258,29 +284,19 @@ else
# XLA test build fails when WERROR=1
# set only when building other architectures
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *xla* && "$BUILD_ENVIRONMENT" != *riscv64* ]]; then
# Install numpy-2.0.2 for builds which are backward compatible with 1.X
python -mpip install numpy==2.0.2
WERROR=1 python setup.py clean
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
python3 tools/packaging/split_wheel.py bdist_wheel
else
WERROR=1 python setup.py bdist_wheel
fi
WERROR=1 python setup.py bdist_wheel
else
python setup.py clean
if [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then
source .ci/pytorch/install_cache_xla.sh
fi
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
echo "USE_SPLIT_BUILD cannot be used with xla or rocm"
exit 1
else
python setup.py bdist_wheel
fi
python setup.py bdist_wheel
fi
pip_install_whl "$(echo dist/*.whl)"
@ -405,7 +421,7 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];
# don't do this for libtorch as libtorch is C++ only and thus won't have python tests run on its build
python tools/stats/export_test_times.py
fi
# don't do this for bazel or s390x as they don't use sccache
if [[ "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then
# don't do this for bazel or s390x or riscv64 as they don't use sccache
if [[ "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *riscv64* && "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then
print_sccache_stats
fi
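
The FlashAttention guard earlier in this file's diff splits TORCH_CUDA_ARCH_LIST on spaces, asks bc whether each entry is >= 8.0, and enables the custom build step if any comparison prints 1. It then picks the ninja job count from RUNNER. The Python sketch below restates that logic for readability. The function names are hypothetical, and it assumes plain numeric entries such as "7.5 8.6"; suffixes like "+PTX" are not handled.

```python
LARGE_MEMORY_RUNNERS = {"linux.12xlarge.memory", "linux.24xlarge.memory"}


def any_arch_at_least(arch_list, minimum=8.0):
    """True if any whitespace-separated entry of arch_list is >= minimum."""
    return any(float(arch) >= minimum for arch in arch_list.split())


def flash_attention_jobs(runner):
    """Job limit for the flash_attention target: 24 on large-memory runners, else 2."""
    return 24 if runner in LARGE_MEMORY_RUNNERS else 2


if any_arch_at_least("7.5 8.6"):
    jobs = flash_attention_jobs("linux.12xlarge.memory")
    build_custom_step = f"ninja -C build flash_attention -j {jobs}"
    print(build_custom_step)
```

Checking each entry separately matters when the list holds more than one architecture: a single combined comparison, as in the older one-line bc form, only makes sense for a single value.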


@ -67,7 +67,7 @@ fi
# wheels with cxx11-abi
echo "Checking that the gcc ABI is what we expect"
if [[ "$(uname)" != 'Darwin' ]]; then
if [[ "$(uname)" != 'Darwin' && "$(uname -m)" != "s390x" ]]; then
# We also check that there are cxx11 symbols in libtorch
#
echo "Checking that symbols in libtorch.so have the right gcc abi"
@ -300,24 +300,3 @@ except RuntimeError as e:
exit 1
fi
fi
###############################################################################
# Check for C++ ABI compatibility to GCC-11 - GCC 13
###############################################################################
if [[ "$(uname)" == 'Linux' && "$PACKAGE_TYPE" == 'manywheel' ]]; then
pushd /tmp
# Per https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html
# gcc-11 is ABI16, gcc-13 is ABI18, gcc-14 is ABI19
# gcc 11 - CUDA 11.8, xpu, rocm
# gcc 13 - CUDA 12.6, 12.8 and cpu
# Please see issue for reference: https://github.com/pytorch/pytorch/issues/152426
if [[ "$(uname -m)" == "s390x" ]]; then
cxx_abi="19"
elif [[ "$DESIRED_CUDA" != 'xpu' && "$DESIRED_CUDA" != 'rocm'* ]]; then
cxx_abi="18"
else
cxx_abi="16"
fi
python -c "import torch; exit(0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi10${cxx_abi}' else 1)"
popd
fi
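
The ABI block in the hunk above (which, going by the 24-to-3 line counts in its header, this change drops) chooses an expected ABI number from the machine type and DESIRED_CUDA and compares it with torch._C._PYBIND11_BUILD_ABI. The selection logic alone can be sketched in Python as follows. The function name is hypothetical, and the sketch only builds the expected string; it does not import torch.

```python
def expected_pybind11_abi(machine, desired_cuda):
    """Return the ABI string the shell check compared against, e.g. '_cxxabi1018'."""
    if machine == "s390x":
        cxx_abi = "19"
    elif desired_cuda != "xpu" and not desired_cuda.startswith("rocm"):
        cxx_abi = "18"
    else:
        cxx_abi = "16"
    return f"_cxxabi10{cxx_abi}"


# "cu128" and "rocm6.4" are sample DESIRED_CUDA values chosen for illustration
print(expected_pybind11_abi("x86_64", "cu128"))
print(expected_pybind11_abi("x86_64", "rocm6.4"))
```

The s390x branch is tested first, so it wins regardless of the DESIRED_CUDA value, matching the order of the if/elif chain in the script.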


@ -149,6 +149,19 @@ function get_pinned_commit() {
cat .github/ci_commit_pins/"${1}".txt
}
function detect_cuda_arch() {
if [[ "${BUILD_ENVIRONMENT}" == *cuda* ]]; then
if command -v nvidia-smi; then
TORCH_CUDA_ARCH_LIST=$(nvidia-smi --query-gpu=compute_cap --format=csv | tail -n 1)
elif [[ "${TEST_CONFIG}" == *nogpu* ]]; then
# There won't be nvidia-smi in nogpu tests, so just set TORCH_CUDA_ARCH_LIST to the default
# minimum supported value here
TORCH_CUDA_ARCH_LIST=8.0
fi
export TORCH_CUDA_ARCH_LIST
fi
}
function install_torchaudio() {
local commit
commit=$(get_pinned_commit audio)
@ -229,7 +242,6 @@ function install_torchrec_and_fbgemm() {
pip_install tabulate # needed for newer fbgemm
pip_install patchelf # needed for rocm fbgemm
pushd /tmp
local wheel_dir=dist/fbgemm_gpu
local found_whl=0
@ -245,7 +257,7 @@ function install_torchrec_and_fbgemm() {
if [ "${found_whl}" == "0" ]; then
git clone --recursive https://github.com/pytorch/fbgemm
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}"
git checkout "${fbgemm_commit}" --recurse-submodules
python setup.py bdist_wheel \
--build-variant=rocm \
-DHIP_ROOT_DIR="${ROCM_PATH}" \
@ -264,7 +276,6 @@ function install_torchrec_and_fbgemm() {
done
rm -rf fbgemm
popd
else
pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec
pip_build_and_install "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#subdirectory=fbgemm_gpu" dist/fbgemm_gpu
@ -273,7 +284,7 @@ function install_torchrec_and_fbgemm() {
function clone_pytorch_xla() {
if [[ ! -d ./xla ]]; then
git clone --recursive --quiet https://github.com/pytorch/xla.git
git clone --recursive -b r2.9 https://github.com/pytorch/xla.git
pushd xla
# pin the xla hash so that we don't get broken by changes to xla
git checkout "$(cat ../.github/ci_commit_pins/xla.txt)"
@ -283,30 +294,6 @@ function clone_pytorch_xla() {
fi
}
function checkout_install_torchbench() {
local commit
commit=$(get_pinned_commit torchbench)
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout "$commit"
if [ "$1" ]; then
python install.py --continue_on_fail models "$@"
else
# Occasionally the installation may fail on one model but it is ok to continue
# to install and test other models
python install.py --continue_on_fail
fi
# TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488
# is regressing speedup metric. This needs to be investigated further
pip install transformers==4.38.1
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd
}
function install_torchao() {
local commit
commit=$(get_pinned_commit torchao)

@ -58,7 +58,7 @@ time python tools/setup_helpers/generate_code.py \
# Build the docs
pushd docs/cpp
time make VERBOSE=1 html -j
time make VERBOSE=1 html
popd
popd

@ -157,6 +157,34 @@ test_jit_hooks() {
assert_git_not_dirty
}
# Shellcheck doesn't like it when you pass no arguments to a function
# that can take args. See https://www.shellcheck.net/wiki/SC2120
# shellcheck disable=SC2120
checkout_install_torchbench() {
local commit
commit=$(cat .ci/docker/ci_commit_pins/torchbench.txt)
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout "$commit"
if [ "$1" ]; then
python install.py --continue_on_fail models "$@"
else
# Occasionally the installation may fail on one model but it is ok to continue
# to install and test other models
python install.py --continue_on_fail
fi
popd
pip install -r .ci/docker/ci_commit_pins/huggingface-requirements.txt
# https://github.com/pytorch/pytorch/issues/160689 to remove torchao because
# its current version 0.12.0 doesn't work with transformers 4.54.0
pip uninstall -y torchao
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
}
torchbench_setup_macos() {
git clone --recursive https://github.com/pytorch/vision torchvision
git clone --recursive https://github.com/pytorch/audio torchaudio
@ -167,7 +195,7 @@ torchbench_setup_macos() {
git checkout "$(cat ../.github/ci_commit_pins/vision.txt)"
git submodule update --init --recursive
python setup.py clean
python setup.py develop
python -m pip install -e . -v --no-build-isolation
popd
pushd torchaudio
@ -176,11 +204,9 @@ torchbench_setup_macos() {
git submodule update --init --recursive
python setup.py clean
# TODO: Remove me once we figure out how to make TorchAudio find brew-installed OpenMP
USE_OPENMP=0 python setup.py develop
USE_OPENMP=0 python -m pip install -e . -v --no-build-isolation
popd
# Shellcheck doesn't like it when you pass no arguments to a function that can take args. See https://www.shellcheck.net/wiki/SC2120
# shellcheck disable=SC2119,SC2120
checkout_install_torchbench
}
@ -276,6 +302,47 @@ test_torchbench_smoketest() {
fi
done
echo "Pytorch benchmark on mps device completed"
}
test_aoti_torchbench_smoketest() {
print_cmake_info
echo "Launching AOTInductor torchbench setup"
pip_benchmark_deps
# shellcheck disable=SC2119,SC2120
torchbench_setup_macos
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local device=mps
local dtypes=(undefined float16 bfloat16 notset)
local dtype=${dtypes[$1]}
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor timm_resnet timm_vovnet vgg16)
echo "Launching torchbench inference performance run for AOT Inductor and dtype ${dtype}"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
dtype_arg="--float32"
fi
touch "$TEST_REPORTS_DIR/aot_inductor_torchbench_${dtype}_inference_${device}_performance.csv"
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --export-aot-inductor --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/aot_inductor_torchbench_${dtype}_inference_${device}_performance.csv" || true
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--accuracy --only "$model" --export-aot-inductor --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/aot_inductor_torchbench_${dtype}_inference_${device}_accuracy.csv" || true
done
echo "Launching HuggingFace inference performance run for AOT Inductor and dtype ${dtype}"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--performance --export-aot-inductor --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/aot_inductor_huggingface_${dtype}_inference_${device}_performance.csv" || true
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--accuracy --export-aot-inductor --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/aot_inductor_huggingface_${dtype}_inference_${device}_accuracy.csv" || true
echo "Pytorch benchmark on mps device completed"
}
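The shard-to-dtype selection in `test_aoti_torchbench_smoketest` can be sketched in Python as follows. This is a minimal illustration, not part of the change; `dtype_for_shard` is a hypothetical name and the tuple simply copies the `dtypes` array from the shell function.

```python
DTYPES = ("undefined", "float16", "bfloat16", "notset")


def dtype_for_shard(shard_number: int) -> tuple[str, str]:
    """Return (dtype label used in report names, CLI flag passed to the benchmark)."""
    dtype = DTYPES[shard_number]
    # "notset" keeps its label in file names but runs the benchmark in float32.
    dtype_arg = "--float32" if dtype == "notset" else f"--{dtype}"
    return dtype, dtype_arg


if __name__ == "__main__":
    for shard in (1, 2, 3):
        print(shard, dtype_for_shard(shard))
```

One difference from the shell version: an out-of-range index raises `IndexError` here, whereas a bash array lookup would expand to an empty string.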
@ -324,6 +391,8 @@ elif [[ $TEST_CONFIG == *"perf_timm"* ]]; then
test_timm_perf
elif [[ $TEST_CONFIG == *"perf_smoketest"* ]]; then
test_torchbench_smoketest "${SHARD_NUMBER}"
elif [[ $TEST_CONFIG == *"aot_inductor_perf_smoketest"* ]]; then
test_aoti_torchbench_smoketest "${SHARD_NUMBER}"
elif [[ $TEST_CONFIG == *"mps"* ]]; then
test_python_mps
elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then

@ -45,6 +45,7 @@ if [[ "${SHARD_NUMBER:-2}" == "2" ]]; then
# DTensor tests
time python test/run_test.py --verbose -i distributed/tensor/test_random_ops
time python test/run_test.py --verbose -i distributed/tensor/test_dtensor_compile
time python test/run_test.py --verbose -i distributed/tensor/test_utils.py
# DeviceMesh test
time python test/run_test.py --verbose -i distributed/test_device_mesh

@ -0,0 +1,25 @@
From 6e08c9d08e9de59c7af28b720289debbbd384764 Mon Sep 17 00:00:00 2001
From: Michael Wang <13521008+isVoid@users.noreply.github.com>
Date: Tue, 1 Apr 2025 17:28:05 -0700
Subject: [PATCH] Avoid bumping certain driver API to avoid future breakage
(#185)
Co-authored-by: isVoid <isVoid@users.noreply.github.com>
---
numba_cuda/numba/cuda/cudadrv/driver.py | 3 +++
1 file changed, 3 insertions(+)
diff --git a/numba_cuda/numba/cuda/cudadrv/driver.py b/numba_cuda/numba/cuda/cudadrv/driver.py
index 1641bf77..233e9ed7 100644
--- a/numba_cuda/numba/cuda/cudadrv/driver.py
+++ b/numba_cuda/numba/cuda/cudadrv/driver.py
@@ -365,6 +365,9 @@ def _find_api(self, fname):
else:
variants = ('_v2', '')
+ if fname in ("cuCtxGetDevice", "cuCtxSynchronize"):
+ return getattr(self.lib, fname)
+
for variant in variants:
try:
return getattr(self.lib, f'{fname}{variant}')

@ -32,6 +32,9 @@ LIBTORCH_NAMESPACE_LIST = (
"torch::",
)
# Patterns for detecting statically linked libstdc++ symbols
STATICALLY_LINKED_CXX11_ABI = [re.compile(r".*recursive_directory_iterator.*")]
def _apply_libtorch_symbols(symbols):
return [
@ -53,12 +56,17 @@ def get_symbols(lib: str) -> list[tuple[str, str, str]]:
return [x.split(" ", 2) for x in lines.decode("latin1").split("\n")[:-1]]
def grep_symbols(lib: str, patterns: list[Any]) -> list[str]:
def grep_symbols(
lib: str, patterns: list[Any], symbol_type: str | None = None
) -> list[str]:
def _grep_symbols(
symbols: list[tuple[str, str, str]], patterns: list[Any]
) -> list[str]:
rc = []
for _s_addr, _s_type, s_name in symbols:
# Filter by symbol type if specified
if symbol_type and _s_type != symbol_type:
continue
for pattern in patterns:
if pattern.match(s_name):
rc.append(s_name)
@ -80,6 +88,18 @@ def grep_symbols(lib: str, patterns: list[Any]) -> list[str]:
return functools.reduce(list.__add__, (x.result() for x in tasks), [])
def check_lib_statically_linked_libstdc_cxx_abi_symbols(lib: str) -> None:
cxx11_statically_linked_symbols = grep_symbols(
lib, STATICALLY_LINKED_CXX11_ABI, symbol_type="T"
)
num_statically_linked_symbols = len(cxx11_statically_linked_symbols)
print(f"num_statically_linked_symbols (T): {num_statically_linked_symbols}")
if num_statically_linked_symbols > 0:
raise RuntimeError(
f"Found statically linked libstdc++ symbols (recursive_directory_iterator): {cxx11_statically_linked_symbols[:100]}"
)
def check_lib_symbols_for_abi_correctness(lib: str) -> None:
print(f"lib: {lib}")
cxx11_symbols = grep_symbols(lib, LIBTORCH_CXX11_PATTERNS)
@ -107,6 +127,7 @@ def main() -> None:
libtorch_cpu_path = str(install_root / "lib" / "libtorch_cpu.so")
check_lib_symbols_for_abi_correctness(libtorch_cpu_path)
check_lib_statically_linked_libstdc_cxx_abi_symbols(libtorch_cpu_path)
if __name__ == "__main__":
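The effect of the new `symbol_type` filter can be shown with a short, self-contained sketch. It works on hand-written rows in the `(address, type, name)` shape that `get_symbols` returns and does not call `nm`; `filter_symbols` and the sample rows are hypothetical, and the early `break` after the first matching pattern is an assumption about the elided part of the loop.

```python
import re
from typing import Optional

STATICALLY_LINKED_CXX11_ABI = [re.compile(r".*recursive_directory_iterator.*")]


def filter_symbols(
    symbols: list[tuple[str, str, str]],
    patterns: list[re.Pattern],
    symbol_type: Optional[str] = None,
) -> list[str]:
    """Return names matching any pattern, optionally restricted to one nm type letter."""
    matched = []
    for _addr, s_type, s_name in symbols:
        # Skip rows of a different type when a type filter is given.
        if symbol_type and s_type != symbol_type:
            continue
        for pattern in patterns:
            if pattern.match(s_name):
                matched.append(s_name)
                break  # assumption: one hit per symbol is enough
    return matched


if __name__ == "__main__":
    rows = [
        ("0000000000001000", "T", "std::filesystem::recursive_directory_iterator::pop()"),
        ("0000000000000000", "U", "std::filesystem::recursive_directory_iterator::pop()"),
        ("0000000000002000", "T", "torch::empty()"),
    ]
    print(filter_symbols(rows, STATICALLY_LINKED_CXX11_ABI, symbol_type="T"))
```

In `nm` output, `T` marks a symbol defined in the text section and `U` an undefined reference, so restricting to `T` counts only definitions that were linked into the library itself.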

@ -32,6 +32,16 @@ if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /v
git config --global --add safe.directory /var/lib/jenkins/workspace
fi
# Patch numba to avoid CUDA-13 crash, see https://github.com/pytorch/pytorch/issues/162878
NUMBA_CUDA_DIR=$(python -c "import os;import numba.cuda; print(os.path.dirname(numba.cuda.__file__))" 2>/dev/null || true)
if [ -n "$NUMBA_CUDA_DIR" ]; then
NUMBA_PATCH="$(dirname "$(realpath "${BASH_SOURCE[0]}")")/numba-cuda-13.patch"
pushd "$NUMBA_CUDA_DIR"
patch -p4 <"$NUMBA_PATCH"
popd
fi
echo "Environment variables:"
env
@ -91,6 +101,7 @@ if [[ "$BUILD_ENVIRONMENT" == *clang9* || "$BUILD_ENVIRONMENT" == *xpu* ]]; then
export VALGRIND=OFF
fi
detect_cuda_arch
if [[ "$BUILD_ENVIRONMENT" == *s390x* ]]; then
# There are additional warnings on s390x, maybe due to newer gcc.
@ -495,6 +506,14 @@ test_inductor_cpp_wrapper_shard() {
-k 'take' \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
if [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
python test/run_test.py \
--include inductor/test_mkldnn_pattern_matcher \
-k 'xpu' \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
fi
}
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
@ -627,6 +646,8 @@ test_perf_for_dashboard() {
device=cuda_a10g
elif [[ "${TEST_CONFIG}" == *h100* ]]; then
device=cuda_h100
elif [[ "${TEST_CONFIG}" == *b200* ]]; then
device=cuda_b200
elif [[ "${TEST_CONFIG}" == *rocm* ]]; then
device=rocm
fi
@ -801,6 +822,16 @@ test_dynamo_benchmark() {
if [[ "${TEST_CONFIG}" == *perf_compare* ]]; then
test_single_dynamo_benchmark "training" "$suite" "$shard_id" --training --amp "$@"
elif [[ "${TEST_CONFIG}" == *perf* ]]; then
# TODO (huydhn): Just smoke test some sample models
if [[ "${TEST_CONFIG}" == *b200* ]]; then
if [[ "${suite}" == "huggingface" ]]; then
export TORCHBENCH_ONLY_MODELS="DistillGPT2"
elif [[ "${suite}" == "timm_models" ]]; then
export TORCHBENCH_ONLY_MODELS="inception_v3"
elif [[ "${suite}" == "torchbench" ]]; then
export TORCHBENCH_ONLY_MODELS="hf_Bert"
fi
fi
test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"
else
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
@ -1039,20 +1070,10 @@ test_libtorch_api() {
mkdir -p $TEST_REPORTS_DIR
OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" "$TORCH_BIN_DIR"/test_api --gtest_filter='-IMethodTest.*' --gtest_output=xml:$TEST_REPORTS_DIR/test_api.xml
"$TORCH_BIN_DIR"/test_tensorexpr --gtest_output=xml:$TEST_REPORTS_DIR/test_tensorexpr.xml
else
# Exclude IMethodTest that relies on torch::deploy, which will instead be run in test_deploy
OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_api -k "not IMethodTest"
# On s390x, pytorch is built without llvm.
# Even if it were built with llvm, llvm currently doesn't support the features used on s390x, and
# the test fails with errors like:
# JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer
# unknown file: Failure
# C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }
if [[ "${BUILD_ENVIRONMENT}" != *s390x* ]]; then
python test/run_test.py --cpp --verbose -i cpp/test_tensorexpr
fi
fi
# quantization is not fully supported on s390x yet
@ -1603,6 +1624,25 @@ test_operator_benchmark() {
--expected "expected_ci_operator_benchmark_eager_float32_cpu.csv"
}
test_operator_microbenchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
TEST_DIR=$(pwd)
cd benchmarks/operator_benchmark/pt_extension
python -m pip install .
cd "${TEST_DIR}"/benchmarks/operator_benchmark
for OP_BENCHMARK_TESTS in matmul mm addmm bmm; do
$TASKSET python -m pt.${OP_BENCHMARK_TESTS}_test --tag-filter long \
--output-json-for-dashboard "${TEST_REPORTS_DIR}/operator_microbenchmark_${OP_BENCHMARK_TESTS}_compile.json" \
--benchmark-name "PyTorch operator microbenchmark" --use-compile
$TASKSET python -m pt.${OP_BENCHMARK_TESTS}_test --tag-filter long \
--output-json-for-dashboard "${TEST_REPORTS_DIR}/operator_microbenchmark_${OP_BENCHMARK_TESTS}.json" \
--benchmark-name "PyTorch operator microbenchmark"
done
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
(cd test && python -c "import torch; print(torch.__config__.show())")
@ -1627,6 +1667,10 @@ elif [[ "${TEST_CONFIG}" == *xla* ]]; then
install_torchvision
build_xla
test_xla
elif [[ "$TEST_CONFIG" == *vllm* ]]; then
echo "vLLM CI uses TORCH_CUDA_ARCH_LIST: $TORCH_CUDA_ARCH_LIST"
(cd .ci/lumen_cli && python -m pip install -e .)
python -m cli.run test external vllm --test-plan "$TEST_CONFIG" --shard-id "$SHARD_NUMBER" --num-shards "$NUM_TEST_SHARDS"
elif [[ "${TEST_CONFIG}" == *executorch* ]]; then
test_executorch
elif [[ "$TEST_CONFIG" == 'jit_legacy' ]]; then
@ -1653,6 +1697,8 @@ elif [[ "${TEST_CONFIG}" == *operator_benchmark* ]]; then
test_operator_benchmark cpu ${TEST_MODE}
fi
elif [[ "${TEST_CONFIG}" == *operator_microbenchmark* ]]; then
test_operator_microbenchmark
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then
@ -1672,54 +1718,40 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then
elif [[ "${TEST_CONFIG}" == cachebench ]]; then
install_torchaudio
install_torchvision
checkout_install_torchbench nanogpt BERT_pytorch resnet50 hf_T5 llama moco
PYTHONPATH=$(pwd)/torchbench test_cachebench
PYTHONPATH=/torchbench test_cachebench
elif [[ "${TEST_CONFIG}" == verify_cachebench ]]; then
install_torchaudio
install_torchvision
checkout_install_torchbench nanogpt
PYTHONPATH=$(pwd)/torchbench test_verify_cachebench
PYTHONPATH=/torchbench test_verify_cachebench
elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
install_torchaudio
install_torchvision
install_torchao
id=$((SHARD_NUMBER-1))
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
checkout_install_torchbench hf_Bert hf_Albert timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
PYTHONPATH=/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \
llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \
functorch_maml_omniglot yolov3 mobilenet_v2 resnext50_32x4d densenet121 mnasnet1_0
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf
PYTHONPATH=/torchbench test_inductor_torchbench_cpu_smoketest_perf
elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then
checkout_install_torchbench
TORCHBENCHPATH=$(pwd)/torchbench test_torchbench_gcp_smoketest
TORCHBENCHPATH=/torchbench test_torchbench_gcp_smoketest
else
checkout_install_torchbench
# Do this after checkout_install_torchbench to ensure we clobber any
# nightlies that torchbench may pull in
if [[ "${TEST_CONFIG}" != *cpu* ]]; then
install_torchrec_and_fbgemm
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"
PYTHONPATH=/torchbench test_dynamo_benchmark torchbench "$id"
fi
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then
install_torchvision
PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"
PYTHONPATH=/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"
if [[ "$SHARD_NUMBER" -eq "1" ]]; then
test_inductor_aoti
fi
elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
test_inductor_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
if [[ "${BUILD_ENVIRONMENT}" != linux-jammy-py3.9-gcc11-build ]]; then
test_inductor_distributed
fi
fi
elif [[ "${TEST_CONFIG}" == *einops* ]]; then
test_einops
elif [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then

@ -61,9 +61,10 @@ if "%USE_XPU%"=="1" (
call "C:\Program Files (x86)\Intel\oneAPI\compiler\latest\env\vars.bat"
call "C:\Program Files (x86)\Intel\oneAPI\ocloc\latest\env\vars.bat"
if errorlevel 1 exit /b 1
:: Reduce build time. Only have MTL self-hosted runner now
SET TORCH_XPU_ARCH_LIST=xe-lpg
SET USE_KINETO=0
:: Reduce build time
SET TORCH_XPU_ARCH_LIST=bmg
:: Re-setup python env for build
call pip install -r requirements.txt
)
@echo on
@ -136,7 +137,7 @@ sccache --show-stats
python -c "import os, glob; os.system('python -mpip install --no-index --no-deps ' + glob.glob('dist/*.whl')[0])"
(
if "%BUILD_ENVIRONMENT%"=="" (
echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3` in Command Prompt before running Git Bash.
echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_ROOT_DIR%\Scripts\activate.bat %CONDA_ROOT_DIR%\envs\py_tmp` in Command Prompt before running Git Bash.
) else (
copy /Y "dist\*.whl" "%PYTORCH_FINAL_PACKAGE_DIR%"

@ -3,12 +3,12 @@ if "%BUILD_ENVIRONMENT%"=="" (
) else (
set CONDA_PARENT_DIR=C:\Jenkins
)
set CONDA_ROOT_DIR=%CONDA_PARENT_DIR%\Miniconda3
:: Be conservative here when rolling out the new AMI with conda. This will try
:: to install conda as before if it couldn't find the conda installation. This
:: can be removed eventually after we gain enough confidence in the AMI
if not exist %CONDA_PARENT_DIR%\Miniconda3 (
if not exist %CONDA_ROOT_DIR% (
set INSTALL_FRESH_CONDA=1
)
@ -17,10 +17,14 @@ if "%INSTALL_FRESH_CONDA%"=="1" (
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
%TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /AddToPath=0 /D=%CONDA_PARENT_DIR%\Miniconda3
%TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /AddToPath=0 /D=%CONDA_ROOT_DIR%
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
)
:: Activate conda so that we can use its commands, i.e. conda, python, pip
call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3
call %CONDA_ROOT_DIR%\Scripts\activate.bat %CONDA_ROOT_DIR%
:: Activate the py_tmp conda environment
call conda activate py_tmp
call pip install -r .ci/docker/requirements-ci.txt

@ -14,7 +14,7 @@ if not errorlevel 0 exit /b
:: build\torch. Rather than changing all these references, making a copy of torch folder
:: from conda to the current workspace is easier. The workspace will be cleaned up after
:: the job anyway
xcopy /s %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %TMP_DIR_WIN%\build\torch\
xcopy /s %CONDA_ROOT_DIR%\envs\py_tmp\Lib\site-packages\torch %TMP_DIR_WIN%\build\torch\
pushd .
if "%VC_VERSION%" == "" (

@ -38,13 +38,20 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
fi
# TODO: Move both of them to Windows AMI
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1
python -m pip install tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1
# Copied from https://github.com/pytorch/test-infra/blob/be01a40157c36cd5a48391fdf44a7bc3ebd4c7e3/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1#L16 with some adjustments
# pytest-rerunfailures==10.3 as 10.2 fails with INTERNALERROR> pluggy._manager.PluginValidationError: unknown hook 'pytest_configure_node'
# scipy from 1.6.3 to 1.10
# expecttest from 0.1.3 to 0.3.0
# xdoctest from 1.0.2 to 1.3.0
python -m pip install "future==0.18.2" "hypothesis==5.35.1" "expecttest==0.3.0" "librosa>=0.6.2" "scipy==1.10.1" "psutil==5.9.1" "pynvml==11.4.1" "pillow==9.2.0" "unittest-xml-reporting<=3.2.0,>=2.0.0" "pytest==7.1.3" "pytest-xdist==2.5.0" "pytest-flakefinder==1.1.0" "pytest-rerunfailures==10.3" "pytest-shard==0.1.2" "sympy==1.11.1" "xdoctest==1.3.0" "pygments==2.12.0" "opt-einsum>=3.3" "networkx==2.8.8" "mpmath==1.2.1" "pytest-cpp==2.3.0" "boto3==1.35.42"
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver==4.15.1.0
# Install tlparse for test\dynamo\test_structured_trace.py UTs.
python -m pip install tlparse==0.3.30
python -m pip install tlparse==0.4.0
# Install parameterized
python -m pip install parameterized==0.8.1
@ -52,9 +59,6 @@ python -m pip install parameterized==0.8.1
# Install pulp for testing ilps under torch\distributed\_tools
python -m pip install pulp==2.9.0
# Install expecttest to merge https://github.com/pytorch/pytorch/pull/155308
python -m pip install expecttest==0.3.0
run_tests() {
# Run nvidia-smi if available
for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do

@ -37,7 +37,7 @@ IF "%CUDA_PATH_V126%"=="" (
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=6.1;7.0;7.5;8.0;8.6;9.0
set TORCH_CUDA_ARCH_LIST=5.0;6.0;6.1;7.0;7.5;8.0;8.6;9.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90

@ -37,10 +37,10 @@ IF "%CUDA_PATH_V128%"=="" (
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=6.1;7.0;7.5;8.0;8.6;9.0;10.0;12.0
set TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120
)
set "CUDA_PATH=%CUDA_PATH_V128%"

@ -0,0 +1,59 @@
@echo off
set MODULE_NAME=pytorch
IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (
call internal\clone.bat
cd %~dp0
) ELSE (
call internal\clean.bat
)
IF ERRORLEVEL 1 goto :eof
call internal\check_deps.bat
IF ERRORLEVEL 1 goto :eof
REM Check for optional components
set USE_CUDA=
set CMAKE_GENERATOR=Visual Studio 15 2017 Win64
IF "%NVTOOLSEXT_PATH%"=="" (
IF EXIST "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64\nvToolsExt64_1.lib" (
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
) ELSE (
echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing
exit /b 1
)
)
IF "%CUDA_PATH_V130%"=="" (
IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0\bin\nvcc.exe" (
set "CUDA_PATH_V130=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0"
) ELSE (
echo CUDA 13.0 not found, failing
exit /b 1
)
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=7.5;8.0;8.6;9.0;10.0;12.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120
)
set "CUDA_PATH=%CUDA_PATH_V130%"
set "PATH=%CUDA_PATH_V130%\bin;%PATH%"
:optcheck
call internal\check_opts.bat
IF ERRORLEVEL 1 goto :eof
if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..
call %~dp0\internal\copy.bat
IF ERRORLEVEL 1 goto :eof
call %~dp0\internal\setup.bat
IF ERRORLEVEL 1 goto :eof

@ -1,12 +1,20 @@
copy "%CUDA_PATH%\bin\cusparse*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cublas*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cudart*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\curand*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cufft*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cusolver*64_*.dll*" pytorch\torch\lib
if %CUDA_VERSION% geq 130 (
set "dll_path=bin\x64"
) else (
set "dll_path=bin"
)
copy "%CUDA_PATH%\%dll_path%\cusparse*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\cublas*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\cudart*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\curand*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\cufft*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\cusolver*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\nvrtc*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\nvJitLink_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cudnn*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\nvrtc*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\extras\CUPTI\lib64\cupti64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\extras\CUPTI\lib64\nvperf_host*.dll*" pytorch\torch\lib
@ -20,8 +28,3 @@ copy "%libuv_ROOT%\bin\uv.dll" pytorch\torch\lib
if exist "C:\Windows\System32\zlibwapi.dll" (
copy "C:\Windows\System32\zlibwapi.dll" pytorch\torch\lib
)
:: Copy the nvJitLink dll, which is required for CUDA 12+
if exist "%CUDA_PATH%\bin\nvJitLink_*.dll*" (
copy "%CUDA_PATH%\bin\nvJitLink_*.dll*" pytorch\torch\lib
)

@ -26,6 +26,7 @@ if exist "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%
if %CUDA_VER% EQU 126 goto cuda126
if %CUDA_VER% EQU 128 goto cuda128
if %CUDA_VER% EQU 129 goto cuda129
if %CUDA_VER% EQU 130 goto cuda130
echo CUDA %CUDA_VERSION_STR% is not supported
exit /b 1
@ -113,6 +114,33 @@ xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"
goto cuda_common
:cuda130
set CUDA_INSTALL_EXE=cuda_13.0.0_windows.exe
if not exist "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" (
curl -k -L "https://ossci-windows.s3.amazonaws.com/%CUDA_INSTALL_EXE%" --output "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" & REM @lint-ignore
if errorlevel 1 exit /b 1
set "CUDA_SETUP_FILE=%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"
set "ARGS="
)
set CUDNN_FOLDER=cudnn-windows-x86_64-9.12.0.46_cuda13-archive
set CUDNN_LIB_FOLDER="lib"
set "CUDNN_INSTALL_ZIP=%CUDNN_FOLDER%.zip"
if not exist "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" (
curl -k -L "http://s3.amazonaws.com/ossci-windows/%CUDNN_INSTALL_ZIP%" --output "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" & REM @lint-ignore
if errorlevel 1 exit /b 1
set "CUDNN_SETUP_FILE=%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"
)
@REM cuDNN 8.3+ requires zlib to be installed on the path
echo Installing ZLIB dlls
curl -k -L "http://s3.amazonaws.com/ossci-windows/zlib123dllx64.zip" --output "%SRC_DIR%\temp_build\zlib123dllx64.zip"
7z x "%SRC_DIR%\temp_build\zlib123dllx64.zip" -o"%SRC_DIR%\temp_build\zlib"
xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"
goto cuda_common
:cuda_common
:: NOTE: We only install CUDA if we don't have it installed already.
:: With GHA runners these should be pre-installed as part of our AMI process

@ -1,9 +1,9 @@
set WIN_DRIVER_VN=528.89
set "DRIVER_DOWNLOAD_LINK=https://ossci-windows.s3.amazonaws.com/%WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe" & REM @lint-ignore
curl --retry 3 -kL %DRIVER_DOWNLOAD_LINK% --output %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe
set WIN_DRIVER_VN=580.88
set "DRIVER_DOWNLOAD_LINK=https://ossci-windows.s3.amazonaws.com/%WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe" & REM @lint-ignore
curl --retry 3 -kL %DRIVER_DOWNLOAD_LINK% --output %WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe
if errorlevel 1 exit /b 1
start /wait %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe -s -noreboot
start /wait %WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe -s -noreboot
if errorlevel 1 exit /b 1
del %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe || ver > NUL
del %WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe || ver > NUL

@ -1,12 +1,22 @@
set ADDITIONAL_OPTIONS=""
set PYTHON_EXEC="python"
if "%DESIRED_PYTHON%" == "3.13t" (
echo Python version is set to 3.13t
set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.13.0/python-3.13.0-amd64.exe"
set ADDITIONAL_OPTIONS="Include_freethreaded=1"
set PYTHON_EXEC="python3.13t"
) else if "%DESIRED_PYTHON%"=="3.14" (
echo Python version is set to 3.14 or 3.14t
set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.14.0/python-3.14.0rc1-amd64.exe"
) else if "%DESIRED_PYTHON%"=="3.14t" (
echo Python version is set to 3.14 or 3.14t
set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.14.0/python-3.14.0rc1-amd64.exe"
set ADDITIONAL_OPTIONS="Include_freethreaded=1"
set PYTHON_EXEC="python3.14t"
) else (
echo DESIRED_PYTHON not defined, Python version is set to %DESIRED_PYTHON%
echo Python version is set to %DESIRED_PYTHON%
set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/%DESIRED_PYTHON%.0/python-%DESIRED_PYTHON%.0-amd64.exe" %= @lint-ignore =%
)

@ -13,9 +13,9 @@ if not exist "%SRC_DIR%\temp_build" mkdir "%SRC_DIR%\temp_build"
:xpu_bundle_install_start
set XPU_BUNDLE_PARENT_DIR=C:\Program Files (x86)\Intel\oneAPI
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d6d6c17-ca2d-4735-9331-99447e4a1280/intel-deep-learning-essentials-2025.0.1.28_offline.exe
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/75d4eb97-914a-4a95-852c-7b9733d80f74/intel-deep-learning-essentials-2025.1.3.8_offline.exe
set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.deep-learning-essentials.product
set XPU_BUNDLE_VERSION=2025.0.1+20
set XPU_BUNDLE_VERSION=2025.1.3+5
set XPU_BUNDLE_INSTALLED=0
set XPU_BUNDLE_UNINSTALL=0
set XPU_EXTRA_URL=NULL
@ -24,9 +24,9 @@ set XPU_EXTRA_VERSION=2025.0.1+1226
set XPU_EXTRA_INSTALLED=0
set XPU_EXTRA_UNINSTALL=0
if not [%XPU_VERSION%]==[] if [%XPU_VERSION%]==[2025.1] (
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/75d4eb97-914a-4a95-852c-7b9733d80f74/intel-deep-learning-essentials-2025.1.3.8_offline.exe
set XPU_BUNDLE_VERSION=2025.1.3+5
if not [%XPU_VERSION%]==[] if [%XPU_VERSION%]==[2025.2] (
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/24751ead-ddc5-4479-b9e6-f9fe2ff8b9f2/intel-deep-learning-essentials-2025.2.1.25_offline.exe
set XPU_BUNDLE_VERSION=2025.2.1+20
)
:: Check if XPU bundle is target version or already installed
@ -90,14 +90,3 @@ if errorlevel 1 exit /b 1
del xpu_extra.exe
:xpu_install_end
if not "%XPU_ENABLE_KINETO%"=="1" goto install_end
:: Install Level Zero SDK
set XPU_EXTRA_LZ_URL=https://github.com/oneapi-src/level-zero/releases/download/v1.14.0/level-zero-sdk_1.14.0.zip
curl -k -L %XPU_EXTRA_LZ_URL% --output "%SRC_DIR%\temp_build\level_zero_sdk.zip"
echo "Installing level zero SDK..."
7z x "%SRC_DIR%\temp_build\level_zero_sdk.zip" -o"%SRC_DIR%\temp_build\level_zero"
set "INCLUDE=%SRC_DIR%\temp_build\level_zero\include;%INCLUDE%"
del "%SRC_DIR%\temp_build\level_zero_sdk.zip"
:install_end

View File

@ -7,6 +7,8 @@ call "internal\install_python.bat"
%PYTHON_EXEC% --version
set "PATH=%CD%\Python\Lib\site-packages\cmake\data\bin;%CD%\Python\Scripts;%CD%\Python;%PATH%"
if "%DESIRED_PYTHON%" == "3.14t" %PYTHON_EXEC% -m pip install numpy==2.3.2 cmake
if "%DESIRED_PYTHON%" == "3.14" %PYTHON_EXEC% -m pip install numpy==2.3.2 cmake
if "%DESIRED_PYTHON%" == "3.13t" %PYTHON_EXEC% -m pip install numpy==2.2.1 cmake
if "%DESIRED_PYTHON%" == "3.13" %PYTHON_EXEC% -m pip install numpy==2.1.2 cmake
if "%DESIRED_PYTHON%" == "3.12" %PYTHON_EXEC% -m pip install numpy==2.0.2 cmake

View File

@ -124,20 +124,31 @@ popd
export TH_BINARY_BUILD=1
export INSTALL_TEST=0 # don't install test binaries into site-packages
export MACOSX_DEPLOYMENT_TARGET=10.15
export MACOSX_DEPLOYMENT_TARGET=11.0
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
SETUPTOOLS_PINNED_VERSION="==70.1.0"
PYYAML_PINNED_VERSION="=5.3"
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
RENAME_WHEEL=true
case $desired_python in
3.14t)
echo "Using 3.14t deps"
NUMPY_PINNED_VERSION="==2.1.0"
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
RENAME_WHEEL=false
;;
3.14)
echo "Using 3.14 deps"
NUMPY_PINNED_VERSION="==2.1.0"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
RENAME_WHEEL=false
;;
3.13t)
echo "Using 3.13 deps"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=6.0.1"
NUMPY_PINNED_VERSION="=2.1.0"
NUMPY_PINNED_VERSION="==2.1.0"
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"
desired_python="3.13"
@ -145,37 +156,23 @@ case $desired_python in
;;
3.13)
echo "Using 3.13 deps"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=6.0.1"
NUMPY_PINNED_VERSION="=2.1.0"
NUMPY_PINNED_VERSION="==2.1.0"
;;
3.12)
echo "Using 3.12 deps"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=6.0.1"
NUMPY_PINNED_VERSION="=2.0.2"
NUMPY_PINNED_VERSION="==2.0.2"
;;
3.11)
echo "Using 3.11 deps"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=5.3"
NUMPY_PINNED_VERSION="=2.0.2"
NUMPY_PINNED_VERSION="==2.0.2"
;;
3.10)
echo "Using 3.10 deps"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=5.3"
NUMPY_PINNED_VERSION="=2.0.2"
;;
3.9)
echo "Using 3.9 deps"
SETUPTOOLS_PINNED_VERSION=">=70.1.0"
PYYAML_PINNED_VERSION=">=5.3"
NUMPY_PINNED_VERSION="=2.0.2"
NUMPY_PINNED_VERSION="==2.0.2"
;;
*)
echo "Using default deps"
NUMPY_PINNED_VERSION="=1.11.3"
echo "Unsupported version $desired_python"
exit 1
;;
esac
@ -184,17 +181,17 @@ tmp_env_name="wheel_py$python_nodot"
conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "$tmp_env_name" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS}
source activate "$tmp_env_name"
retry pip install -r "${pytorch_rootdir}/requirements-build.txt"
pip install "numpy=${NUMPY_PINNED_VERSION}" "pyyaml${PYYAML_PINNED_VERSION}" requests ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing-extensions
PINNED_PACKAGES=(
"numpy${NUMPY_PINNED_VERSION}"
)
retry pip install "${PINNED_PACKAGES[@]}" -r "${pytorch_rootdir}/requirements-build.txt"
pip install requests ninja typing-extensions
retry pip install -r "${pytorch_rootdir}/requirements.txt" || true
retry brew install libomp
# For USE_DISTRIBUTED=1 on macOS, need libuv, which is built as part of the tensorpipe submodule
export USE_DISTRIBUTED=1
if [[ -n "$CROSS_COMPILE_ARM64" ]]; then
export CMAKE_OSX_ARCHITECTURES=arm64
fi
export USE_MKLDNN=OFF
export USE_QNNPACK=OFF
export BUILD_TEST=OFF
@ -202,16 +199,7 @@ export BUILD_TEST=OFF
pushd "$pytorch_rootdir"
echo "Calling setup.py bdist_wheel at $(date)"
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
echo "Calling setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 python setup.py bdist_wheel -d "$whl_tmp_dir"
echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 CMAKE_FRESH=1 python setup.py bdist_wheel -d "$whl_tmp_dir"
echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
else
python setup.py bdist_wheel -d "$whl_tmp_dir"
fi
python setup.py bdist_wheel -d "$whl_tmp_dir" --plat-name ${mac_version}
echo "Finished setup.py bdist_wheel at $(date)"

View File

@ -65,16 +65,8 @@ fi
if [[ "$PACKAGE_TYPE" != libtorch ]]; then
if [[ "\$BUILD_ENVIRONMENT" != *s390x* ]]; then
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pkg_no_python="$(ls -1 /final_pkgs/torch_no_python* | sort |tail -1)"
pkg_torch="$(ls -1 /final_pkgs/torch-* | sort |tail -1)"
# todo: after folder is populated use the pypi_pkg channel instead
pip install "\$pkg_no_python" "\$pkg_torch" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}_pypi_pkg"
retry pip install -q numpy protobuf typing-extensions
else
pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"
retry pip install -q numpy protobuf typing-extensions
fi
pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"
retry pip install -q numpy protobuf typing-extensions
else
pip install "\$pkg"
retry pip install -q numpy protobuf typing-extensions

View File

@ -71,14 +71,7 @@ export PYTORCH_BUILD_NUMBER=1
# Set triton version as part of PYTORCH_EXTRA_INSTALL_REQUIREMENTS
TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for all the wheel builds here, hence append TRITON_CONSTRAINT
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64'"
# CUDA 12.9 builds have triton for Linux and Linux aarch64 binaries.
if [[ "$DESIRED_CUDA" == "cu129" ]]; then
TRITON_CONSTRAINT="platform_system == 'Linux'"
fi
TRITON_CONSTRAINT="platform_system == 'Linux'"
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" && ! "$PYTORCH_BUILD_VERSION" =~ .*xpu.* ]]; then
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
@ -134,7 +127,6 @@ export DESIRED_PYTHON="${DESIRED_PYTHON:-}"
export DESIRED_CUDA="$DESIRED_CUDA"
export LIBTORCH_VARIANT="${LIBTORCH_VARIANT:-}"
export BUILD_PYTHONLESS="${BUILD_PYTHONLESS:-}"
export USE_SPLIT_BUILD="${USE_SPLIT_BUILD:-}"
if [[ "${OSTYPE}" == "msys" ]]; then
export LIBTORCH_CONFIG="${LIBTORCH_CONFIG:-}"
if [[ "${LIBTORCH_CONFIG:-}" == 'debug' ]]; then

View File

@ -23,10 +23,6 @@ if [[ "${DRY_RUN}" = "disabled" ]]; then
AWS_S3_CP="aws s3 cp"
fi
if [[ "${USE_SPLIT_BUILD:-false}" == "true" ]]; then
UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_pypi_pkg"
fi
# this is a special build with all dependencies packaged
if [[ ${BUILD_NAME} == *-full* ]]; then
UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_full"
@ -55,16 +51,12 @@ s3_upload() {
s3_upload_dir="${s3_root_dir}/${UPLOAD_SUBFOLDER}/"
fi
(
cache_control_flag=""
if [[ "${UPLOAD_CHANNEL}" = "test" ]]; then
cache_control_flag="--cache-control='no-cache,no-store,must-revalidate'"
fi
for pkg in ${PKG_DIR}/*.${extension}; do
(
set -x
shm_id=$(sha256sum "${pkg}" | awk '{print $1}')
${AWS_S3_CP} --no-progress --acl public-read "${pkg}" "${s3_upload_dir}" \
--metadata "checksum-sha256=${shm_id}" ${cache_control_flag}
--metadata "checksum-sha256=${shm_id}"
)
done
)

View File

@ -15,8 +15,7 @@ fi
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
export VC_YEAR=2022
export USE_SCCACHE=0
export XPU_VERSION=2025.1
export XPU_ENABLE_KINETO=1
export XPU_VERSION=2025.2
fi
echo "Free space on filesystem before build:"

Some files were not shown because too many files have changed in this diff.