Compare commits

...

1280 Commits

Author SHA1 Message Date
b53536b427 Fix #156261 _foreach_copy indexing (#158238)
Fix #156261 _foreach_copy indexing (#156719)

Fixes #156261

Thanks to @ngimel's fast eyes

For testing, I had experimented with a broader test case change but found that creating a tensor of 2**31+1 size was too expensive to do more than just a few times. Note that while the test case does not run in CI, I did run it locally to ensure it passes with new changes and fails without.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156719
Approved by: https://github.com/albanD

(cherry picked from commit 4ee4863232b9e07728d85254768bcba3aadc9b9a)

Co-authored-by: Jane Xu <janeyx@meta.com>
2025-07-14 16:42:28 -04:00
ff2170c940 Add sm_70 to windows 12.9 build (#158265)
Add sm_70 to windows 12.9 build (#158126)

Please see: https://github.com/pytorch/pytorch/issues/157517
Volta architectures will be kept for 12.8/12.9 builds for release 2.8 (12.8 win build does not need change since already including sm70)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158126
Approved by: https://github.com/Skylion007, https://github.com/atalman

(cherry picked from commit 86d8af6a6cc648134289de89d393d0dce5b3a5f4)

Co-authored-by: Ting Lu <tingl@nvidia.com>
2025-07-14 16:40:59 -04:00
f0101fd29f docs: add get_default_backend_for_device to distributed documentation (#158236)
docs: add get_default_backend_for_device to distributed documentation (#156783)

`torch.distributed.get_default_backend_for_device()` API was added to torch 2.6, but is still missing in distributed documentation. This commit addresses the gap.

CC: @guangyey, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156783
Approved by: https://github.com/guangyey, https://github.com/malfet

(cherry picked from commit b146ca74f01df3cf711fd0f855e05805e490156c)

Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2025-07-14 16:32:32 -04:00
bd17a453df don't error out in empty_cache under mempool context (#158180)
don't error out in empty_cache under mempool context (#158152)

Now instead of erroring out on `empty_cache` call during graph capture or under mempool context, we will just silently do nothing. This used to be the behavior for mempools, cudagraphs used to error out, but it's fine to just ignore the call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158152
Approved by: https://github.com/zou3519, https://github.com/eqy

(cherry picked from commit 9056279f8159b052599a31b591a78da1acc4224c)

Co-authored-by: Natalia Gimelshein <ngimel@meta.com>
2025-07-14 09:24:47 -07:00
983b95f90b [autograd] Avoid creating and recording event when unnecessary (#157914)
[autograd] Avoid creating and recording event when unnecessary (#157503)

Today, we always create and record an events in two places:
1) Upon seeing the first producer, we record an event on the producer, and we wait for this event in two places: (1) when the engine goes to run the consumer, the consumer stream waits for this event. (2) prior to doing accumulation, the accumulation stream waits for this event.

2) After doing accumulation, we record an event on the accumulation stream and wait for this event in a single place: when the engine goes to run the consumer.

We do not actually need to record the event in the cases where the 1st producer stream is the same as the consumer and as the accumulation stream, and where the accumulation stream is the same as the consumer stream.

Removing this unnecessary create + record event should save a few us for each instance avoided.

Fixes https://github.com/pytorch/pytorch/issues/157407

----

Manual test plan:
- [x] @eqy to confirm perf is restored
- [x] Running the repro originally reported before/after the patch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157503
Approved by: https://github.com/eqy
ghstack dependencies: #155715

(cherry picked from commit 8bda95228fbefa6ce204bf4da8b632d1516431bb)

Co-authored-by: soulitzer <soulitzer@gmail.com>
2025-07-14 07:19:27 -07:00
175782834a [aarch64] Add sm_80 to CUDA SBSA build (#158118)
[aarch64] Add sm_80 to CUDA SBSA build (#157843)

related to https://github.com/pytorch/pytorch/issues/152690

This adds sm_80 to CUDA SBSA builds (12.9), so that we will be able to support Ampere family (e.g: sm_86) and Ada family (e.g: sm_89) on CUDA SBSA builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157843
Approved by: https://github.com/Skylion007, https://github.com/atalman

(cherry picked from commit 297daa1d30c80826b939d8f2dcd07422dec72642)

Co-authored-by: Ting Lu <tingl@nvidia.com>
2025-07-11 13:41:21 -04:00
6e08036e2b [user triton] AOT inductor support for device-side TMA (#157241)
[user triton] AOT inductor support for device-side TMA (#155896)

Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`

Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel.

To support this in AOTI, this PR:
* records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen
* allocates global scratch, if needed (cuda/device_op_overrides.py)
* plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device)
* updates tests to verify this works for dynamically shaped inputs

This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined triton kernels that contain device-side TMA (which is the test I ran to verify this works)

Note: this overrides any user-provided allocator function (typically with eager triton code, the user must provide their own custom allocator function that is used to allocate scratch space).

For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda` https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155896
Approved by: https://github.com/desertfire

(cherry picked from commit b6c00dfe249a7bfc1d61a322d5bc30f164353abf)

Co-authored-by: David Berard <dberard@fb.com>
2025-07-11 13:21:56 -04:00
fd227e5208 Add sm_70 arch for linux cuda 12.8 and 12.9 builds (#157968)
Add sm_70 arch for linux cuda 12.8 and 12.9 builds (#157558)

Please see: https://github.com/pytorch/pytorch/issues/157517
We would like to keep Volta architectures by default for release 2.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157558
Approved by: https://github.com/Skylion007, https://github.com/Camyll, https://github.com/seemethere, https://github.com/malfet

(cherry picked from commit 179dcc10e4e0c742fb7d93b832021d0c177798bf)

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-07-11 10:28:47 -04:00
058b58ac9d Revert "Turn on compile with NVSHMEM (#154538)" (#158040)
This reverts commit 3685b101709d466ec79a976c55e9efcd0d1a19fa.
2025-07-11 09:13:30 -04:00
0afa9afa5e Revert "Add NVSHMEM to PYTORCH_EXTRA_INSTALL_REQUIREMENTS (#154568)" (#158039)
This reverts commit 34c6371d24a350db754a90c75361cdcf48cc0e71.
2025-07-11 09:12:52 -04:00
d24444a4cf [cherry-pick] revert #156552 (#156767)
Revert "Add fx_graph_runnable tests boilerplate (#156552)"

This reverts commit 0a2ec7681d2af973d8daaf7905431a088739dc90.
2025-07-10 15:04:11 -07:00
f885f15603 cherrypick revert of #152932 for release 2.8 (#158031)
* Revert "Add unified memory APIs for torch.accelerator (#152932)"

This reverts commit 35e44067c4d9cc9be2652c0b9098885c5a321029.

* Revert "Add DeviceAllocator as the base device allocator (#138222)"

This reverts commit 92409b6c89fbfbd3caa79c81b1e3d9e7917d3bc7.
2025-07-10 10:28:14 -07:00
d1d97caf15 [inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157454)
[inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)

Fixes #155006

Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg

(cherry picked from commit 82eefaedd98b63de8a87e34275af781f8eb177e1)

Co-authored-by: David Berard <dberard@fb.com>
2025-07-09 17:24:51 -04:00
b0543d6b7c [release] Triton pin update to 3.4 (#157752)
[release] Triton pin update to 3.4 (#156664)

Triton pin update issue: https://github.com/pytorch/pytorch/issues/154206
Please see post: https://dev-discuss.pytorch.org/t/2-8-final-rc-release-postponed-by-a-week/3101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156664
Approved by: https://github.com/davidberard98

(cherry picked from commit 3d06ff82a84a118f0ed246864d4fc01ac4726328)

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-07-08 19:03:21 -04:00
6550831051 [inductor][static launcher] Skip correctness test for test_floats (#157200)
[inductor][static launcher] Skip correctness test for test_floats (#157023)

https://github.com/triton-lang/triton/issues/6176 causes kernels that take fp64 scalar inputs to generate wrong results. Until we get around to fixing this, just skip the accuracy check (it'll fail on Triton's launcher anyway).

Differential Revision: [D77407307](https://our.internmc.facebook.com/intern/diff/D77407307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157023
Approved by: https://github.com/jamesjwu

(cherry picked from commit e8217ad8becd2b297682c685a9179997cb0a98cc)

Co-authored-by: David Berard <dberard@fb.com>
2025-07-07 15:01:33 -07:00
a83d7ae76a [ONNX] Bump onnxscript api for torch 2.8 (#157137)
[ONNX] Bump onnxscript api for torch 2.8 (#157017)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157017
Approved by: https://github.com/titaiwangms, https://github.com/malfet

(cherry picked from commit 36fd1ac9324429c095f8fbc5f6d2bd4b71f18d61)

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-07-07 15:00:43 -07:00
90d16ee800 Fix macOS build with USE_MPS=OFF (#156932)
Fix macOS build with `USE_MPS=OFF` (#156847)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156847
Approved by: https://github.com/angelayi

(cherry picked from commit 455dfd258980294f0745bd90aee12a323e37224d)

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
2025-07-07 14:59:26 -07:00
f6166e4427 [dynamo] do not issue lru_cache warning for functions in the top-level torch namespace (#157718)
[dynamo] do not issue lru_cache warning for functions in the top-level torch namespace (#157598)

`lru_cache` usage warning was being raised for `torch.get_device_module()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157598
Approved by: https://github.com/Sidharth123-cpu

(cherry picked from commit 52e4e41cbc36a5cf44395ff84ca2d069263560de)

Co-authored-by: William Wen <williamwen@meta.com>
2025-07-07 14:58:18 -07:00
30a9a25b15 [dynamo] Fix source for lru_cache method (#157308)
[dynamo] Fix source for lru_cache method (#157292)

Fixes - https://github.com/pytorch/pytorch/issues/157273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157292
Approved by: https://github.com/zou3519, https://github.com/malfet, https://github.com/jansel

(cherry picked from commit 3684be056d9af667400ba071a116be8b1112bba8)

Co-authored-by: Animesh Jain <anijain@umich.edu>
2025-07-07 09:55:02 -07:00
347259f35a [cherry-pick] Organize BUCK for torch/standalone and Rename torch::standalone to headeronly (#157418)
* Organize BUCK for torch/standalone (#156503)

Summary: Undo highlevel BUCKification in favor of something more organized by moving it to the dir itself

Test Plan:
CI

Rollback Plan:

Reviewed By: swolchok

Differential Revision: D76920013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156503
Approved by: https://github.com/swolchok

(cherry picked from commit acaf6ba3c6d0bdec88ab3f6c2ef82650050558d2)

* Reapply D77381084 / #156964: Rename torch::standalone to headeronly (#157251)

Was reverted due to internal failure which should be fixed now. I believe Jane wants this reapplied and picked to release, and she's out this week.

Original summary:

headeronly is more clear, let's change the name before anyone depends on standalone

Differential Revision: [D77520173](https://our.internmc.facebook.com/intern/diff/D77520173/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157251
Approved by: https://github.com/janeyx99, https://github.com/Skylion007, https://github.com/desertfire

(cherry picked from commit fee2377f9ea62223f69ea9904c5e25ccb2af5961)

---------

Co-authored-by: Jane Xu <janeyx@meta.com>
2025-07-07 09:52:12 -07:00
9fd946f04e [PowerPC] Fixed build issue for vsx vec256 complexfloat and scaled_mm_out_cpu (#157422)
[PowerPC] Fixed build issue for vsx vec256 complexfloat and scaled_mm_out_cpu  (#155255)

Pytorch build is failing on power system from this commit ec24f8f58a74502c5a2488f5d9e85a817616dda0

***Build Failure Logs***

**Error related to mkldnn**
```
pytorch/aten/src/ATen/native/Blas.cpp:302:26: error: ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
  302 |     if ((!mixed_dtype && cpuinfo_has_x86_amx_int8()) ||
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~
pytorch/aten/src/ATen/native/Blas.cpp:303:25: error: ‘cpuinfo_has_x86_amx_fp16’ was not declared in this scope
  303 |         (mixed_dtype && cpuinfo_has_x86_amx_fp16())) {
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~

```

**Error related to vec256 complex float redefinition**
```
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: specialization of ‘at::vec::DEFAULT::Vectorized<c10::complex<float> >’ after instantiation
   19 | class Vectorized<ComplexFlt> {
      |       ^~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: redefinition of ‘class at::vec::DEFAULT::Vectorized<c10::complex<float> >’

aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:633:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
  633 |   auto abs_a = a.abs_2_();
      |                  ^~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:634:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
  634 |   auto abs_b = b.abs_2_();
      |                  ^~~~~~

/aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:666:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
  666 |       vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:673:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
  673 |       vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
      |                 ^~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:680:27: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
  680 |       vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
```

***With  this changes build logs***
```
Building wheel torch-2.8.0a0+gita3098a7
-- Building version 2.8.0a0+gita3098a7
-- Checkout nccl release tag: v2.26.5-1
cmake -GNinja -DBLAS=OpenBLAS -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/torch -DCMAKE_PREFIX_PATH=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/lib/python3.12/site-packages -DPython_EXECUTABLE=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/bin/python -DTORCH_BUILD_VERSION=2.8.0a0+gita3098a7 -DUSE_MKLDNN=ON -DUSE_MKLDNN_CBLAS=ON -DUSE_NUMPY=True -DUSE_OPENMP=ON /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch
cmake --build . --target install --config Release
running build_ext
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using CUDA
-- Not using XPU
-- Using MKLDNN
-- Not using Compute Library for the Arm architecture with MKLDNN
-- Using CBLAS in MKLDNN
-- Not using NCCL
-- Building with distributed package:
  -- USE_TENSORPIPE=True
  -- USE_GLOO=True
  -- USE_MPI=False
-- Building Executorch
-- Not using ITT
Copying functorch._C from functorch/functorch.so to /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
copying functorch/functorch.so -> /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
building 'torch._C' extension
creating build/temp.linux-ppc64le-cpython-312/torch/csrc

```

This patch will fix the pytorch build issue on power, and i am able to build successfully.

Hi @malfet  @albanD

Please review this PR for pytorch build issue that we are observing on power.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155255
Approved by: https://github.com/albanD, https://github.com/malfet

(cherry picked from commit 5e18bc333144473f1f10bc8a5ba05dba7950fb8a)

Co-authored-by: Avanish Tiwari <avanish@linux.ibm.com>
2025-07-07 09:50:47 -07:00
7228ea51f8 [ONNX] Fix conversion of attention - 4D (#157509)
[ONNX] Fix conversion of attention - 4D (#157130)

Fixes a wrong conversion to onnx while investigation #149662.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157130
Approved by: https://github.com/gramalingam, https://github.com/justinchuby, https://github.com/titaiwangms


(cherry picked from commit 0105cd89ab508ec56126c1de85c8f5b5acc446b5)

Co-authored-by: xadupre <xadupre@microsoft.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-07-07 09:47:18 -07:00
8b879b0cac [dynamo] Fix bug in dict(mapping_proxy) (#157515)
[dynamo] Fix bug in dict(mapping_proxy) (#157467)

Fixes https://github.com/pytorch/pytorch/issues/157284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157467
Approved by: https://github.com/jansel, https://github.com/StrongerXi


(cherry picked from commit 48560eef80e97e855cbb8e2814acefe8f5cc6fbd)

Co-authored-by: Animesh Jain <anijain@umich.edu>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-07 12:46:28 -04:00
b10409a417 [cherry-pick] [fake tensor] fix issue of no attribute tags (#156689) (#157519)
[fake tensor] fix issue of no attribute tags (#156689)

Fixes #156688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156689
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman

(cherry picked from commit 7597988f1b5a41c0b91d379e0ce51111fd7cc95a)
2025-07-07 09:44:51 -07:00
0cb3b840b0 Add einops x torch.compile testing in PyTorch CI (#157416) (#157588)
Fixes #146782. This PR adds testing for multiple einops versions in
PyTorch CI. This occurs in a new "einops" CI job that runs for both
Python 3.9 and 3.13 (aka, what we test Dynamo over).

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157416
Approved by: https://github.com/guilhermeleobas, https://github.com/arogozhnikov, https://github.com/anijain2305
2025-07-07 09:41:11 -07:00
d91af204d8 Fix cuda 12.9 aarch64 GPU builds. Update CUDA_STABLE variable. (#157641)
Fix cuda 12.9 aarch64 GPU builds. Update CUDA_STABLE variable.  (#157630)

This contains 2 fixes that required in main and will need to be cherry-picked to Release 2.8 branch:
1. The PR https://github.com/pytorch/pytorch/pull/155819 missed to include triton change.
2. CUDA STABLE variable needs to be set to 12.8. Updating CUDA stable updates full static build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157630
Approved by: https://github.com/Skylion007, https://github.com/jeanschmidt

(cherry picked from commit 7275f280454f790414b24147a2ba7f94d0eabcf6)

Co-authored-by: Andrey Talman <atalman@meta.com>
2025-07-04 19:22:01 -04:00
3db12a59d7 Remove +PTX from CUDA 12.8 builds (#157634)
Remove +PTX from CUDA 12.8 builds (#157516)

Remove +PTX from CUDA 12.8 builds and small refactor in build_cuda.sh.
Removing +PTX reduces binary size required to be able to upload binaries to pypi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157516
Approved by: https://github.com/malfet, https://github.com/ptrblck, https://github.com/tinglvv

(cherry picked from commit 84085229765698166f07c9220d5544023ab80d47)

Co-authored-by: atalman <atalman@fb.com>
2025-07-04 14:54:10 -04:00
2b0f8e9c86 Cleanup leftover miniconda brew installation (#157567)
Cleanup leftover miniconda brew installation (#156898)

That results in torch.compile being unable to produce working artifacts

Should fix https://github.com/pytorch/pytorch/issues/156833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156898
Approved by: https://github.com/seemethere, https://github.com/atalman

(cherry picked from commit 214e2959dcdbf91a999d5c0a5d40c91e4442e8c5)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-07-04 14:53:20 -04:00
8e00a755d6 Fix GITHUB_OUTPUT syntax in create_release.yml workflow (#157539)
Fix GITHUB_OUTPUT syntax in create_release.yml workflow (#157466)

#149919 fixed a number of linting issues, however, the conversion of the deprecated `::set-output` command to the new `>> $GITHUB_OUTPUT` redirect syntax went wrong, resulting in [failing uploads of the 2.8.0 rc1-rc3 pre-release tarballs](https://github.com/pytorch/pytorch/actions/runs/15892205745/job/44816789782).

This PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157466
Approved by: https://github.com/clee2000, https://github.com/atalman

(cherry picked from commit 5d5a5b3501dfb0759ed36d0a88b65cdcd87c1e27)

Co-authored-by: Klaus Zimmermann <klaus.zimmermann@quansight.com>
2025-07-04 14:48:28 -04:00
3e6f088e40 [aarch64] Add back NCCL lib to cuda arm wheel (#157105)
[aarch64] Add back NCCL lib to cuda arm wheel (#156888)

We discovered that when importing latest 12.9 arm nightly wheel, it is missing the NCCL lib. With the use of USE_SYSTEM_NCCL=1, we need to copy the libnccl.so lib into our big wheel environment, so that it can be dynamically linked at runtime.

https://github.com/pytorch/pytorch/pull/152835 enabled USE_SYSTEM_NCCL=1, which would use the system NCCL by default, and it would no longer use the one built from libtorch_cuda.so. With this PR, we add back the libnccl.so to be used at runtime. In this way, we also provide the flexibility to use different versions of NCCL from what came with the original pytorch build.

related - https://github.com/pytorch/pytorch/issues/144768

```
Python 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 417, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156888
Approved by: https://github.com/atalman

(cherry picked from commit de45c5f673ce261e9a82c54280beeda36cff640e)

Co-authored-by: Ting Lu <tingl@nvidia.com>
2025-07-04 14:45:41 -04:00
9f3fd07c40 [MPS] Revert cumsum/cumprod to MPSGraph implementation (#157494)
[MPS] Revert cumsum/cumprod to MPSGraph implementation (#156708)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156708
Approved by: https://github.com/malfet

(cherry picked from commit 2d7e6c6241971106a56073d7a53c7d1336b11a51)

Co-authored-by: Manuel Candales <mcandales@meta.com>
2025-07-03 09:22:10 -07:00
7d5aff5f25 [ez] Disable some failing periodic tests (#157560)
[ez] Disable some failing periodic tests (#156731)

test_torch.py::TestTorchDeviceTypeCUDA::test_storage_use_count_cuda:
Added in https://github.com/pytorch/pytorch/pull/150059
Fails in debug mode [GH job link](https://github.com/pytorch/pytorch/actions/runs/15856606665/job/44706020831) [HUD commit link](4491326fb0)

inductor/test_inductor_freezing.py::FreezingGpuTests::test_cpp_wrapper_cuda:
[GH job link](https://github.com/pytorch/pytorch/actions/runs/15856606665/job/44707119967) [HUD commit link](4491326fb0)
started failing after moving to new cuda version https://github.com/pytorch/pytorch/pull/155234

I'll ping people if this gets merged

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156731
Approved by: https://github.com/huydhn

(cherry picked from commit 2ff3280c77c705e11c5211d4be8fef9853cd0559)

Co-authored-by: Catherine Lee <csl@fb.com>
2025-07-03 09:19:55 -07:00
ec45b4c0fe Revert "Update triton version to 3.4" (#157471)
Revert "Update triton version to 3.4 (#156890)"

This reverts commit 03eb1e40f9ddf09cb9eef86ace74332e87f11a79.
2025-07-02 13:33:21 -04:00
4f0798f34d [ROCm] Bump AOTriton to 0.10b (#156845)
[ROCm] Bump AOTriton to 0.10b (#156499)

Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd

(cherry picked from commit d9577df312d477e8fa5b9d7bc61fb1f2c07b8e48)

Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
2025-06-30 09:03:56 -04:00
998ffd4b25 [cherry-pick] revert #156517 on release 2.8 (#156768)
Revert "[logging] dynamo_timed for CachingAutotuner.coordinate_descent_tuning (#156517)"

This reverts commit fb75dea2c1b93c78dccf08d5fd5e20b362ecd405.
2025-06-26 10:42:35 -04:00
3d53a53e50 Fix environment and push env var for docker image builds for binary builds (#156916)
Fix environment and push env var for docker image builds for binary builds  (#156910)

Changes WITH_PUSH and the environment check to be ok with giving credentials to push to docker io if its on the main branch, a tag starting with v, or the release branch

Credentials for pushing to docker io are in the environment, so without the environment, you can't push to docker io.  You also don't do the push unless WITH_PUSH is true

binary builds on release branch were failing because they pull from docker io, but the docker build wasn't pushing to docker io because it was either on the release branch (didn't have credentials https://github.com/pytorch/pytorch/actions/runs/15888166271/job/44813180986) or it was on the tag (doesn't have WITH_PUSH)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156910
Approved by: https://github.com/atalman

(cherry picked from commit 78ee2ee90eed957aec3dc80423b108b16938a8ae)

Co-authored-by: Catherine Lee <csl@fb.com>
2025-06-25 22:15:13 -04:00
03eb1e40f9 Update triton version to 3.4 (#156890)
[TEST] triton Update 3.4 - 2
2025-06-25 18:05:36 -04:00
8de5ce7155 [cherry pick] revert #155412 (#156757)
Revert "Remove remaining CUDA 12.4 CI code (#155412)"

This reverts commit 9fed2addedb42da86b657165fe14eadc911232cf.
2025-06-25 14:53:55 -04:00
421d45d9a1 [RELEASE 2.8] Release only changes (#156728)
* [RELEASE 2.8] Release only changes

* test

* tests_files

* docker
2025-06-24 16:35:15 -04:00
3a7ff829c5 Fix MacOS MP hang in Python-3.12+ (#155698)
By leaking resource_tracker destructor (introduced by https://github.com/python/cpython/issues/88887 )  at exit, as at this point handle to child process might no longer be valid

Also, switch CI from using `setup-miniconda` to `setup-python` as an integration test for the fix as all data loader tests will hang otherwise
- Remove `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both python and python3 aliases first in the path (not sure what other action are messing with path environment variable)

Fixes https://github.com/pytorch/pytorch/issues/153050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
2025-06-24 12:13:35 +00:00
f5e6e52f25 [BE][PYFMT] migrate PYFMT for test/inductor/ to ruff format (#148186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148186
Approved by: https://github.com/jansel
2025-06-24 11:12:11 +00:00
4e8dd11be1 simplify nvrtc discovery login in compile_kernel (#156674)
Followup from https://github.com/pytorch/pytorch/pull/156332

Tested a bunch while I was working on https://github.com/pytorch/pytorch/pull/156380

Works just fine on dev gpus
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156674
Approved by: https://github.com/malfet
2025-06-24 08:55:40 +00:00
ce73b0c53f Validate custom op support for compile_kernel (#156332)
Follow-up work from #151484 - just makes sure that compile_kernel composes nicely with custom ops by writing some new tests, no new code functionality is added

benchmark failure in CI is unrelated to this change, CI is green
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156332
Approved by: https://github.com/zou3519, https://github.com/malfet
2025-06-24 08:21:21 +00:00
35e44067c4 Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following API will be put under torch.accelerator
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-06-24 07:57:48 +00:00
cyy
ce1a07570d Fix TORCH_CUDA_ARCH_LIST (#156667)
Before the fix, `TORCH_CUDA_ARCH_LIST` variable contains string `TORCH_CUDA_ARCH_LIST`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156667
Approved by: https://github.com/ngimel
2025-06-24 07:27:53 +00:00
04178d347c [Reland] [Intel GPU] Make SDPA output has the same stride as Query. (#154340)
Fixes [#153903](https://github.com/pytorch/pytorch/issues/153903).

Currently the output tensor of SDPA XPU is always defined as contiguous stride, while CPU/CUDA flash_attention and cudnn_attention allocate output tensor with stride the same as Query.

This PR aligns XPU's behavior with CUDA/CPU to make XPU compatible to CPU/CUDA's modeling code.

The function `alloc_with_matching_layout` is copied from cudnn 8c16d0e404/aten/src/ATen/native/cudnn/MHA.cpp (L874)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154340
Approved by: https://github.com/guangyey, https://github.com/drisspg
2025-06-24 06:09:59 +00:00
a7b29c88b1 [ONNX] Preserve all legacy exporter params in fallback (#156659)
Fixes #151693

Previous to this PR, the fallback does not take care of all user parameters. This pr preserves them to ensure a smooth transition for users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156659
Approved by: https://github.com/justinchuby
2025-06-24 05:28:55 +00:00
a6a8641c8a Fix UT failure on non-cuda backend (#156577)
# Motivation
`HAS_TRITON` is a generic API that could return `True` on xpu backend. It will result in these cases failing on xpu. So we should use `HAS_CUDA` (equivalently `torch.cuda.is_available() && HAS_TRITON`) to avoid these failures.

Please refer to https://github.com/pytorch/pytorch/actions/runs/15813693789/job/44569593370#step:15:2129

# Additional Context
This PR aims to fix the CI failure soon. We will have a dedicated PR to generalize these UT to be generic. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @daisyden
Fix https://github.com/pytorch/pytorch/issues/156576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156577
Approved by: https://github.com/jansel
2025-06-24 05:24:24 +00:00
495c317005 Replace deprecated is_compiling method (#154476)
Replace depreacted `is_compiling` in `torch._dynamo` with `torch.compiler`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154476
Approved by: https://github.com/eellison
2025-06-24 05:16:40 +00:00
1044934878 [CUDAGraph] add config cudagraph_capture_sizes (#156551)
Users may want CUDAGraph for certain sizes and fallback for other sizes.

As discussed in Issue #121968, we would like to use cudagraph for [batch size [1,2,3,...,16]](https://github.com/pytorch/pytorch/issues/121968#issuecomment-2259942345) and fallback for others.

Another use case is [vllm](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/cuda_piecewise_backend.py#L114-L119), where 67 batch sizes (i.e., [1,2,4,8,16,24,32,...,512]) are captured and all other sizes fallback.

This PR implements the feature with `torch._inductor.config.triton.cudagraph_capture_sizes`. When it is specified, we only capture cudagraph for these shapes. When it is None (by default), we capture cudagraph for all shapes.

Example:
```python
import torch

torch._inductor.config.triton.cudagraph_capture_sizes = [(2,3), (4,5), (6, 2), (7,3)]

def f(x):
    return x + 1

f = torch.compile(f, mode="reduce-overhead", dynamic=False)

def run(batch_size, seq_len, d):
    x = torch.randn((batch_size, seq_len, d), device="cuda")
    # Need to mark the dimension as dynamic. Automated-dynamic
    # may have some ux issues on matching `cudagraph_capture_sizes`
    # with the actual dynamic shapes, since there are specialization and
    # multiple dynamo graphs.
    torch._dynamo.mark_dynamic(x, 0)
    torch._dynamo.mark_dynamic(x, 1)
    for _ in range(3):
        f(x)

for i in range(2, 10):
    for j in range(2, 10):
        run(i, j, 8)

num_cudagraph = torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id()
assert num_cudagraph.id == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156551
Approved by: https://github.com/bobrenjc93
2025-06-24 05:14:49 +00:00
899d3d3e9e Don't call sum() on a tensor that is not summable in layer_norm (#156600)
Don't call `sum()` on a tensor that is default constructed.

Previously we could call `sum()` on a tensor that was default-contructed. That would lead to an error like this:

```
Traceback (most recent call last):
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper
    method(*args, **kwargs)
  File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps
    ln_out_cuda.backward(grad_output_cuda)
  File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward
    torch.autograd.backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward
    _engine_run_backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: tensor does not have a device
Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::TensorBase::options() const from :0
#8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0
#9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0
#14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0
#17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0

```

Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156600
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-06-24 05:00:42 +00:00
17eb649d55 Implement guard collectives (optimized version) (#156562)
This is a remix of https://github.com/pytorch/pytorch/pull/155558

Instead of mediating guard collective via a config option, in this one it's done via a `set_stance` like API. The motivation is that checking for the config value on entry on torch.compile is apparently quite expensive, according to functorch_maml_omniglot. So this makes it a bit cheaper.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156562
Approved by: https://github.com/Microve
2025-06-24 04:59:49 +00:00
73772919d2 remove deprecated numpy.typing.mypy_plugin in mypy.ini (#156601)
Fixes #156489
removed deprecated numpy plugin in mypy.ini
 @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156601
Approved by: https://github.com/ezyang
2025-06-24 04:56:08 +00:00
6d5c789ad5 [BE][PYFMT] migrate PYFMT for test/[a-h]*/ to ruff format (#144555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144555
Approved by: https://github.com/ezyang
ghstack dependencies: #144551, #144554
2025-06-24 04:53:54 +00:00
e600e044a7 Revert "[aotd] Support mutations of the same input in fw and bw (#155354)"
This reverts commit 3f920f3d8f5bd15d2222758f21f9a5d36e4dad1f.

Reverted https://github.com/pytorch/pytorch/pull/155354 on behalf of https://github.com/malfet due to Not sure why CI was green, but it breaks tons of tests, see 930b575389/1 ([comment](https://github.com/pytorch/pytorch/pull/155354#issuecomment-2998780884))
2025-06-24 04:42:14 +00:00
930b575389 [symm_mem] Add sym mem test into ptd h100 ci (#156634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156634
Approved by: https://github.com/ngimel, https://github.com/mori360
2025-06-24 03:43:22 +00:00
b2d473c8f8 [ROCm][Windows] Fix rocsolver undefined symbol error (#156591)
Fix undefined symbol error while using `rocsolver_ssyevd_strided_batched` call in `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156591
Approved by: https://github.com/jeffdaily
2025-06-24 03:28:45 +00:00
87d615efab [fr] Use a vector to temporarily keep the reference to future object to avoid block (#156653)
At the end of the scope when std::async is launched, a wait will be called which could makes the code blocking, this is not expected for monitoring thread. Instead, let's use a vector to contain the reference to it. So no blocking will happen. And at the end of loop, wait will still be called but it is ok since all the checks or dump has already been finished.

Differential Revision: [D77190380](https://our.internmc.facebook.com/intern/diff/D77190380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156653
Approved by: https://github.com/kwen2501
2025-06-24 03:25:04 +00:00
cyy
b09bd414a6 Deprecate c10::string (#155084)
Now there is no mention of c10::string in OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155084
Approved by: https://github.com/ezyang
2025-06-24 03:03:06 +00:00
0a2ec7681d Add fx_graph_runnable tests boilerplate (#156552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156552
Approved by: https://github.com/StrongerXi
2025-06-24 02:41:38 +00:00
9665702c64 [nativert] reland D76832891 remove designated initializer cpp20 (#156565)
Summary: fix windows build broke in https://github.com/pytorch/pytorch/pull/156508

Test Plan:
ci

Rollback Plan:

Differential Revision: D77080420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156565
Approved by: https://github.com/zhxchen17
2025-06-24 02:38:08 +00:00
6a3d00aa3b Add Windows cuda 12.9.1 build (#156630)
Without Support for SegmentReduce.cu
Test PR confirmed by Removing SegmentReduce.cu windows build for CUDA 12.9 can succeed

Related to: https://github.com/pytorch/pytorch/issues/156181
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156630
Approved by: https://github.com/malfet

Co-authored-by: Ting Lu <tingl@nvidia.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-24 02:15:49 +00:00
a9ef7c4d04 [dynamo] update to lru_cache message and updated user stack trace in debug mode (#156639)
I had to create a new PR for this because of @atalman request of temporary reverting the previous PR to restore diff train sync. Nothing has changed from this PR and the original one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156639
Approved by: https://github.com/atalman
2025-06-24 01:52:13 +00:00
86996c15dc [Inductor] Allow exhaustive autotuning across all GEMM options (#156610)
Differential Revision: D76843916

Exhaustive autotuning is meant to autotune GEMM configs across the entire search space of possible configs. Some of these configs can cause extremely long compilation times and OOMs, especially with configs of the following nature:
Excessive register spillage
Using much larger amounts of shared memory than available on the hardware
This diff prunes out those configs to make exhaustive autotuning more viable, along with supporting exhaustive autotuning for persistent+tma template and decompose_k. Previously, exhaustive autotuning would hang, now we are able to tune shapes in ~5 minutes. Below is a sample log for autotuning with exhaustive:

```
  AUTOTUNE mm(1152x21504, 21504x1024)
  strides: [21504, 1], [1, 21504]
  dtypes: torch.bfloat16, torch.bfloat16
  mm 0.1167 ms 100.0%
  triton_mm_6270 0.1172 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_6522 0.1183 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_persistent_tma_7482 0.1190 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_persistent_tma_7483 0.1195 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_6523 0.1274 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_6267 0.1285 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_6519 0.1287 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_persistent_tma_7480 0.1298 ms 89.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_persistent_tma_7312 0.1302 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  SingleProcess AUTOTUNE benchmarking takes 298.7185 seconds and 21.2569 seconds precompiling for 2210 choices
  INFO:tritonbench.utils.triton_op:Took 333894.46ms to get benchmark function for pt2_matmul_maxautotune
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156610
Approved by: https://github.com/jansel
2025-06-24 01:42:05 +00:00
40a785103c [dynamo] fix debugging code_parts for relational guards (#154753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154753
Approved by: https://github.com/anijain2305
ghstack dependencies: #154772
2025-06-24 01:38:29 +00:00
849468034d [dynamo] fix selecting shape guards (#154772)
Not all LAMBDA_GUARDs are shape guards. Only the epilogue guards
are lambda guards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154772
Approved by: https://github.com/anijain2305
2025-06-24 01:38:29 +00:00
5dd9652389 Clean up HF components (#155707)
Differential Revision: [D76427358](https://our.internmc.facebook.com/intern/diff/D76427358/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155707
Approved by: https://github.com/saumishr
2025-06-24 00:07:37 +00:00
ca5a40395d [partitioner] Fix _broadcast_on_rank0 to use deterministic hash function (#153734)
Summary:
I was using python's hash, which is not deterministic across different interpreter runs.

Use hashlib instead.

Test Plan:
Run using it

https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-rebase_sanity_128bs_8t_cc-8e17be61ce?job_attempt=1&version=0&tab=summary&env=prod

Differential Revision: D74882405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153734
Approved by: https://github.com/Microve
2025-06-24 00:06:23 +00:00
24063ad109 Fix native static dispatch kernels (#156331)
Summary: Fix for native static dispatch kernels not taking effect

Test Plan:
```
buck2 test //sigmoid/backend/test:static_kernels_ops_test

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkByOp --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice "" --pytorch_predictor_sigmoid_static_dispatch_enable=true --pytorch_predictor_sigmoid_graph_passes_enable=true --benchmarkEnableProfiling=true --load_lowered_merge=3 --using_aoti_lowering_allowlist=false --requestFilePath=/data/users/georgiaphillips/replayer/inputs/742055223/0/mix/742055223_0_mix.inputs.recordio --benchmarkNumIterations=2
```

Rollback Plan:

Reviewed By: dolpm

Differential Revision: D76559764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156331
Approved by: https://github.com/Skylion007, https://github.com/jingsh
2025-06-24 00:05:49 +00:00
380e30a723 [EZ/Profiler] Change 'b' to 'B' in FunctionEvent Frontend (#156250)
Summary: Fixes https://github.com/pytorch/pytorch/issues/149311

Test Plan:
Just changes string output

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      60.993us         0.97%      60.993us       1.848us           0 B           0 B            33
...
```

Rollback Plan:

Differential Revision: D76857251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156250
Approved by: https://github.com/sanrise
2025-06-23 23:25:04 +00:00
07bb097698 Fix clang-tidy bugprone* warnings (#148529)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148529
Approved by: https://github.com/ezyang
2025-06-23 23:09:56 +00:00
3f920f3d8f [aotd] Support mutations of the same input in fw and bw (#155354)
Original issue: https://github.com/pytorch/pytorch/issues/154820

The issue happens when there is a mutation for the same input in forward AND in backward.

AOTD emited copy_ after joint_function tracing. This made this fx-node to correspond to the side effects of both mutations (in forward and in backward).
After that partitioner can put it either in forward or in backward.

The fix:

1/ Introduce joint_function.handle that allows to set "post_forward" callback, to be able to check inputs state after forward

We do not want to apply the mutation after joint, if we already applied it in forward. For that we need "mutation_counter" and memorize the version of mutation that we applied for  forward mutation.

2/ Exposing mutation_counter to python

We want to keep invariant that copy_ exist only in the end of joint graph.

3/ We memorize mutation_counter and state of the inputs after forward, using the handle post_forward.
Emit post_forward mutations after joint graph fully traced.

add for post_forward mutations "must_be_in_forward" tag (similar to existing "must_be_in_backward") to keep them in forward.

4/ Ban recompute of the source of mutation. Recompute can apply the same op (e.g. add) in forward and backward.
For this set MUST_SAVE for the source of mutation in forward.

proxy_tensor changes:

By default proxy tensor updates tensor_tracker. In this case applied mutations will be chained.
But we want that this copy_ will be independent and applied just to primals.
For this introducing a contextmanager to be able to disable update of tensor_tracker for adding forward mutations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
2025-06-23 22:25:45 +00:00
c82a174cea Extract CPU log_softmax kernels to header (#156243)
This allows sharing them with ExecuTorch.

Differential Revision: [D76830114](https://our.internmc.facebook.com/intern/diff/D76830114/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156243
Approved by: https://github.com/janeyx99
2025-06-23 21:31:16 +00:00
96e4c95cd8 [Inductor] Subgraph as a choice symbolic expression as input (#156185)
Differential Revision: D76514984

Fix subgraph as a choice for when a symbolic shape is inputted as an expression, i.e. 256 * s0, which typically happens in the backwards pass. The current logic assumes that all symbolic shapes are single inputs, i.e. standalone s0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156185
Approved by: https://github.com/masnesral
2025-06-23 21:29:17 +00:00
b1d62febd0 Revert "Use official CUDAToolkit module in CMake (#154595)"
This reverts commit 08dae945ae380d80efbaf140a95abfc5d96e5100.

Reverted https://github.com/pytorch/pytorch/pull/154595 on behalf of https://github.com/malfet due to It breaks on some local setup with no clear diagnostic, but looks like it fails to find cuFile ([comment](https://github.com/pytorch/pytorch/pull/154595#issuecomment-2997959344))
2025-06-23 21:15:31 +00:00
31e1274597 [MTIA Aten Backend] Migrate max.dim_max / min.dim_min (#156568)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate max.dim_max / min.dim_min to in-tree.

Differential Revision: [D77095185](https://our.internmc.facebook.com/intern/diff/D77095185/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156568
Approved by: https://github.com/malfet
ghstack dependencies: #156502, #156539, #156554
2025-06-23 20:43:39 +00:00
dfdd636cfa [aoti] Check longlong upperbound for codegening input size check (#156522)
Summary:
Fixes
```
error: integer literal is too large to be represented in any integer type
 38979 |     if (arg410_1_size[0] > 1171368248680556527362) {
```

Test Plan: ci

Differential Revision: D77057898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156522
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-06-23 20:38:34 +00:00
edd9c09e73 [MTIA Aten Backend] Migrate isnan (#156554)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate isnan to in-tree.

Differential Revision: [D77094811](https://our.internmc.facebook.com/intern/diff/D77094811/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156554
Approved by: https://github.com/malfet
ghstack dependencies: #156502, #156539
2025-06-23 20:22:32 +00:00
070e580d30 [MTIA Aten Backend] Migrate _log_softmax.out / _log_softmax_backward_data.out (#156539)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate _log_softmax.out / _log_softmax_backward_data.out to in-tree.

Differential Revision: [D77044380](https://our.internmc.facebook.com/intern/diff/D77044380/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156539
Approved by: https://github.com/malfet
ghstack dependencies: #156502
2025-06-23 19:56:01 +00:00
93cd16512f [MTIA Aten Backend] Migrate maximum.out / minimum.out / cos.out / erf.out / exp.out (#156502)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate maximum.out / minimum.out / cos.out / erf.out / exp.out to in-tree.

Differential Revision: [D76917384](https://our.internmc.facebook.com/intern/diff/D76917384/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156502
Approved by: https://github.com/malfet
2025-06-23 19:56:01 +00:00
ee4d343499 Revert "[dynamo] handle fullgraph toggle using nested torch.compile (#155166)" (#156624)
This reverts changes to [test/dynamo/test_repros.py](https://github.com/pytorch/pytorch/compare/main...atalman:revert_only_portion_of_file?expand=1#diff-4c82a5798a61d4cceb176b2700ba6fdd7c3e72d575b8e7e22458589139459caa)

Missed by: ee3d9969cc (diff-036cb21341ff8e390cc250e74fe9e3f0f15f259ea4bec4abcce49d95febf1553)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156624
Approved by: https://github.com/Camyll
2025-06-23 19:30:08 +00:00
56b3bf0c74 [nativert] Move HigherOrderKernel (#156507)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the implementation to torch/:
fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Test Plan: CI

Differential Revision: D77032074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156507
Approved by: https://github.com/zhxchen17
2025-06-23 19:29:27 +00:00
d061a02e6e Revert "[invoke_subgraph] make same subgraph share get_attr target (#156260)"
This reverts commit 39dd2f4d7defc63164a7969bfac0d0c62ffac900.

Reverted https://github.com/pytorch/pytorch/pull/156260 on behalf of https://github.com/ydwu4 due to no signal, it breaks linter tests. ([comment](https://github.com/pytorch/pytorch/pull/156260#issuecomment-2997478798))
2025-06-23 18:24:10 +00:00
35d03398e5 Revert "[invoke_subgraph] make collect_meta_analysis fake prop cachable (#156347)"
This reverts commit f179b7198522e6d93bd103efba1a1ebd5a2cf891.

Reverted https://github.com/pytorch/pytorch/pull/156347 on behalf of https://github.com/ydwu4 due to no signal, it breaks linter tests. ([comment](https://github.com/pytorch/pytorch/pull/156347#issuecomment-2997453729))
2025-06-23 18:19:29 +00:00
98a34e8d4b Move code out of individual token linters (#152256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152256
Approved by: https://github.com/Skylion007
2025-06-23 18:16:33 +00:00
da910e603a [ROCm] update state check for test_trace_while_active* (#153545)
When timing is enabled, ROCR runtime used to sleep for a small amount which ensured that the application saw the correct state. However, for perf reasons this sleep was removed and now the state is not guaranteed to be "started". That's why I updated the test state check to be either "started" or "scheduled"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153545
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-23 17:58:14 +00:00
55ef7b15e0 Revert "[dynamo] fixes to lru_cache message and adding user stack trace in debug mode (#156463)"
This reverts commit afbf5420b8745099bf7d871f5a4fb6dec338f825.

Reverted https://github.com/pytorch/pytorch/pull/156463 on behalf of https://github.com/atalman due to This is temoprary revert, to restore diff train sync. We should be good to reland this change ([comment](https://github.com/pytorch/pytorch/pull/156463#issuecomment-2997335541))
2025-06-23 17:44:36 +00:00
a95504b10f [torchbench] update environment setup script (#156465)
Existing torchbench `Makefile` installs all models from torchbench, which could easily take 30 minutes, even if a developer only want to run 1 model.

This PR adds a config to only install torchbench models we want to run.

Example usage:
```
# Install 1 torchbench model
make build-deps TORCHBENCH_MODELS="alexnet"

# Install 3 torchbench models
make build-deps TORCHBENCH_MODELS="alexnet basic_gnn_gcn BERT_pytorch"

# Install all models
make build-deps

# Install all models
make build-deps TORCHBENCH_MODELS=""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156465
Approved by: https://github.com/ezyang
2025-06-23 17:41:29 +00:00
e583b88819 Revert "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit ac86ec0e60370c037e018137f2048cafd47c5c28.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/atalman due to internal breakage ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-2997314638))
2025-06-23 17:36:44 +00:00
f179b71985 [invoke_subgraph] make collect_meta_analysis fake prop cachable (#156347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156347
Approved by: https://github.com/anijain2305, https://github.com/zou3519
ghstack dependencies: #156260
2025-06-23 17:10:07 +00:00
39dd2f4d7d [invoke_subgraph] make same subgraph share get_attr target (#156260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156260
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-06-23 17:10:07 +00:00
276c790010 [ROCm][SymmetricMemory] Avoid bf16 to float conversion during reduce (#155587)
This PR helps improve the performance of one-shot and two-shot allreduce as reported here: https://github.com/pytorch/FBGEMM/issues/4072

One-Shot:
![image](https://github.com/user-attachments/assets/69fe0d53-6636-42e1-90e0-e5efb989f59f)
As shown in the numbers presented above, symmetric memory performance prior to the PR (baseline) was on average about 26% less than fbgemm's number reported in the issue above. After this PR, we are seeing 16% improvement on average as compared to fbgemm and 59% as compared to our baseline numbers.

Two-Shot:
![image](https://github.com/user-attachments/assets/e5c8a288-303e-4d50-814b-4348e589e1fc)
Similarly, in two-shot, we were originally underperforming by 12%. We have improved by 22% after this PR as compared to symmetric memory performance prior to this PR. However, two-shot performance is still about 23% lower than fbgemm. This work is still in progress and will be pushing those changes through a separate PR.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155587
Approved by: https://github.com/jeffdaily
2025-06-23 16:14:01 +00:00
5a533f74a1 Checkout optional submodules when publishing a release tarball (#156615)
This includes Eigen and nccl for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156615
Approved by: https://github.com/huydhn
2025-06-23 16:08:22 +00:00
6835ba1b34 Register hpu device to fake backend (#156076)
## MOTIVATION

This PR intends to add hpu ( Intel Gaudi) also to the list of devices that will be supported by the "fake" distributed backend and the process group that will be created.

## CHANGES
- Add "hpu" to the list of devices

@ankurneog, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156076
Approved by: https://github.com/d4l3k, https://github.com/EikanWang, https://github.com/albanD
2025-06-23 16:08:08 +00:00
cc410d3761 [SymmMem] Rename all_to_all_vdev ops (#156582)
`all_to_all_vdev` are not binding of NVSHMEM APIs. Removing the `nvshmem_` prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156582
Approved by: https://github.com/fduwjj
ghstack dependencies: #155134
2025-06-23 15:57:36 +00:00
640f5a7090 [dynamo] Support builtin bool on non-constant VTs (#155863)
In practice `bool(...)` is either constant folded by Dynamo or used for
branching (so most of its emulation logic lived in
`InstructionTranslator.generic_jump`.

This patch adds a dedicated `bool` hanlder (only for symbolic
bool/int/float for now), and fixes #136075.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155863
Approved by: https://github.com/williamwen42
2025-06-23 15:53:15 +00:00
6b45af38a5 [easy] better copy_misaligned_inputs assertion failure message (#154472)
internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/688540560729579/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154472
Approved by: https://github.com/williamwen42
2025-06-23 15:39:15 +00:00
2e9bd03f60 Implemented Size.__radd__ (#152554)
Fixes #144334
Builds on top of #146834 by @khushi-411

The needed trick was to add `PyNumberMethods` because these Number Protocol appears to be responsible for `__radd__` (see https://stackoverflow.com/q/18794169)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152554
Approved by: https://github.com/albanD

Co-authored-by: Khushi Agrawal <khushiagrawal411@gmail.com>
Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-06-23 15:38:37 +00:00
3cbae6dde8 [MPSInductor][BE] Fix multistage reduction check (#156567)
From less than max threadgroup size to less or equal to that, which eliminates redundant trivial loops.

I.e. it changes shader code generated for
```python
import torch

def f(x):
    var, mean = torch.var_mean(x, dim=2, keepdim = True)
    return x / var, var

torch.compile(f)(torch.rand(1, 16, 1024, dtype=torch.float32, device='mps'))

```

from
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device float* out_ptr1,
    device float* out_ptr2,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    int x0 = xindex;
    threadgroup float3 tmp_acc_0[1024];
    tmp_acc_0[r0_index * 1] = 0.0;
    for(auto r0_1_cnt = 0; r0_1_cnt < 1; ++r0_1_cnt) {
        int r0_1 = 1 * r0_index + r0_1_cnt;
        auto tmp0 = in_ptr0[r0_1 + 1024*x0];
        tmp_acc_0[r0_index * 1] = ::c10:🤘:welford_combine(tmp_acc_0[r0_index * 1], float3(tmp0, 0.0, 1.0));
    }
    auto tmp1 = c10:🤘:threadgroup_welford_combine(tmp_acc_0, 1024);
    auto tmp2 = 1023.0;
    auto tmp3 = tmp1.y / tmp2;
    out_ptr1[x0] = static_cast<float>(tmp3);
    for(auto r0_1_cnt = 0; r0_1_cnt < 1; ++r0_1_cnt) {
        int r0_1 = 1 * r0_index + r0_1_cnt;
        auto tmp4 = in_ptr0[r0_1 + 1024*x0];
        auto tmp5 = tmp4 / tmp3;
        out_ptr2[r0_1 + 1024*x0] = static_cast<float>(tmp5);
    }
}
```
to
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device float* out_ptr1,
    device float* out_ptr2,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    int r0_1 = r0_index;
    int x0 = xindex;
    threadgroup float tmp_acc_0[1024];
    auto tmp0 = in_ptr0[r0_1 + 1024*x0];
    tmp_acc_0[r0_index * 1] = tmp0;
    auto tmp1 = c10:🤘:threadgroup_welford_reduce(tmp_acc_0, 1024);
    auto tmp2 = 1023.0;
    auto tmp3 = tmp1.y / tmp2;
    out_ptr1[x0] = static_cast<float>(tmp3);
    auto tmp4 = tmp0 / tmp3;
    out_ptr2[r0_1 + 1024*x0] = static_cast<float>(tmp4);
}

``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156567
Approved by: https://github.com/dcci
ghstack dependencies: #156566
2025-06-23 14:49:26 +00:00
e28925aa75 [MPS] Activation kernels: do compute at float precision (#155735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155735
Approved by: https://github.com/malfet
ghstack dependencies: #155304, #155316, #155462, #155479, #155571, #155586
2025-06-23 14:48:57 +00:00
f5e1b24945 Revert "Enable Leak Sanitizer (#154584)"
This reverts commit c79c7bbe615265b6b3d7df39d6d5a68afd7d6b2a.

Reverted https://github.com/pytorch/pytorch/pull/154584 on behalf of https://github.com/cyyever due to Need to suppress more output ([comment](https://github.com/pytorch/pytorch/pull/154584#issuecomment-2995792265))
2025-06-23 10:08:40 +00:00
4f70fbbd16 Revert "Use CMake wholearchive group (#156393)"
This reverts commit d1b4e0fa9a5feb22fc6de1d36dc4c9dac685caed.

Reverted https://github.com/pytorch/pytorch/pull/156393 on behalf of https://github.com/etaf due to This PR is breaking XPU windows build. ([comment](https://github.com/pytorch/pytorch/pull/156393#issuecomment-2995576362))
2025-06-23 09:03:19 +00:00
92409b6c89 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with [RFC] [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories, such as HuggingFace [so many if-else conditional code](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code). We would like to introduce a generic API set under torch.accelerator namespace to generalize these user cases.

<div align="center">
<table>
<tr>
<td> Device-specific memory APIs torch.xxx.foo</td> <td> Device-agnostic memory APIs torch.accelerator.foo</td>
</tr>
<tr>
<td>

```python
torch.xxx.empty_cache
```

</td>
<td>

```python
torch.accelerator.empty_cache
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_peak_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_peak_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.reset_accumulated_memory_stats
```

</td>
<td>

```python
torch.accelerator.reset_accumulated_memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_stats
```

</td>
<td>

```python
torch.accelerator.memory_stats
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_allocated
```

</td>
<td>

```python
torch.accelerator.memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_allocated
```

</td>
<td>

```python
torch.accelerator.max_memory_allocated
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.memory_reserved
```

</td>
<td>

```python
torch.accelerator.memory_reserved
```

</td>
</tr>

<tr>
<td>

```python
torch.xxx.max_memory_reserved
```

</td>
<td>

```python
torch.accelerator.max_memory_reserved
```

</td>
</tr>

</table>
</div>

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD
2025-06-23 08:49:30 +00:00
d5781c8d21 remove allow-untyped-defs from torch/fx/passes/utils/fuser_utils.py (#156538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156538
Approved by: https://github.com/ezyang
2025-06-23 08:18:16 +00:00
e0ae4ecca8 Refactor cpp codegen to support overridable class attributes. (#155553)
- Refactored CppKernelProxy and CppScheduling to use class-level attributes (kernel_cls, kernel_proxy_cls) for backend-specific kernel customization.
 - Avoids method duplication (e.g., codegen_functions, codegen_node) for backend-specific overrides thus reduces downstream maintenance when upgrading Torch.
 - Ensures type safety with annotations while keeping core logic centralized and extensible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155553
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2025-06-23 07:36:30 +00:00
cyy
67ee0c6725 Remove outdated Android workarounds of nearbyintf (#151292)
This PR uses std::nearbyint on all supported platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151292
Approved by: https://github.com/ezyang
2025-06-23 06:28:15 +00:00
cyy
d1b4e0fa9a Use CMake wholearchive group (#156393)
Use CMake wholearchive group to simplify code. It may also support more OSes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156393
Approved by: https://github.com/ezyang
2025-06-23 06:22:34 +00:00
cyy
099d0d6121 Simplify nvtx3 CMake handling, always use nvtx3 (#153784)
Fall back to third-party NVTX3 if system NVTX3 doesn't exist. We also reuse the `CUDA::nvtx3` target for better interoperability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153784
Approved by: https://github.com/ezyang
2025-06-23 06:12:46 +00:00
31659964a5 [Cutlass] Fix buffer missing issues (#155897)
Handles constants and constant folding with aoti.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155897
Approved by: https://github.com/henrylhtsang
2025-06-23 05:58:39 +00:00
cyy
c79c7bbe61 Enable Leak Sanitizer (#154584)
It enables Leak Sanitizer and also provides a suppression file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154584
Approved by: https://github.com/ezyang
2025-06-23 05:20:27 +00:00
9fed2added Remove remaining CUDA 12.4 CI code (#155412)
Because no 12.4 job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155412
Approved by: https://github.com/ezyang
2025-06-23 05:16:38 +00:00
4cd6e96bf0 [MPSInductor] Fix nested loop var elimination (#156566)
As reduction resuts must be kept around
Add regression test that is specific for this issue

Fixes https://github.com/pytorch/pytorch/issues/156426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156566
Approved by: https://github.com/dcci
2025-06-23 04:35:16 +00:00
d55dc00f84 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-23 02:57:50 +00:00
5b210bb3a6 [BE][9/16] fix typos in torch/ (torch/csrc/) (#156319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156319
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316, #156317
2025-06-23 02:57:50 +00:00
ced90016c1 [BE][7/16] fix typos in torch/ (torch/csrc/) (#156317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156317
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316
2025-06-23 02:57:41 +00:00
cec2977ed2 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-23 02:57:34 +00:00
4ccc0381de [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-23 02:57:28 +00:00
1b2146fc6d [BE][4/16] fix typos in torch/ (torch/_dynamo/) (#156314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156314
Approved by: https://github.com/jingsh
ghstack dependencies: #156313
2025-06-23 02:57:19 +00:00
6ff6630375 [BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-23 02:57:12 +00:00
c55eef79f8 [Inductor][CPP] Enable a config to use a small dequant buffer for woq int4 (#156395)
**Summary**
Add a configuration option to enable a smaller dequantization buffer for WOQ INT4 CPP GEMM template. This can improve the performance of the WOQ INT4 GEMM template in cases where M is small. In such scenarios, matrix B cannot be effectively reused across matrix A, and we found that reducing the Kc block size can lead to better performance.

**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_with_small_buffer_config
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156395
Approved by: https://github.com/jansel
ghstack dependencies: #156407, #156387
2025-06-23 02:00:42 +00:00
3c7079959c [Inductor][CPP] Enable WOQ int4 concat linear (#156387)
**Summary**
Enable the concat linear optimization pass in Inductor for woq int4 linear.

**Test Plan**
```
 python test/inductor/test_cpu_select_algorithm.py -k test_int4_concat_woq_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156387
Approved by: https://github.com/CaoE, https://github.com/jansel
ghstack dependencies: #156407
2025-06-23 01:52:00 +00:00
03023f178c FlexAttn config refactor + ROCm optimisations (#156307)
This PR primarily unifies the flex attention config logic with the GEMM/Conv config approach https://github.com/pytorch/pytorch/pull/147452 this will make it much easier to handle optimisation pathways for particular triton backends.

This PR also introduces:
1. Introduces an exhaustive tuning mode for flex attention via TORCHINDUCTOR_MAX_AUTOTUNE_FLEX_SEARCH_SPACE="EXHAUSTIVE" to allow for wide scale benchmarking for perf investigation use cases.
3. Updates configs for ROCm flex autotune path providing perf optimisations

AMD perf numbers on score mod benchmark (default inputs)
flex_attn | mode | Speedup (Avg) | Speedup (Max)
-- | -- | -- | --
fwd | autotune before PR | 2.608 | 20.56
fwd | autotune after PR | 2.862 | 22
fwd | exhaustive_autotune | 2.943 | 22.471
bwd | autotune before PR | 2.196 | 9.831
bwd | autotune after PR | 2.423 | 11.331
bwd | exhaustive_autotune | 2.566 | 13.87

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156307
Approved by: https://github.com/drisspg, https://github.com/jansel
2025-06-22 22:27:38 +00:00
a5cbb2bcb3 Improve All to All Perf for inter-node use-case (#156376) (#156389)
Summary:

For 16 GPU use-case. NVSHMEM can drive only upto 49GB/s with 8 thread blocks per peer for all to all V use-case. Increasing that to 16 threads per block is able to max out the perf.

Test Plan:
Verify on two hosts
Host1:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip}  --node_rank=0  comms.py --	master-ip ${master_ip} --b 4 --e 256M --n 500 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda
Host2:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip}  --node_rank=1  comms.py --	master-ip ${master_ip} --b 4 --e 256M --n 100 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda

Rollback Plan:

Differential Revision: D76937048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156389
Approved by: https://github.com/kwen2501
2025-06-22 20:45:46 +00:00
a28e6ae38f [OpenReg][2/N] Migrate cpp_extensions_open_device_registration to OpenReg (#156401)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156401
Approved by: https://github.com/albanD
ghstack dependencies: #156400
2025-06-22 18:40:38 +00:00
1d522325b4 [OpenReg][1/N] Migrate cpp_extensions_open_device_registration to OpenReg (#156400)
As the title stated.

**Changes:**

- add resize_ for OpenReg
- migrate related tests into test_openreg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156400
Approved by: https://github.com/albanD
2025-06-22 18:40:38 +00:00
54b8087f63 Improve torch.ops typing (#154555)
Summary:
Cloned https://github.com/pytorch/pytorch/pull/153558 from benjaminglass1 and fixed internal typing errors.

Fixes longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.

Decisions made along the way:

1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.

The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.

Test Plan: CI

Differential Revision: D75497142

Co-authored-by: Benjamin Glass <bglass@quansight.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154555
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/zou3519, https://github.com/benjaminglass1
2025-06-22 15:52:27 +00:00
10fb98a004 [Precompile] Hook up backend="inductor" (#155387)
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.

One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
2025-06-22 15:05:08 +00:00
b5c8b8d09f Revert "[dynamo] control one_graph behavior additionally through config (#154283)"
This reverts commit b46eb1ccaff944cdcd43e9ce3958819226d2952f.

Reverted https://github.com/pytorch/pytorch/pull/154283 on behalf of https://github.com/ezyang due to All of this is responsible for regression, see https://github.com/pytorch/pytorch/pull/156561 ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2994242583))
2025-06-22 14:22:07 +00:00
5e56db59d4 Revert "[dynamo] add set_fullgraph decorator/context manager (#154289)"
This reverts commit 2c372a0502578e0136a84423c3f49c19c26d6bb7.

Reverted https://github.com/pytorch/pytorch/pull/154289 on behalf of https://github.com/ezyang due to All of this is responsible for regression, see https://github.com/pytorch/pytorch/pull/156561 ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2994242583))
2025-06-22 14:22:07 +00:00
c10eeb5bad Revert "[dynamo] fix set_fullgraph for nested calls (#154782)"
This reverts commit 537b0877a87948bc221301a518fdbc1cf772bc7e.

Reverted https://github.com/pytorch/pytorch/pull/154782 on behalf of https://github.com/ezyang due to All of this is responsible for regression, see https://github.com/pytorch/pytorch/pull/156561 ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2994242583))
2025-06-22 14:22:07 +00:00
ee3d9969cc Revert "[dynamo] handle fullgraph toggle using nested torch.compile (#155166)"
This reverts commit 24dc33b37b50ec92da08fc693dd83e7c87b74f8b.

Reverted https://github.com/pytorch/pytorch/pull/155166 on behalf of https://github.com/ezyang due to All of this is responsible for regression, see https://github.com/pytorch/pytorch/pull/156561 ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2994242583))
2025-06-22 14:22:07 +00:00
f1331f3f1b Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)"
This reverts commit 3627270bdf17b0fb6f528ca1cb87d6f2ec32680a.

Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
5b427c92a8 Revert "[BE][4/16] fix typos in torch/ (torch/_dynamo/) (#156314)"
This reverts commit ead741c5fb0036e0fc95b79d4fe1af3a426e1306.

Reverted https://github.com/pytorch/pytorch/pull/156314 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
145d4cdc11 Revert "[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)"
This reverts commit c2f0292bd5b4b3206f5b295e96f81cd6c178eb18.

Reverted https://github.com/pytorch/pytorch/pull/156315 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
3f44fdc03d Revert "[BE][6/16] fix typos in torch/ (#156316)"
This reverts commit b210cf1ea56bcd9f937a2805d9e70d8684d25ee4.

Reverted https://github.com/pytorch/pytorch/pull/156316 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
035a68d25a Revert "[BE][7/16] fix typos in torch/ (torch/csrc/) (#156317)"
This reverts commit ee72815f1180fe2d8bcdb23493999256169ac2fa.

Reverted https://github.com/pytorch/pytorch/pull/156317 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:56 +00:00
1d3bca40ed Revert "[BE][9/16] fix typos in torch/ (torch/csrc/) (#156319)"
This reverts commit a23ccaa8479e038e79532759a64e9947c0fac43d.

Reverted https://github.com/pytorch/pytorch/pull/156319 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:56 +00:00
4b55871e06 Revert "[BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)"
This reverts commit c95f7fa874a3116f1067f9092456ee7281003614.

Reverted https://github.com/pytorch/pytorch/pull/156321 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156321#issuecomment-2994163667))
2025-06-22 12:27:36 +00:00
afbf5420b8 [dynamo] fixes to lru_cache message and adding user stack trace in debug mode (#156463)
This PR refers to the issue: https://github.com/pytorch/pytorch/issues/155352

This PR uses torch._dynamo.utils.warn_once so that this warning only emits once, clarifies in the warning that silent incorrectness is potential, not observed, Doesn't warn for functions that come from torch.*

As of right now with this code change the terminal outputs:

if the code came from torch.* :
Nothing, as we shouldn't warn for functions that come from torch.*

else:
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)

If the user runs the command 'TORCH_LOGS="+dynamo" python foo4.py', in the debug logs it shows(this log below is based on chillee's repro:
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] call to a lru_cache` wrapped function from user code at: /data/users/ssubbarao8/pytorch/foo4.py:9
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0]   File "/data/users/ssubbarao8/pytorch/foo4.py", line 9, in <module>
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0]     torch.compile(foo, backend="eager")(torch.randn(4))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156463
Approved by: https://github.com/williamwen42
2025-06-22 11:40:28 +00:00
aeaf6b59e2 [dynamo] Weblink generation when unimplemented_v2() is called (#156033)
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
2025-06-22 11:39:31 +00:00
c95f7fa874 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-22 08:43:49 +00:00
a23ccaa847 [BE][9/16] fix typos in torch/ (torch/csrc/) (#156319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156319
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316, #156317
2025-06-22 08:43:49 +00:00
ee72815f11 [BE][7/16] fix typos in torch/ (torch/csrc/) (#156317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156317
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316
2025-06-22 08:43:41 +00:00
b210cf1ea5 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-22 08:43:33 +00:00
c2f0292bd5 [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-22 08:43:26 +00:00
ead741c5fb [BE][4/16] fix typos in torch/ (torch/_dynamo/) (#156314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156314
Approved by: https://github.com/jingsh
ghstack dependencies: #156313
2025-06-22 08:43:18 +00:00
3627270bdf [BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-22 08:43:09 +00:00
cyy
08dae945ae Use official CUDAToolkit module in CMake (#154595)
Use CUDA language in CMake and remove forked FindCUDAToolkit.cmake.
Some CUDA targets are also renamed with `torch::` prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154595
Approved by: https://github.com/albanD
2025-06-22 05:44:29 +00:00
1d993fa309 Don't change set_skip_guard_eval_unsafe for DisableContext, since compiler won't run (#156490)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156490
Approved by: https://github.com/anijain2305
2025-06-22 00:51:32 +00:00
333e0e6147 Make build-deps drop builds into current venv again (#156200)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156200
Approved by: https://github.com/malfet
2025-06-22 00:45:02 +00:00
74ebd8d14e use guard_or_false for expand utils reduction (#155868)
This is classic broadcast like pattern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155868
Approved by: https://github.com/bobrenjc93
2025-06-21 23:42:19 +00:00
f70c80105e Enables NCCL symmetric memory kernels through mempool registration (#155134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155134
Approved by: https://github.com/kwen2501

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-06-21 23:24:04 +00:00
9e132b770e [CUDA] Skip test on low vram machines (#156548)
I noticed some jobs error out after merging #155397 due to the test requiring >15GB GPU memory to execute and some of the machines it's running on has 8GB GPUs. This PR adds the skip option on those machines.

CC: @eqy @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156548
Approved by: https://github.com/eqy, https://github.com/malfet
2025-06-21 22:32:57 +00:00
e4ae60a413 [SymmMem] Add NVSHMEM Quiet support to Triton (#156475)
This PR introduces device-side NVSHMEM completion guarantees via the quiet API in Triton, enabling GPU kernels to ensure all pending remote memory operations are fully complete before proceeding with subsequent operations.

Changes:
- Added a new `core.extern` wrapper for `nvshmem_quiet` in `nvshmem_triton.py`
- Implemented `test_triton_quiet` in `test/distributed/test_nvshmem.py`, including:
  - A Triton kernel that performs `putmem_block` followed by `quiet()` to ensure completion
  - Flag-based signaling only after `quiet()` completes, guaranteeing data delivery
  - Consumer validation that when the completion flag arrives, all data transfers are guaranteed complete

Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_quiet`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156475
Approved by: https://github.com/kwen2501
ghstack dependencies: #156472, #156473, #156474
2025-06-21 22:19:58 +00:00
c2d1b225e6 [PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move and `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)` such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```

**Results:**
For instance, for the `res2net101_26w_4s` model in pytorch benchmark, when running with `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory are
* baseline: 7.73GiB
* with the chage: 6.45GiB

As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.

cc and credit to @ShatianWang for noticing this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
2025-06-21 19:57:21 +00:00
04b91a9e43 [SymmMem] Add NVSHMEM Fence support to Triton (#156474)
This PR introduces device-side NVSHMEM memory ordering via the fence API in Triton, enabling GPU kernels to enforce completion and ordering of remote memory operations before subsequent operations proceed.

 Changes:
- Added a new `core.extern` wrapper for `nvshmem_fence` in `nvshmem_triton.py`
- Implemented `test_triton_fence` in `test/distributed/test_nvshmem.py`, including:
  - A Triton kernel that performs two ordered `putmem_block` operations separated by `fence()` calls
  - Final fence before flag update to ensure all data transfers complete before signaling
  - Consumer validation that both buffers contain expected values when flag arrives, proving ordering guarantees

 Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_fence`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156474
Approved by: https://github.com/mandroid6, https://github.com/kwen2501
ghstack dependencies: #156472, #156473
2025-06-21 18:57:05 +00:00
c06c2569ee [ca] Support TorchDispatchMode via pass through (#156516)
The CA initial trace just proxies nodes without dispatching any ops, we should hide it from ambient TorchDispatchModes

In terms of differences with eager autograd engine:
- For function mode, CA additionally disables/re-enables `_set_multithreading_enabled`
- For dispatch mode:
  - accumulate grad doesn't go down the stealing path (inaccurate compile-time refcount) so the grad `detach` ops are `copy_` instead
  - Since we always initial trace with dynamic shapes, and we filter out sizes, there's 1 aten.empty.memory_format for each mark_dynamic'd scalar

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156516
Approved by: https://github.com/jansel
ghstack dependencies: #156374, #156509
2025-06-21 18:33:47 +00:00
5f2f343e1e [ca] suggest to disable compiled autograd for trace-time NotImplementedErrors (#156509)
Example:

```python
  File "/home/xmfan/core/a/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: TorchDispatchMode not yet implemented for compiled autograd.
  You can disable compiled autograd for this operation by:
  1.  Relocating the unsupported autograd call outside the compiled region.
  2.  Wrapping the unsupported autograd call within a scope that disables compiled autograd.
  3.  Configuring the specific compilation unit to disable compiled autograd.
  4.  Globally disabling compiled autograd at the application's initialization.
```

No duplicate error messages for python side trace-time errors
```python
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xmfan/core/a/pytorch/torch/_dynamo/compiled_autograd.py", line 344, in begin_capture
    raise NotImplementedError(
NotImplementedError: Found tensor of type <class 'torch.nn.utils._expanded_weights.expanded_weights_impl.ExpandedWeight'>, which is not supported by FakeTensorMode. You can turn off compiled autograd by either:
1. Moving the unsupported autograd call outside of the torch.compile'd region.
2. Wrapping the unsupported autograd call in the torch._dynamo.compiled_autograd._disable() context manager.
3. Setting torch._dynamo.config.compiled_autograd=False for the torch.compile call containing the unsupported autograd call.
4. Setting torch._dynamo.config.compiled_autograd=False at the start of the program.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156509
Approved by: https://github.com/jansel
ghstack dependencies: #156374
2025-06-21 18:33:46 +00:00
f1968a5e76 [ca] skip on some PYTORCH_TEST_WITH_DYNAMO=1 autograd tests (#156374)
These aren't supported. Not sure how they passed CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156374
Approved by: https://github.com/jansel
2025-06-21 18:33:38 +00:00
fab85fc5f9 [compile][hierarchical compilation] Release nested_compile_region API (#156449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156449
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-06-21 15:14:59 +00:00
fb75dea2c1 [logging] dynamo_timed for CachingAutotuner.coordinate_descent_tuning (#156517)
Summary: Discussed internally at https://fburl.com/workplace/v3hllrs9. With coordinate descent tuning enabled, we're missing the dynamo_timed logging.

Test Plan:
`TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 buck run mode/opt caffe2/benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 1 --performance --cold-start-latency`
* tlparse: https://fburl.com/bh2hxw4z
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/u88ogw39
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/yqljow6c

Rollback Plan:

Differential Revision: D77053918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156517
Approved by: https://github.com/mengluy0125
2025-06-21 14:17:19 +00:00
a47ca4fc74 Revert "[dynamo] Weblink generation when unimplemented_v2() is called (#156033)" (#156546)
Broke multiple CI jobs: dynamo/test_reorder_logs.py::ReorderLogsTests::test_constant_mutation [GH job link](https://github.com/pytorch/pytorch/actions/runs/15792695433/job/44521220864) [HUD commit link](9de23d0c29)

This reverts commit 9de23d0c29dfac8dc0f6f234bdbcd85a6375fa81.

PyTorch bot revert failed: https://github.com/pytorch/pytorch/pull/156033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156546
Approved by: https://github.com/jansel
2025-06-21 14:10:12 +00:00
d846e21355 Revert "[nativert] move layout planner algorithms to libtorch (#156508)"
This reverts commit eab45643f22e58ee12d95d8b0162d51ca0a50801.

Reverted https://github.com/pytorch/pytorch/pull/156508 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/15793524714/job/44524067679) [HUD commit link](eab45643f2) ([comment](https://github.com/pytorch/pytorch/pull/156508#issuecomment-2993589983))
2025-06-21 13:42:40 +00:00
1cfdcb975a [CUDA] fix illegal memory access in attention (#155397)
Fixes https://github.com/pytorch/pytorch/issues/150054

CI seemed to be messed up in the old one, old PR:
https://github.com/pytorch/pytorch/pull/155145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155397
Approved by: https://github.com/ngimel
2025-06-21 12:32:00 +00:00
cd75cf3cab [symm_mem] Add one side put API for nvshvem (#156443)
`nvshmem_put(Tensor tensor, int peer)`, where `tensor` must be a symmetric tensor, i.e. rendezvoused before this call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156443
Approved by: https://github.com/kwen2501

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-06-21 12:16:36 +00:00
4ff0e033c1 [SymmMem] Add NVSHMEM signal_wait_until support to Triton (#156473)
This PR introduces device-side NVSHMEM signal synchronization via the signal_wait_until API in Triton, enabling GPU kernels to block until a signal variable meets a specified condition. This replaces previous barrier-based synchronization patterns with more efficient signal-based coordination between PEs.

Changes:
- Added a new `core.extern` wrapper for `nvshmem_signal_wait_until` in `nvshmem_triton.py`
- Updated existing `test_triton_put_signal` and `test_triton_put_signal_add` tests to use `signal_wait_until` instead of `dist.barrier()` for proper device-side synchronization ([per feedback](https://github.com/pytorch/pytorch/pull/156211#discussion_r2153035675))
- Implemented `test_triton_signal_wait_until` with:
  - Producer-consumer pattern where Rank 0 puts data and signals completion via `putmem_signal_block`
  - Consumer (Rank 1) uses `signal_wait_until` to block until the signal variable reaches the expected value
  - End-to-end validation of both data transfer and signal synchronization

Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_signal_wait_until`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156473
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
ghstack dependencies: #156472
2025-06-21 10:55:40 +00:00
8485f19507 remove gso from vector_norm (#156530)
guard_or_false here does same thing that guard_size_oblivuous do, note that
size is >=0 and this is size like by definition since its a tensor size
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156530
Approved by: https://github.com/bobrenjc93
2025-06-21 08:42:36 +00:00
6ffa03ef9e [Inductor-CPU] int8 WoQ concat linear (#153004)
### Summary

int8 WoQ GEMM concat linear optimization pertaining to the same activation applied to 3 sets of weights of the same shape.

### Perf data

GPT-J 128 input tokens, 128 output tokens.
32 physical cores of one socket of Intel(R) Xeon(R) 6972P (Xeon Gen 5). tcmalloc & Intel OpenMP were preloaded.

| May 8 nightly first token latency | First token latency with this implementation | Rest token latency with May 8 nightly | Rest token latency with this implementation combined with #149373  |
|---|---|---|---|
|202 ms | 190 ms | 33 ms | 30 ms|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153004
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel

Co-authored-by: Anthony Shoumikhin <anthony@shoumikh.in>
2025-06-21 08:40:09 +00:00
35321b2ad6 remove make_fast_binary_impl from make_fast_binary_impl (#156528)
This was added in https://github.com/pytorch/pytorch/pull/133584.
Take slow path when we cant determine fast path is valid.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156528
Approved by: https://github.com/bobrenjc93
2025-06-21 08:27:54 +00:00
eab45643f2 [nativert] move layout planner algorithms to libtorch (#156508)
Summary: tt

Test Plan:
ci

Rollback Plan:

Differential Revision: D76832891

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156508
Approved by: https://github.com/zhxchen17
2025-06-21 07:35:40 +00:00
bf50d71553 Add missing inline namespace CPU_CAPABILITY to Gelu/Elu.h (#156512)
As I recently learned the hard way (#156243), it is necessary to put kernel code that uses Vectorized in headers in this namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156512
Approved by: https://github.com/malfet
2025-06-21 06:26:23 +00:00
e3b44edfd8 [SymmMem] Add NVSHMEM wait_until support to Triton (#156472)
This PR introduces device-side NVSHMEM synchronization via the wait_until API in Triton, enabling GPU kernels to block until a remote flag reaches a specified value. It also adds a corresponding end-to-end test to validate correct behavior across PEs.

 Changes:
- Added a new `core.extern` wrapper for `nvshmem_longlong_wait_until` in `nvshmem_triton.py`.
- Implemented `test_triton_wait_until` in `test/distributed/test_nvshmem.py`, including:
  - A simple Triton kernel that calls `nvshmem.wait_until` on a symmetric memory flag.
  - Coordination logic where Rank 0 blocks until Rank 1 atomically sets the flag and transfers data.

Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_wait_until`

```python
@triton.jit
def put_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
    nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)

@triton.jit
def wait_until_kernel(ivar_ptr, cmp_op: tl.constexpr, cmp_val: tl.constexpr):
    nvshmem.wait_until(ivar_ptr, cmp_op, cmp_val)

...

if rank == 0:
    print(f"[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21")
    wait_until_kernel[(1, 1, 1)](ivar_ptr, cmp_op=NVSHMEM_CMP_EQ, cmp_val=flag_val, extern_libs=nvshmem_lib)
    print(f"[RANK 0] WAIT IS OVER! Flag was set, checking data now...")
    print(f"[RANK 0] Current out buffer contents: {out.tolist()}")
    torch.testing.assert_close(out, val * torch.ones(numel, dtype=dtype, device=self.device))
    print(f"[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.")

if rank == 1:
    print(f"[RANK 1] About to PUT 8 elements of value 13 to rank 0")
    put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)
    print(f"[RANK 1] About to PUT flag value 21 to wake up rank 0")
    put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=1, peer=peer, extern_libs=nvshmem_lib)
    print(f"[RANK 1] FLAG PUT complete! Rank 0 should wake up now.")

...
```
Output:
```
[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21
[RANK 1] About to PUT 8 elements of value 13 to rank 0
[RANK 1] About to PUT flag value 21 to wake up rank 0
[RANK 1] FLAG PUT complete! Rank 0 should wake up now.
[RANK 0] WAIT IS OVER! Flag was set, checking data now...
[RANK 0] Current out buffer contents: [13, 13, 13, 13, 13, 13, 13, 13]
[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.
[RANK 0] Test completed successfully! 🎉
[RANK 1] Test completed successfully! 🎉

...

----------------------------------------------------------------------
Ran 1 test in 18.773s
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156472
Approved by: https://github.com/kwen2501
2025-06-21 06:18:31 +00:00
92c79f36db [PGO] frame-specific whitelist logging (#155959)
Summary:
In D75617963, we started logging dynamic whitelist suggestions to PT2 Compile Events. The whitelists were aggregated across all frames, intending to avoid manual work for the user (e.g. if frame 0/1 saw L['x'] turn dynamic, and later 1/1 saw L['y'], we'd log "L['x'],L['y']" on frame 1/1).

This switches to frame-specific whitelists, as attributing dynamism changes to certain frames was difficult, and suggestions are sometimes polluted by problematic frames (e.g. optimizer states).

The globally aggregated whitelist is still available in tlparse, by looking at the final `put_local_code_state_*` entry.

Test Plan:
loggercli codegen GeneratedPt2CompileEventsLoggerConfig

Rollback Plan:

Differential Revision: D76628834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155959
Approved by: https://github.com/bobrenjc93
2025-06-21 06:15:51 +00:00
9de23d0c29 [dynamo] Weblink generation when unimplemented_v2() is called (#156033)
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
2025-06-21 05:47:54 +00:00
b8ace6f951 Make dtensor tests device agnostic (#155687)
## MOTIVATION
This PR is a continuation of https://github.com/pytorch/pytorch/pull/154840 and we are trying to make the tests more device agnostic by removing hard coded references to any particular device.
Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66

## CHANGES
1. test_convolution_ops.py:
    - Replace "cuda" with self.device_type
2. test_random_ops.py:
    - Remove setting and using TYPE_DEVICE variable since device_type is set as per the environment (device) in DTensorTestBase class.
    - Replace "cuda" with self.device_type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155687
Approved by: https://github.com/EikanWang, https://github.com/d4l3k
2025-06-21 04:51:59 +00:00
f3ec16c26a [MTIA Aten Backend][3/n] Migrate mm.out from out-of-tree to in-tree (#154393)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate mm.out from out-of-tree to in-tree.

We dispatch mm.out to MTIA separately from CPU/CUDA. So this diff adds the file `MTIAOps.cpp` under `ATen/native/mtia` to hold the dispatched functions. In future we can split `MTIAOps.cpp` to categorized ops files.

Differential Revision: [D74743849](https://our.internmc.facebook.com/intern/diff/D74743849/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154393
Approved by: https://github.com/albanD, https://github.com/egienvalue, https://github.com/nautsimon
2025-06-21 04:31:04 +00:00
88b9c285e0 Workaround for e4m2 dtype (#156461)
Found in: https://github.com/pytorch/ao/pull/2408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156461
Approved by: https://github.com/vkuzo
2025-06-21 04:01:44 +00:00
554b568040 Add internal use only utility to allow externally visible side effects within HOPs (#155715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155715
Approved by: https://github.com/zou3519
2025-06-21 03:55:28 +00:00
c09b054878 Add runtime profiler info for AOTDispatcher prologue (#155785)
Fixes #155721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155785
Approved by: https://github.com/bdhirsh
2025-06-21 03:34:07 +00:00
fd8ea3c8a3 [symm_mem] Add nccl as a backend for symmetric memory (#155740)
Running unit test:

 TORCH_SYMMMEM=NCCL TORCH_DISTRIBUTED_DEBUG=INFO TORCH_CPP_LOG_LEVEL=INFO pytest test/distributed/test_nccl.py -k test_nccl_symmem_alloc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155740
Approved by: https://github.com/kwen2501
2025-06-21 03:22:23 +00:00
ee56e9f8a8 [BE] Make Eigen an optional dependency (#155955)
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen ever and gitlab can be down when github is up, which causes spurious CI failures in the past, for example.

Remove eigen submodule and replace it with eigen_pin.txt

Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
2025-06-21 03:02:02 +00:00
b4228a94d1 Split the exclude pattern for CODESPELL linter (#156229)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156229
Approved by: https://github.com/albanD
ghstack dependencies: #156080, #156081
2025-06-21 02:47:40 +00:00
e3507c3777 [BE] fix typos in functorch/ and scripts/ (#156081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156081
Approved by: https://github.com/albanD
ghstack dependencies: #156080
2025-06-21 02:47:40 +00:00
2ccfd14e23 [BE] fix typos in docs/ (#156080)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156080
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-06-21 02:47:32 +00:00
clr
9aaa184105 dynamo: Don't crash when someone tries to access a non existent list member (#156335)
dynamo: Don't crash when someone tries to access a non existent list member

Test added which reproduces the failure. Note that I'm using the new
unimplemented_v2 API. Let me know if people have a strong preference that I use
something else.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156335
Approved by: https://github.com/jansel
2025-06-21 02:26:31 +00:00
ac86ec0e60 [Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/ngimel

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-06-21 01:34:41 +00:00
e98dd95446 [nativert] Move SerialGraphExecutor to PyTorch core (#156459)
Summary: `SerialGraphExecutor` inherits from `GraphExecutorBase` and executes all nodes in the graph in a serial manner

Test Plan:
CI

Rollback Plan:

Differential Revision: D76917966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156459
Approved by: https://github.com/zhxchen17, https://github.com/jingsh
2025-06-21 01:32:06 +00:00
a67eb1a0d6 [ez] remove unused functions (#156466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156466
Approved by: https://github.com/jingsh
2025-06-21 00:38:34 +00:00
2ee23175d9 [dynamo][guards] Catch exception and return false in the backend match (#156341)
Its difficult to write a test. I found this while debugging a sefgault.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156341
Approved by: https://github.com/williamwen42
2025-06-21 00:13:26 +00:00
0f0c010714 [c10d] init_process_group supports index-only device id (#156214)
Before:
```
acc = torch.accelerator.current_accelerator()
if acc:
  local_idx = ...
  dist.init_process_group(
    device_id=torch.device(acc.type, local_idx)
  )
```
After:
```
dist.init_process_group(device_id=local_idx)
```

That is, `init_process_group` checks `torch.accelerator.current_accelerator()` internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156214
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-06-21 00:02:37 +00:00
fbbab794ef [ONNX] Implement Attention-23 (#156431)
Implement Attention-23 using sdpa and flexattention.

- I used copilot for this.
- Also updated the conversion logic to remove trailing None inputs.

@gramalingam @kunal-vaishnavi @titaiwangms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156431
Approved by: https://github.com/titaiwangms

Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-20 23:54:57 +00:00
0ad88a2224 Support environement var for autotune log (#156254)
Summary: Titled

Test Plan:
See the scadcastle signal

Rollback Plan:

Differential Revision: D76860928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156254
Approved by: https://github.com/Mingming-Ding
2025-06-20 23:06:33 +00:00
6098209bff [BE][5/X] Phase out usage of use_max_autotune() (#156269)
These look to be the last call sites using `use_max_autotune(...)`, so remove those and `use_max_autotune(...)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156269
Approved by: https://github.com/masnesral
2025-06-20 22:37:45 +00:00
5ab257c74c [invoke_subgraph] Make invoke_subgraph cacheable (#156448)
Its unclear to me what happens if the subgraph itself is not cacheable. Imo, there is nothing special about invoke_subgraph to prevent any caching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156448
Approved by: https://github.com/oulgen, https://github.com/zou3519
2025-06-20 21:20:23 +00:00
e2351f2dcf fix apparent copy-paste bug in log_softmax reduced-precision fp kernel (#156379)
This looks like a bug. Check if trying to fix it breaks existing tests; if not, will look into why no test coverage caught it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156379
Approved by: https://github.com/janeyx99
2025-06-20 20:54:53 +00:00
b8fc5e0c0d skip flaky test in CPython 3.13 tests (#155561)
Changed files:
* test_math.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155561
Approved by: https://github.com/zou3519
2025-06-20 20:25:35 +00:00
754c04aa06 Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)"
This reverts commit 0aed855b2bde6d9bd045bb20cc24544a9f2fb72b.

Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/ezyang due to regresses functorch_maml_omniglot ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2992685744))
2025-06-20 20:18:24 +00:00
de1930a429 Add ONNX dynamo metadata documentation (#155816)
Describe auto-generated metadata when calling torch.onnx.export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155816
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-20 20:12:22 +00:00
a69e27ca5a Remove unused MultiKernelCall import from inductor codegen (#156158)
Since it's now actually used within async_compile.multi_kernel

```
    def multi_kernel(self, *args, **kwargs) -> Any:
        from torch._inductor.codegen.multi_kernel import MultiKernelCall

        # no need to call this in parallel since the sub-kernels are already parallel tasks
        return MultiKernelCall(*args, **kwargs)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156158
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-06-20 19:55:24 +00:00
e5ea24fb27 [nativert] Move auto_functionalize_kernel (#156454)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:

fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Copied from original auto_functionalize Diff Summary D53776805:

This is a non-functional kernel implementation for auto_functionalize

In AutoFunctionalizeKernel, I directly call the underlying target without making a clone of mutating inputs.

This would mutates the input tensors inplace, which is unsafe in general.

However, Sigmoid is not doing any graph optimization, or node reordering at the moment, so it's ok do take this short cut.

In the proper functional implementation, it will

make a clone of the mutating input tensor

return these new instance of tensors as AutoFunctionalizeKernel output.

If the original exported program has some "bufferMutation" or "userInputMutation" fields, it will also need to honor such mutations in Sigmoid.

Test Plan: See internal for test plan

Differential Revision: D76926383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156454
Approved by: https://github.com/zhxchen17
2025-06-20 19:53:16 +00:00
eb331b59fe Add shim fallback for narrow (#156496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156496
Approved by: https://github.com/albanD
2025-06-20 19:47:00 +00:00
6ed85bfe6a Refine alignment check along dynamic dimension for grouped MMs (#155466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155466
Approved by: https://github.com/ngimel
2025-06-20 19:42:57 +00:00
ef6d2cee7a [BE][MPS] Refactor core matmul logic into matmul_core (#155969)
In preparation of adding integer addmm, move matmul computation part into matmul_inner function

Change callstack from group_id, thread_id_in_group to thread_id, threadid_in_group, which eliminates the need of calculating the index
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
2025-06-20 18:54:38 +00:00
18e4c461fb Update index.md (#155143)
Related to: https://github.com/pytorch/pytorch/issues/152134
Update to index.md to add language for Stable and Unstable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155143
Approved by: https://github.com/AlannaBurke, https://github.com/atalman

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-20 18:53:32 +00:00
502486d946 [PT2]Add weight and constant config path template (#156359)
Summary: At title.

Test Plan:
N/A

Rollback Plan:

Differential Revision: D76925510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156359
Approved by: https://github.com/SherlockNoMad
2025-06-20 18:46:01 +00:00
4b6cbf528b Add C shim fallback for fill_ (#156245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156245
Approved by: https://github.com/desertfire
2025-06-20 18:45:48 +00:00
208ec60e72 Revert "[BE] Make Eigen an optional dependency (#155955)"
This reverts commit 1b50c12584909bda00009f4f0fd0d38ec792d019.

Reverted https://github.com/pytorch/pytorch/pull/155955 on behalf of https://github.com/atalman due to need to revert eigen test ([comment](https://github.com/pytorch/pytorch/pull/155955#issuecomment-2992512124))
2025-06-20 18:43:52 +00:00
d309cd1d50 Revert "[BE][MPS] Refactor core matmul logic into matmul_core (#155969)"
This reverts commit 769d754ab2469813a3b790ec58c25c466099dd3d.

Reverted https://github.com/pytorch/pytorch/pull/155969 on behalf of https://github.com/atalman due to need to revert eigen test ([comment](https://github.com/pytorch/pytorch/pull/155969#issuecomment-2992502683))
2025-06-20 18:40:38 +00:00
96d082d06b Revert "[InductorBench] Fix accuracy validation logic for MPS (#156385)"
This reverts commit 242eb19c8383b4b197963a8a564475d52c85ac66.

Reverted https://github.com/pytorch/pytorch/pull/156385 on behalf of https://github.com/malfet due to Has some bug in error handling ([comment](https://github.com/pytorch/pytorch/pull/156385#issuecomment-2992441769))
2025-06-20 18:17:18 +00:00
39270430c9 [inductor] force min num-split (off by default) (#155941)
This is a fix for the 10% QPS regression of some internal model (internal doc: [here](https://docs.google.com/document/d/19EiSZSS_SNUNfRg3jmevyrDs9nVpyvyGX_LHfiz-SbU/edit?tab=t.0#heading=h.dim0r28ztzu5) and [here](https://docs.google.com/document/d/1DjRWJPl1cgpceaj8YXTyw6FubGb43Vw-lTAETF9XXnI/edit?tab=t.0#heading=h.ld0vvn8o77sp) ).

The regression is caused by un-representable example inputs for compilation with dynamic shapes. While the general problem is hard to solve and requires more work, for this specific one, there is a quick fix. When we compile LayerNormBackward with small xnumel and large rnumel, we do split reduction. With un-representative inputs, rnumel may be something in the range like 4K and we pick a small num-split (9 in this specific case). Later on when we get an inputs with larger rnumel (100K range. no recompile due to dynamic shape enabled), the small num-split does not introduce enough parallelism and cause sub-optimal performance.

 The quick fix is to force a minimum value for num_split. Let's say we split a reduction [xnueml, rnueml] to two in this order:
- [xnumel * num_split, rnumel / num_split]
- [xnumel, num_split]

A larger num_split always introduce more parallelism for kernel 1. It may results in more work in kernel 2. But if we set the minimum num_split to something not too large (like 256), for kernel2 each row may still be able to get done by reduction with a few or even a single warp. There may not be slow down for kernel 2.

Here are some benchmarking results.
```
import torch
from triton.testing import do_bench
import functools
from torch._inductor import config
from torch._dynamo.decorators import mark_dynamic
import os

@torch.compile(dynamic=True)
def f(x):
    return x.sum(dim=0)

N = 512
C = functools.partial(torch.randn, device="cuda")
x_small = C(4096, N)
x_large = C(4096 * 1000, N)

if os.getenv("HINT_WITH_SMALL_INPUT") == "1":
    x = x_small
else:
    x = x_large

mark_dynamic(x, 0)
f(x)

ms = do_bench(lambda: f(x_large))

# 4.03ms if hint with large input. Output code: https://gist.github.com/shunting314/0be562a0c14f8ec0852b12bbf53d7a15
# 8.32ms if hint with small input. Output code: https://gist.github.com/shunting314/79b924c266d5c562703c3bdfb48d8272
# 3.92ms if hint with small input, and force min num split: Output code: https://gist.github.com/shunting314/c82917a1849b698bf4d2be2fde2fd2ba
print(ms)
```
This test mimic what we see in the original problem.

- If we compile with large inputs and benchmark for large inputs, latency is 4.03ms
- if we compile with small input but benchmark for large inputs, we get more than 2x slowdown. latency is 8.32ms
- with the fix, even if we compile with small input and benchmark for large inputs, latency is 3.92ms. The perf is slightly better than the first case. So it's possible that the heuristic to decide num-split has room to improve

The minimum num-split restriction could be applied for dynamic shape case solely, but I found it can also help for static shape cases a little bit. So I plan to apply it without checking dynamic shape for now unless I see red signals in thorough perf test.
- Outer reduction with static shape: https://gist.github.com/shunting314/6a670a818e63533479399c4dbea5b29a . The fix improve perf from 0.01 ms to 0.009 ms
- Inner reduction with static shape: https://gist.github.com/shunting314/f12f20099126130b953e55ad325c0f62  Perf is neutral (0.011 ms v.s. 0.011ms)

A thorough perf test is running here: https://github.com/pytorch/pytorch/actions/runs/15642912325

# Update for not applying the change to static shape:
from the perf test result [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2009%20Jun%202025%2020%3A57%3A15%20GMT&stopTime=Mon%2C%2016%20Jun%202025%2020%3A57%3A15%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=62b8e191e027842d402fb046a429732616f87570&rBranch=main&rCommit=5b9db4335e61c1c903cb0769282cbea588e49036), it looks like the change hurts perf for static shape case. I think one reason is the change may increase the number of kernels and lose some fusion opportunities. Check the following code for example:
```
import torch
from torch._inductor import config

aten = torch.ops.aten

def f(x):
    return aten.bernoulli(x).sum()

x = torch.randn(8000 * 3, dtype=torch.bfloat16, device="cuda")
torch.compile(f)(x)
```

With the change the bernoulli kernel would NOT be able to fuse with the first layer reduction due to 8000 * 3 is not divisible by 256. Potentially we could improve the change to always pick num-split greater than 256 and divisible by rnumel . But I'll simply apply the change for dynamic shape for now since that's the original issue.

Another perf test only applying min-num-split to dynamic shape [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2011%20Jun%202025%2018%3A14%3A04%20GMT&stopTime=Wed%2C%2018%20Jun%202025%2018%3A14%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=e7b2cf55f30a585acd4d907fc9127fcb30a256cc&rBranch=main&rCommit=d3d655ad14ee4cd1c135ac57bbf75d5623fc9fa6)

Differential Revision: [D76625617](https://our.internmc.facebook.com/intern/diff/D76625617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155941
Approved by: https://github.com/jansel, https://github.com/bobrenjc93
2025-06-20 18:01:28 +00:00
55dae0bf7a Add a basic shim and stable::Tensor is_contiguous API (#156228)
Add a limited is_contiguous in shim, stable::Tensor API with a test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156228
Approved by: https://github.com/desertfire
2025-06-20 17:59:52 +00:00
49ee1e7106 [CI] Reuse old whl: loosen check for deleted files, do not handle renames (#156138)
Make the check for deleted files only be for files in the torch folder since docs only changes could not get through this
Use `--no-renames` to make both the old name and the old name show up in the diff.  Without it I think only the new name shows up in git diff
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156138
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/cyyever
2025-06-20 17:58:04 +00:00
e31f205292 [Inductor] Adjust boundary checking of dimensions using YBLOCK (#149504)
Apply the same logic introduced in https://github.com/pytorch/pytorch/pull/139751 to triton kernels using block ptrs. Here, if ynumel / YBLOCK > max_y_grids, dimensions dependent on YBLOCK need to be boundary checked, even if the block shape in such dimensions is a multiple of an expression in YBLOCK. This is because ynumel / YBLOCK % get_max_y_grids() may not be zero, so redundant programs will be launched that will attempt to read / write OOB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149504
Approved by: https://github.com/blaine-rister

Co-authored-by: blaine-rister <145300525+blaine-rister@users.noreply.github.com>
2025-06-20 17:43:38 +00:00
d83ff89d3b Add toggle functionality for XPU profiler (#155135)
Fixes #154898 by adding ability to toggle XPU profiler on and off (which has already been added in pytorch/kineto#1088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155135
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-06-20 17:27:48 +00:00
1b50c12584 [BE] Make Eigen an optional dependency (#155955)
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen ever and gitlab can be down when github is up, which causes spurious CI failures in the past, for example.

Remove eigen submodule and replace it with eigen_pin.txt

Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
ghstack dependencies: #155947, #155954
2025-06-20 17:21:27 +00:00
63360e64da [BE][Easy] do not install yanked types-pkg-resources in lint environment (#156462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156462
Approved by: https://github.com/ezyang
2025-06-20 16:00:43 +00:00
1036f6d114 Revert "[ROCm] Bump AOTriton to 0.10b (#156290)"
This reverts commit 34d8e64ef64d88324092a2028884c54c13e086b3.

Reverted https://github.com/pytorch/pytorch/pull/156290 on behalf of https://github.com/atalman due to failing multiple internal tests ([comment](https://github.com/pytorch/pytorch/pull/156290#issuecomment-2992072727))
2025-06-20 15:35:25 +00:00
b4442f42a9 Revert "Upgrade to DLPack 1.0. (#145000)"
This reverts commit 6e185c53124e1b5a0fe391959060c1249178bcb6.

Reverted https://github.com/pytorch/pytorch/pull/145000 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/145000#issuecomment-2992055400))
2025-06-20 15:32:47 +00:00
edd45f3a02 Revert "[Precompile] Hook up backend="inductor" (#155387)"
This reverts commit 2c68c3e8d5e9a235f5861be6486de4959f80c840.

Reverted https://github.com/pytorch/pytorch/pull/155387 on behalf of https://github.com/atalman due to dynamo/test_precompile_context.py::PrecompileContextTests::test_basic [GH job link](https://github.com/pytorch/pytorch/actions/runs/15772892021/job/44464141039) [HUD commit link](2c68c3e8d5) ([comment](https://github.com/pytorch/pytorch/pull/155387#issuecomment-2992044073))
2025-06-20 15:30:04 +00:00
e1f28fe17b add device generalisation support for distributed tests (#152471)
### MOTIVATION
To generalize Distributed test cases for non-CUDA devices

### CHANGES

- test/distributed/optim/test_zero_redundancy_optimizer.py
- test/distributed/test_c10d_logger.py
- test/distributed/test_compute_comm_reordering.py

Replaced hard coded device names with get_devtype from torch.testing._internal.common_fsdp.
DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions.

- torch/testing/_internal/common_distributed.py

extended common utility functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471
Approved by: https://github.com/d4l3k
2025-06-20 07:35:42 +00:00
0aed855b2b [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-20 07:03:29 +00:00
24dc33b37b [dynamo] handle fullgraph toggle using nested torch.compile (#155166)
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings was strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.

Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
2025-06-20 07:03:29 +00:00
537b0877a8 [dynamo] fix set_fullgraph for nested calls (#154782)
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
2025-06-20 07:03:16 +00:00
2c372a0502 [dynamo] add set_fullgraph decorator/context manager (#154289)
Implements https://github.com/pytorch/pytorch/issues/144908.

Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
2025-06-20 07:03:07 +00:00
b46eb1ccaf [dynamo] control one_graph behavior additionally through config (#154283)
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-06-20 07:02:57 +00:00
2c68c3e8d5 [Precompile] Hook up backend="inductor" (#155387)
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.

One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
2025-06-20 06:38:29 +00:00
d5b4a32960 [BE] fix PYPROJECT linting errors in test/ and tools/ (#156021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156021
Approved by: https://github.com/Skylion007
2025-06-20 06:19:05 +00:00
4cbbc8b458 [MPS] Implement backward pass for interpolate_trilinear (#156373)
Backwards pass simply iterates over all 8 points current point contributed to, and back propagates them with the respective weights

TODO: Benchmark the performance of similar loop for the forward pas (i.e. compiler should be able to do loop unrolling, so no point of unrolling it by hand)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156373
Approved by: https://github.com/dcci
ghstack dependencies: #156375
2025-06-20 05:41:24 +00:00
c37ddcaefb Fix torchgen update-aoti-shim (#156323)
will remove the fill changes before landing and let Jane merge her changes!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156323
Approved by: https://github.com/janeyx99
2025-06-20 05:23:06 +00:00
f7a5ad6c29 [Inductor][CPP] Fix WOQ int4 accuracy issue when NC large than one (#156407)
**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. This accuracy issue is exposed by [PR #156174](https://github.com/pytorch/pytorch/pull/156174), which changes `block_N` from 64 to 32. This change increases the likelihood of `Nc_block` being greater than 1, making it more likely to trigger the issue. This PR will fix this accuracy issue.

**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156407
Approved by: https://github.com/CaoE
2025-06-20 03:08:02 +00:00
72c8751b61 Align meta deducing for fft_r2c with fft_r2c_mkl on XPU (#156048)
There is a memory layout mismatching between `fft_r2c` XPU and Inductor meta deducing.
Original `fft_r2c` Inductor meta deducing for XPU backend is aligned with CPU (fallback). This PR is to correct the Inductor meta deducing and update the torch-xpu-ops commit to [intel/torch-xpu-ops@`3a9419c`](3a9419c8bb).
The XPU implementation first performs the R2C transform on the last dimension, followed by iterative C2C transforms on the remaining dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156048
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel
2025-06-20 01:41:03 +00:00
159a39ad34 Add an option for cpp_wrapper to compile entry and kernel separately (#156050)
Fixes #156037.
Compiling entry and kernel separately has a non-negligible impact on the performance. This PR is to add an option for cpp_wrapper to control whether to compile entry and kernel separately, and turn it off by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156050
Approved by: https://github.com/leslie-fang-intel, https://github.com/benjaminglass1, https://github.com/jansel
2025-06-20 01:11:16 +00:00
ebab279942 Forward fix inductor benchmark after #150287 (#156455)
Looks like https://github.com/pytorch/pytorch/pull/150287 stack fixed some inductor tests
HUD: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor-periodic%20%2F%20linux-jammy-cpu-py3.9-gcc11-inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156455
Approved by: https://github.com/huydhn
2025-06-20 00:04:15 +00:00
cyy
3c2324c64a [2/N] Fix cppcoreguidelines-init-variables suppression (#146237)
This PR removes all `cppcoreguidelines-init-variables` suppressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146237
Approved by: https://github.com/ezyang
2025-06-19 23:26:42 +00:00
52f873adc2 Add logging for async compile worker statistics (#155820)
Add some on-exit logging to the async compile workers. When you use `TORCH_LOGS=async_compile` (or `all`) it will now report how many workers were enqueued & dequeued (should be the same) as well as queuing time (how long workers sat on the queue before starting to run) and maximum depth (how many workers were waiting to start.

Tested manually by running a larger internal model and then lowering the number of available workers to see the time and depth get longer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155820
Approved by: https://github.com/masnesral
2025-06-19 23:10:15 +00:00
c60d8188d2 [nativert] Move GraphExecutorBase to PyTorch core (#156196)
Summary:
Moves GraphExecutorBase class to PyTorch core.
GraphExecutorBase is a lightweight abstraction to execute a graph with  execution frames without actually owning the graph nor the weights. This is introduced to decouple the state management of the top level runtime from the kernel executions so that sub graphs from higher order ops can be supported.

Torch Native Runtime RFC: pytorch/rfcs#72

Test Plan:
CI

Rollback Plan:

Differential Revision: D76830436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156196
Approved by: https://github.com/zhxchen17
2025-06-19 22:42:35 +00:00
34d8e64ef6 [ROCm] Bump AOTriton to 0.10b (#156290)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-06-19 21:13:58 +00:00
3644b41a7c [ONNX] Note on attention op symbolic function (#156441)
Follow up https://github.com/pytorch/pytorch/pull/156367
Explain why num_heads is provided when ONNX Attention op does not need it in torch case: The thread: https://github.com/pytorch/pytorch/pull/156367#discussion_r2155727038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156441
Approved by: https://github.com/justinchuby
2025-06-19 21:00:05 +00:00
443b5b43c3 xpu: fix AOT compilation in sycl cpp extension (#156364)
Commit fixes AOT compilation in sycl cpp extension which got accidentally dropped on aca2c99a652 (fallback to JIT compilation had happened). Commit also fixes override logic for default sycl targets allowing flexibility to specify targets externally. Further, commit extends test coverage to cover such a case and fixes issue in the test where consequent tests executed same (first) compiled extension due to name conflicts.

Fixes: #156249
Fixes: aca2c99a652 ("xpu: get xpu arch flags at runtime in cpp_extensions (#152192)")

CC: @pengxin99, @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156364
Approved by: https://github.com/ezyang
2025-06-19 20:11:38 +00:00
d32deb664a [c10d] Disable NCCL NVLS when using deterministic mode (#156381)
via setting env `NCCL_ALGO=^NVLS`.

Note that this setting must be made before the first NCCL init. Otherwise, it won't take effect.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156381
Approved by: https://github.com/ngimel
2025-06-19 20:09:24 +00:00
69f2e09cc2 Add more shards to H100 benchmark, and also run it more frequently (#156429)
There are 32 H100 `linux.aws.h100` and they are still not fully utilized with more than half staying idle, so we could add more shards to finish the whole suite within 4 hours.  I add 1 more for `TIMM` and 3 more for `TorchBench` using the duration from a sample run https://github.com/pytorch/pytorch/actions/runs/15753185459/job/44411825090

With this computing power, we could also run the whole suite every 4 hours now.  I could run this less frequently later if I see queueing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156429
Approved by: https://github.com/atalman
2025-06-19 20:02:56 +00:00
aac0e8f0e9 [build] Create target for flash attention (#156235)
Create a target for flash attention? so it can be built using ninja flash_attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156235
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-06-19 20:02:38 +00:00
c2f4cc59a7 [MPS] Fix bug in 3d coords calculation (#156375)
Which was not caught by CI beforehand, as all 3D examples right now are symmetric, so add an uneven shape to `sample_inputs_interpolate`

Though it's indirectly tested by `test_upsample_nearest3d` inductor test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156375
Approved by: https://github.com/atalman
2025-06-19 19:56:15 +00:00
c0ee01c2fb tools/nightly.py: only download torch via pip and install dependenices via uv (#156409)
Setup time (cpu-only): 70s -> 27.6s -> 17.4s

The tool can setup the pinned NVIDIA dependencies correctly:

```console
$ make setup-env-cuda PYTHON="${HOMEBREW_PREFIX}/bin/python3.13" && source venv/bin/activate
make setup-env PYTHON="/home/linuxbrew/.linuxbrew/bin/python3.13" NIGHTLY_TOOL_OPTS="pull --cuda"
make[1]: Entering directory '/home/PanXuehai/Projects/pytorch'
/home/linuxbrew/.linuxbrew/bin/python3.13 tools/nightly.py pull --cuda
log file: /home/PanXuehai/Projects/pytorch/nightly/log/2025-06-19_21h16m16s_94cd1471-4d0f-11f0-b120-b88584c06696/nightly.log
Creating virtual environment
Removing existing venv: /home/PanXuehai/Projects/pytorch/venv
Creating venv (Python 3.13.4): /home/PanXuehai/Projects/pytorch/venv
Installing packages
Upgrading package(s) (https://download.pytorch.org/whl/nightly/cu128):
  - uv
  - pip
  - setuptools
  - packaging
  - wheel
  - build[uv]
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://download.pytorch.org/whl/nightly/cu128
Collecting uv
  Using cached f2e96cec5e/uv-0.7.13-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.8 MB)
Requirement already satisfied: pip in ./venv/lib/python3.13/site-packages (25.1.1)
Collecting setuptools
  Using cached 17031897da/setuptools-80.9.0-py3-none-any.whl (1.2 MB)
Collecting packaging
  Using cached 38679034af/packaging-25.0-py3-none-any.whl (66 kB)
Collecting wheel
  Using cached 87f3254fd8/wheel-0.45.1-py3-none-any.whl (72 kB)
Collecting build[uv]
  Using cached 80633736cd/build-1.2.2.post1-py3-none-any.whl (22 kB)
Collecting pyproject_hooks (from build[uv])
  Using cached 12818598c3/pyproject_hooks-1.2.0-py3-none-any.whl (10 kB)
Installing collected packages: wheel, uv, setuptools, pyproject_hooks, packaging, build
Successfully installed build-1.2.2.post1 packaging-25.0 pyproject_hooks-1.2.0 setuptools-80.9.0 uv-0.7.13 wheel-0.45.1
Installing packages took 6.251 [s]
Creating virtual environment took 9.050 [s]
Downloading packages
Downloading package(s) (https://download.pytorch.org/whl/nightly/cu128): torch
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://download.pytorch.org/whl/nightly/cu128
Collecting torch
  Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250619%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB)
Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250619%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl (1040.3 MB)
Saved /tmp/pip-download-xeqmhrww/torch-2.8.0.dev20250619+cu128-cp313-cp313-manylinux_2_28_x86_64.whl
Successfully downloaded torch
Downloaded 1 file(s) to /tmp/pip-download-xeqmhrww:
  - torch-2.8.0.dev20250619+cu128-cp313-cp313-manylinux_2_28_x86_64.whl
Downloading packages took 6.284 [s]
Unpacking wheel file
Unpacking to: /tmp/wheel-kugk2os0/torch-2.8.0.dev20250619+cu128...OK
Unpacking wheel file took 15.107 [s]
Installing dependencies
Installing packages
Installing package(s) (https://download.pytorch.org/whl/nightly/cu128):
  - filelock
  - typing-extensions>=4.10.0
  - setuptools; python_version >= "3.12"
  - sympy>=1.13.3
  - networkx
  - jinja2
  - fsspec
  - nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cuda-runtime-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cuda-cupti-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cudnn-cu12==9.10.2.21; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cublas-cu12==12.8.4.1; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cufft-cu12==11.3.3.83; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-curand-cu12==10.3.9.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusolver-cu12==11.7.3.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusparse-cu12==12.5.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusparselt-cu12==0.7.1; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nccl-cu12==2.27.3; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvshmem-cu12==3.2.5; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvtx-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvjitlink-cu12==12.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cufile-cu12==1.13.1.3; platform_system == "Linux" and platform_machine == "x86_64"
  - pytorch-triton==3.3.1+gitc8757738; platform_system == "Linux"
  - numpy
  - cmake
  - ninja
  - packaging
  - ruff
  - mypy
  - pytest
  - hypothesis
  - ipython
  - rich
  - clang-format
  - clang-tidy
  - sphinx
Using Python 3.13.4 environment at: venv
Resolved 78 packages in 2.95s
Installed 76 packages in 93ms
 + alabaster==1.0.0
 + asttokens==3.0.0
 + attrs==24.2.0
 + babel==2.17.0
 + certifi==2024.8.30
 + charset-normalizer==3.3.2
 + clang-format==20.1.6
 + clang-tidy==20.1.0
 + cmake==3.25.0
 + decorator==5.2.1
 + docutils==0.21.2
 + executing==2.2.0
 + filelock==3.18.0
 + fsspec==2025.5.1
 + hypothesis==6.135.11
 + idna==3.10
 + imagesize==1.4.1
 + iniconfig==2.1.0
 + ipython==9.3.0
 + ipython-pygments-lexers==1.1.1
 + jedi==0.19.2
 + jinja2==3.1.6
 + markdown-it-py==3.0.0
 + markupsafe==2.1.5
 + matplotlib-inline==0.1.7
 + mdurl==0.1.2
 + mpmath==1.3.0
 + mypy==1.16.1
 + mypy-extensions==1.0.0
 + networkx==3.5
 + ninja==1.11.1.4
 + numpy==2.3.0
 + nvidia-cublas-cu12==12.8.4.1
 + nvidia-cuda-cupti-cu12==12.8.90
 + nvidia-cuda-nvrtc-cu12==12.8.93
 + nvidia-cuda-runtime-cu12==12.8.90
 + nvidia-cudnn-cu12==9.10.2.21
 + nvidia-cufft-cu12==11.3.3.83
 + nvidia-cufile-cu12==1.13.1.3
 + nvidia-curand-cu12==10.3.9.90
 + nvidia-cusolver-cu12==11.7.3.90
 + nvidia-cusparse-cu12==12.5.8.93
 + nvidia-cusparselt-cu12==0.7.1
 + nvidia-nccl-cu12==2.27.3
 + nvidia-nvjitlink-cu12==12.8.93
 + nvidia-nvshmem-cu12==3.2.5
 + nvidia-nvtx-cu12==12.8.90
 + parso==0.8.4
 + pathspec==0.12.1
 + pexpect==4.9.0
 + pluggy==1.6.0
 + prompt-toolkit==3.0.51
 + ptyprocess==0.7.0
 + pure-eval==0.2.3
 + pygments==2.19.1
 + pytest==8.4.1
 + pytorch-triton==3.3.1+gitc8757738
 + requests==2.32.3
 + rich==14.0.0
 + roman-numerals-py==3.1.0
 + ruff==0.12.0
 + snowballstemmer==3.0.1
 + sortedcontainers==2.4.0
 + sphinx==8.2.3
 + sphinxcontrib-applehelp==2.0.0
 + sphinxcontrib-devhelp==2.0.0
 + sphinxcontrib-htmlhelp==2.1.0
 + sphinxcontrib-jsmath==1.0.1
 + sphinxcontrib-qthelp==2.0.0
 + sphinxcontrib-serializinghtml==2.0.0
 + stack-data==0.6.3
 + sympy==1.14.0
 + traitlets==5.14.3
 + typing-extensions==4.14.0
 + urllib3==2.2.3
 + wcwidth==0.2.13
Installing packages took 3.080 [s]
Installing dependencies took 3.080 [s]
Pulling nightly PyTorch
Found released git version 5622038e20ddb12b9a011c9a9128190d71a21cba
Found nightly release version 2625c70aecc6eced1dbe108279feab7509733bef
Already up to date.
Pulling nightly PyTorch took 0.017 [s]
Moving nightly files into repo
Moving nightly files into repo took 4.898 [s]
Writing pytorch-nightly.pth
Writing pytorch-nightly.pth took 0.021 [s]
-------
PyTorch Development Environment set up!
Please activate to enable this environment:

  $ source /home/PanXuehai/Projects/pytorch/venv/bin/activate

make[1]: Leaving directory '/home/PanXuehai/Projects/pytorch'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156409
Approved by: https://github.com/ezyang
ghstack dependencies: #156408
2025-06-19 19:42:15 +00:00
71faa7e5b9 tools/nightly.py: use uv pip install instead of pip install (#156408)
Setup time: 70s -> 27.6s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156408
Approved by: https://github.com/ezyang
2025-06-19 19:42:15 +00:00
134dfb3fe6 [dynamo] Fix cycle reference problem caused by recursive collect_temp_source in codegen (#155791)
Recursive function collect_temp_source with closure in PyCodegen caused cycle reference issue when torch.compile is used.
This issue may cause major tensors will not freed timely even there are no user references to these tensors.

We saw OOM issues because of this problem in many cases including training and inference using torch.compile.
The fix is to use iterative function implementation to replace the recursive function implementation.

Fixes #155778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155791
Approved by: https://github.com/ezyang
2025-06-19 19:37:44 +00:00
e4c9f6d9a2 [nativert] Move c10_kernel (#156208)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:

fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:c10_kernel_test
```

Differential Revision: D76825830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156208
Approved by: https://github.com/zhxchen17
2025-06-19 17:36:23 +00:00
f402eed4d9 [ROCm] Enable BF16 NCHW Mixed batchnorm on MIOpen if ROCm>=6.4 (#154611)
This PR enables MIOpen for BF16 NCHW Mixed batchnorm if MIOpen version >=3.4 (ROCm >= 6.4)

CUDAHooks::versionMIOpen() was added to detect MIOpen version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154611
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2025-06-19 17:22:37 +00:00
085f270a00 [ROCm] Enable more parallelism for multi-dimensional reductions (#155806)
Enable more parallelism for multi-dimensional reductions. In the case of multi-dimensional reductions the grid often start with a single active block. In such cases, we need to allow the parallelism to be extended along the y-direction of the grid to avoid having a single block running.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155806
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
2025-06-19 17:19:40 +00:00
eaf704914e [aoti] package weights to disk and dedup (#155241)
We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponding to which weight name.

Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storate()`).

- Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile
- If we see `Weights` in aoti_files, we'll automatically package them to disk
- `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently.
- Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False.

Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights"
```

Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7

Rollback Plan:

Differential Revision: D74747190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241
Approved by: https://github.com/desertfire
2025-06-19 17:17:17 +00:00
6e185c5312 Upgrade to DLPack 1.0. (#145000)
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:

- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
  producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
    - Fallback to old implementation if no `max_version` or if version
      lower than 1.0
    - Check that the to-be-consumed capsule is of version up to 1.X

In order to accommodate these new specifications, this PR adds the
following main changes:

- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPAck capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
2025-06-19 16:27:42 +00:00
6eb6f198e1 update codebase structure documentation to include mps (#156297)
📚 The doc update

adding description about mps folder in code structure guide

@albanD @malfet @svekars @sekyondaMeta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156297
Approved by: https://github.com/ezyang
2025-06-19 16:16:29 +00:00
7f0cddfb55 [dynamo] Add documentation for guard_filter_fn (#156114)
Summary: Adding a section of doc for guard_filter_fn.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76756743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156114
Approved by: https://github.com/jansel
2025-06-19 16:13:12 +00:00
c9afcffed0 [AOTInductor] Call most runtime fallback ops without calling into Python (#154142)
Uses the new aoti_torch_call_dispatcher interface to call runtime fallback ops without calling back into Python.  This supports a limited subset of input and output datatypes, but a significant majority of remaining fallback ATen ops are covered.

Fixes #150988
Fixes #153478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154142
Approved by: https://github.com/desertfire
2025-06-19 15:27:15 +00:00
317af4c87b Revert "[cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)"
This reverts commit a5f59cc2eab3a5201712c52fe48c268357ba4f3c.

Reverted https://github.com/pytorch/pytorch/pull/156140 on behalf of https://github.com/atalman due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/156140#issuecomment-2988441548))
2025-06-19 15:09:29 +00:00
ab3393e923 [ROCm][CI] fix mi300 test failure after 6.4.1 update (#156368)
Fixes failures such as https://github.com/pytorch/pytorch/actions/runs/15739699156/job/44365395854: `test/test_linalg.py::TestLinalgCUDA::test_broadcast_batched_matmul_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156368
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-19 15:02:40 +00:00
0b62465b99 Revert "Refine alignment check along dynamic dimension for grouped MMs (#155466)"
This reverts commit 830a335a7da5fec00395d440ba568749cb4e2e9e.

Reverted https://github.com/pytorch/pytorch/pull/155466 on behalf of https://github.com/atalman due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/155466#issuecomment-2988285117))
2025-06-19 14:25:38 +00:00
fec8af8b98 [bugfix] [build] guard cuda version for ipc with fabric handle (#156394)
https://github.com/pytorch/pytorch/pull/156074 adds the support of ipc with fabric handle, but the code cannot compile for cuda < 12.3 (in particular, e.g. cuda 11.8).

this pr improves the support by adding some compilation-time check against cuda versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156394
Approved by: https://github.com/ngimel
2025-06-19 13:54:01 +00:00
769d754ab2 [BE][MPS] Refactor core matmul logic into matmul_core (#155969)
In preparation of adding integer addmm, move matmul computation part into matmul_inner function

Change callstack from group_id, thread_id_in_group to thread_id, threadid_in_group, which eliminates the need of calculating the index
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
2025-06-19 13:22:41 +00:00
8cb0c4a4da [Intel GPU][AOTI] Add xpu mkldnn ops support for AOTInductor. (#154586)
This PR is closely related to the previous one in the stack(https://github.com/pytorch/pytorch/pull/150287). The previous PR enabled MKLDNN ops for XPU, which caused several test cases to fail in test_aot_inductor.py. This PR addresses those failing cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154586
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #150287
2025-06-19 13:17:22 +00:00
83259cf7a7 [Inductor][Intel GPU] Support mkldnn Conv post op fusion for XPU. (#150287)
This PR adds support for MKLDNN Conv post-op fusion in the Inductor Intel GPU backend under freezing mode.
The implementation reuses the CPU's MKLDNN pattern fusion mechanism, as well as the corresponding Inductor unit tests for CPU MKLDNN pattern fusion.

The performance improvement:

| Suite       | Inductor Speedup (Baseline) | Inductor Speedup (Compared) | Acc Failed | Perf Failed | Inductor Perf Ratio | Speedup  |
|-------------|-----------------------------|------------------------------|------------|--------------|----------------------|----------|
| Huggingface | 2.134838                    | 2.125740314                  | 0          | 0            | 1.001462504          | 100.43%  |
| Torchbench  | 1.808558                    | 1.675100479                  | 0          | 0            | 1.075722187          | 107.97%  |
| Timm        | 2.343893                    | 2.070476653                  | 0          | 0            | 1.131023832          | 113.21%  |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150287
Approved by: https://github.com/ZhiweiYan-96, https://github.com/EikanWang, https://github.com/jansel
2025-06-19 13:17:22 +00:00
0504480f37 Add CUDA 12.9 libtorch nightly (#155895)
https://github.com/pytorch/pytorch/issues/155196

with libtorch docker added, we can add the build script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155895
Approved by: https://github.com/atalman
2025-06-19 13:15:42 +00:00
ccb1f687d6 Port two dynamo test cases for Intel GPU (#156056)
For https://github.com/pytorch/pytorch/issues/114850, we will port more cases to Intel GPU. This PR is for 2 dynamo cases. We adopted "torch.accelerator.current_accelerator()" to determine the backend, and added XPU support in decorators like @requires_gpu, also enabled XPU for some test path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156056
Approved by: https://github.com/guangyey, https://github.com/jansel
2025-06-19 12:49:04 +00:00
a8fe982993 Revert "[build] Create target for flash attention (#156235)"
This reverts commit 6d02321472ee0761092166dd273eb3ec386cf0c0.

Reverted https://github.com/pytorch/pytorch/pull/156235 on behalf of https://github.com/ZainRizvi due to Weird, but seems to have broken trunk: test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check [GH job link](https://github.com/pytorch/pytorch/actions/runs/15748768079/job/44390494621) [HUD commit link](6d02321472) ([comment](https://github.com/pytorch/pytorch/pull/156235#issuecomment-2987784207))
2025-06-19 11:47:27 +00:00
4da98351b9 [SymmMem] Add NVSHMEM PUT with Signal support to Triton (#156211)
Adds NVSHMEM PUT with Signal operation support for Triton kernels:

- Added`putmem_signal_block` core.extern wrapper for nvshmemx_putmem_signal_block
- Added kernel for 2-rank PUT operation with atomic SET signaling (`test_triton_put_signal_set`)
- Added kernel for 2-rank PUT operation with atomic ADD signaling (`test_triton_put_signal_add`)

**Tests:**
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py`

`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put_signal_set`
`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put_signal_add`

```python
@skipIfRocm
@requires_triton()
def test_triton_put_signal_set(self) -> None:
    @triton.jit
    def put_signal_kernel(dst_ptr, src_ptr, numel: tl.constexpr, sig_ptr,
                         signal_val: tl.constexpr, sig_op: tl.constexpr, peer: tl.constexpr):
        nvshmem.putmem_signal_block(dst_ptr, src_ptr, numel, sig_ptr, signal_val, sig_op, peer)

    # ... setup code ...

    val = 11
    inp = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(val)
    out = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(-1)  # destination buffer

    # Signal flag buffer - starts at 0, will be set to 1 upon completion
    flag = symm_mem.empty(1, dtype=torch.int64, device=self.device).fill_(0)

    peer = 1 - rank
    NVSHMEM_SIGNAL_SET = 0  # atomic set operation
    SIGNAL_VAL = 1  # completion signal value

    if rank == 0:
        # Rank 0 atomically: (1) puts data to rank 1, (2) sets rank 1's flag to 1
        put_signal_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, sig_ptr=sig_ptr,
                                    signal_val=SIGNAL_VAL, sig_op=NVSHMEM_SIGNAL_SET,
                                    peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   # Rank 1 can check flag to know data transfer completed!
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] flag buffer: {flag}")
```

```
[Rank 0] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:0', dtype=torch.int8)
[Rank 0] out buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:0', dtype=torch.int8)
[Rank 0] got data from peer 1
[Rank 0] flag buffer: tensor([0], device='cuda:0')
[Rank 1] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] out buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 1] flag buffer: tensor([1], device='cuda:1')

----------------------------------------------------------------------
Ran 2 tests in 17.046s

OK
```

Working as expected! Data is received, and flag set to 1 for completion signal!

```python
@skipIfRocm
@requires_triton()
def test_triton_put_signal_add(self) -> None:
   @triton.jit
   def put_signal_kernel(dst_ptr, src_ptr, numel: tl.constexpr, sig_ptr,
                        signal_val: tl.constexpr, sig_op: tl.constexpr, peer: tl.constexpr):
       nvshmem.putmem_signal_block(dst_ptr, src_ptr, numel, sig_ptr, signal_val, sig_op, peer)

   # ... setup code ...

   # Signal buffer (uint64 flag)
   flag = symm_mem.empty(1, dtype=torch.int64, device=self.device).fill_(0)

   peer = 1 - rank
   NVSHMEM_SIGNAL_ADD = 5  # atomic add operation
   SIGNAL_VAL = 16  # Signal value to add

   if rank == 0:
       # Rank 0 puts into Rank 1 and adds to signal
       put_signal_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, sig_ptr=sig_ptr,
                                   signal_val=SIGNAL_VAL, sig_op=NVSHMEM_SIGNAL_ADD,
                                   peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] flag buffer: {flag}")

```

```
[Rank 0] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:0', dtype=torch.int8)
[Rank 0] out buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:0', dtype=torch.int8)
[Rank 0] got data from peer 1
[Rank 0] flag buffer: tensor([0], device='cuda:0')
[Rank 1] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] out buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 1] flag buffer: tensor([16], device='cuda:1')

----------------------------------------------------------------------
Ran 1 test in 17.145s

OK
```

The flag transition from [0] → [16] confirms both data delivery and atomic signal completion in a single operation!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156211
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
2025-06-19 10:24:30 +00:00
348e2a76df s/defer_runtime_assert/guard_or_defer_runtime_assert (#156397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156397
Approved by: https://github.com/laithsakka
2025-06-19 10:18:28 +00:00
02080c2cd9 Fix num_heads inference in ONNX Attention-23 exporter (#156367)
Fixes issue in torch-onnx exporter for Attention: https://github.com/pytorch/pytorch/issues/156105

Previously the number of heads attributes inferred by the exporter is incorrect. It should be read from input dimension -3 not dimension 3:

![image](https://github.com/user-attachments/assets/26f10e15-bc98-42ac-807a-2e089a7d996a)

But in fact, [torch sdpa](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) doesn't support combined num_heads and head_size dimensions like [ONNX](https://onnx.ai/onnx/operators/onnx__Attention.html) does, so this num_heads attribute is not needed.

Extending support to rank>4 can be left as future work if there is use case for that. The translation logic will look like: Reshape(Q,K,V to 4d) -> Attention -> Reshape(Y to original rank).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156367
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2025-06-19 09:40:01 +00:00
8fcda2c60d [SymmMem] Add runtime detection of NVSHMEM (#156291)
so that we can pick the default backend for SymmetricMemory without
fully relying on env var `TORCH_SYMMMEM=CUDA | NVSHMEM`

On Python side, the following API is added:
`torch.distributed._symmetric_memory.is_nvshmem_available()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156291
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116, #156117
2025-06-19 08:26:11 +00:00
eabf7cd3c5 [export] update docs for Dims (#156262)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156262
Approved by: https://github.com/angelayi
2025-06-19 06:25:21 +00:00
ec0276103f [PGO] fix whitelist scalar bug (#156194)
Test Plan:
test_pgo

Rollback Plan:

Differential Revision: D76830552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156194
Approved by: https://github.com/bobrenjc93
2025-06-19 05:51:21 +00:00
1c960c5638 [Makefile] lazily setup lintrunner on first make lint run (#156058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156058
Approved by: https://github.com/ezyang
2025-06-19 05:43:35 +00:00
242eb19c83 [InductorBench] Fix accuracy validation logic for MPS (#156385)
As it does not support full fp64, validate against float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156385
Approved by: https://github.com/Skylion007
2025-06-19 05:37:51 +00:00
ce8180a61d [c10d] Disable stack trace call in logging (#156362)
Summary: We noticed std::future_error: Broken promise errors in logging, so let's disable for now and will investigate more.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76929722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156362
Approved by: https://github.com/fegin
2025-06-19 05:11:57 +00:00
a21806f038 [ez][export] Better error message for schema check in torch.export.load (#156361)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/156354

torch.export.load() only supports files generated by torch.export.save()

Test Plan:
CI

Rollback Plan:

Differential Revision: D76928725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156361
Approved by: https://github.com/zhxchen17
2025-06-19 04:50:56 +00:00
3f69e3b3a0 Add view_simple as meta function for view, and avoid calling reshape_view_helper for unbacked (#154757)
address https://github.com/pytorch/pytorch/issues/153303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757
Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel
2025-06-19 04:50:18 +00:00
3bec588bf5 [aot][ca] save bw_module in AOTAutogradCache (#151860)
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash).

The bw_module's generated code is then serialized, and at compiled autograd runtime, it is restored via symbolic_trace. This also means that presence of tensor constructors will be lifted as constants. Something we will address separately.

Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #156120
2025-06-19 03:47:41 +00:00
6d02321472 [build] Create target for flash attention (#156235)
Create a target for flash attention? so it can be built using ninja flash_attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156235
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-06-19 03:35:04 +00:00
77518d1a13 [CI] fix xpu-smi hang in XPU test container (#156171)
Apply same fix #155443 for XPU test container, refer https://github.com/pytorch/pytorch/actions/runs/15589866881/job/43907973867#step:15:911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156171
Approved by: https://github.com/huydhn
2025-06-19 02:48:11 +00:00
19ffdf4ea0 [dcp] add new checkpoint staging to preserve storage sharing and support mutable state_dicts (#155192)
Summary:
This implements staging in way that doesnt mess up checkpointing semantics. We want to be close to torch.save/load semantics and when async checkpointing is used it messes up shared storages, doesnt handle custom objects or tensors well. EG: users passes a state_dict with a cuda tensor in datatype.  this is deepcloned causing the staging tensor to be created on GPU. This can cause ooms is hard to debug.

This diffs hooks into deepcopy of storages to move them to cpu using the cached storages created for async checkpoint staging.  This allows reusing storages created for staging to avoid recreating them on each checkpoint while also being flexible enough to handle any changes - clean up old storages or create new ones as needed.

Lifetime of staging storages is tied to the original storage object. when the original storage object is gc-ed, we delete the corresponding staging storage from cache possibly causing it to gc-ed is there are no other references.  I am using data_ptr of the storage to keep track of this. Please share thoughts on this.
The alternative is to use fqn's instead of storage_id and verify the underlying storage object has same shape/size,etc to make the caching logic work. Current implementation is much simpler and cleaner.

The API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
 with staging_context(stager):
     cpu_state_dict = copy.deepcopy(state_dict)
```

Also, adds support for pinned-memory.

One problem this implementation does not address is that we lose the original device.

The only alternatives here are - pickle synchronously like torch.save but with special handling for storages. It is valuable to keep state_dict throughout the checkpointing process. so users can manipulate and debug as needed. so we need to unpickle in the background process. I think this is flexible, not performant and not very different to current solution but needs more code. One idea if we really want to address is this to stick the original device in a some variable on storage and then use it recover on load side. I think we do not need this for now and can be explicit about losing device type for async checkpointing.

Update:
Note: Due to reservations on hooking into deepcopy to customize it, the PR is now updated to use deepcopy like logic to clone the state_dict. There are some caveats to this solution:
1. Duplicated deepcopy code to hook into for tensors. There is a risk of this code getting outdated with python version changes. This is needed to handle several different types like NamedTuples, frozen dataclasses, nested dataclasses. deepcopy logic is relying on reduce_ex to get a function with which these can be constructed.
2. Since we are bypassing deepcopy and adding custom logic to clone a tensor, we are missing some of the functionality that exists in deepcopy for torch.Tensor like _clear_non_serializable_cached_data(), or other logic. Would like thoughts on which logic or if everything should be copied?
3. If any object implemented deepcopy , we will not be able to handle any tensors in the attrs with this logic because they likely just call copy.deepcopy on the attrs instead of this deepcopy logic. We are taking care of subclasses of torch.Tensor to workaround this.

The new API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
cpu_state_dict = copy.stage(state_dict)
```

Test Plan:
unit tests

Differential Revision: D75993324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155192
Approved by: https://github.com/mikaylagawarecki, https://github.com/pradeepfn
2025-06-19 02:04:21 +00:00
d4ad280429 Enable querying the build and runtime NCCL versions (#156305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156305
Approved by: https://github.com/wconstab, https://github.com/Skylion007, https://github.com/fegin
2025-06-19 02:00:08 +00:00
bc9bd2a766 Use linux.2xlarge runner (#156351)
The cuda version of this job uses a linux.2xlarge here so matching that to see if this job really needs a 12xlarge system or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156351
Approved by: https://github.com/jeffdaily, https://github.com/cyyever
2025-06-19 01:50:56 +00:00
e5a1197191 Fix fx tracing for mark dynamic (#156346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156346
Approved by: https://github.com/tony-ivchenko
2025-06-19 01:03:09 +00:00
6959b5febe Context on torch.cuda.memory._record_memory_history max_entries (#155889)
Context on torch.cuda.memory._record_memory_history buffer behavior

## Description

Answer questions:
- Can I keep _record_memory_history() always enabled with the default max_entries=sys.maxsize (9223372036854775807)? Will it consume a significant amount of CPU RAM?
- If I set max_entries to a lower value, e.g. 2000, will it keep the first 2000 entries and then stop recording or will it keep the most recent 2000 entries before each snapshot (fifo-style)?
- What is the expected size on disk of the snapshots? Some KBs, MBs?

Fixes #129674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155889
Approved by: https://github.com/ngimel
2025-06-19 00:44:43 +00:00
6303cc41b7 [ROCm] support CUDA_KERNEL_ASSERT using abort() (#155262)
We won't have the full message that __assert_fail would provide, but at least we won't silently do nothing.

Fixes #155045.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155262
Approved by: https://github.com/hongxiayang, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-18 23:52:35 +00:00
b8c2d4c259 add a corner test case of dynamic sizes for combo kernel (#156035)
Summary:
Added a unit test case for a corner case of combo kernel where all below are true:
1. more than 1 dimensions are dynamic size
2. no_x_dim presistent reduce op

Test Plan:
```
buck2 test mode/opt caffe2/test/inductor:combo_kernels -- test_dynamic_shapes_persistent_reduction_no_x_dim_2
```

Rollback Plan:

Differential Revision: D76699002

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156035
Approved by: https://github.com/mlazos
2025-06-18 22:57:09 +00:00
76d07e919f Unbreak //c10/util:base (#156216)
Missing dep.

Bifferential Revision: [D76840057](https://our.internmc.facebook.com/intern/diff/D76840057/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156216
Approved by: https://github.com/janeyx99, https://github.com/desertfire
2025-06-18 22:44:20 +00:00
9bfefda296 [DCP][PyTorch Staging APIs][2/x] Handle 0-elem case + ShardedTensor copy for staging (#156092)
Summary:
### Diff Context

1. Sometimes, a tensor might have non-zero size and 0 numel. In this case, pinning memory will fail
so we take a best guess at how to replicate the tensor below to maintain symmetry in the returned
state dict.

2. ShardedTensor copying was not handled originally in PyTorch state_dict copy APIs, handled in this diff.

Test Plan: CI

Differential Revision: D75553096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156092
Approved by: https://github.com/pradeepfn
2025-06-18 22:41:25 +00:00
a5b4463d60 [nativert] session state (#156190)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76827309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156190
Approved by: https://github.com/zhxchen17
2025-06-18 22:40:44 +00:00
6918758f55 [export] Update documents for ExportGraphSiganture (#156244)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/156184

The current document for ExportGraphSignature doesn't reflect `torch.export.export()` returns non-functional graph by default. And users may get confused.

Test Plan:
Document change only. CI

Rollback Plan:

Differential Revision: D76849097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156244
Approved by: https://github.com/yushangdi
2025-06-18 22:37:34 +00:00
1e474cc9c8 [ONNX] Fix how shapes are computed for float4 (#156353)
Changed the way we compute shapes for unpacked float4. Previously we always added a last dimension [2] to existing shape, but this doesn't really make sense because it prevents use from being able to represent any shape other than those with a list dim [2]. I updated the logic to be `[*shape[:-1], shape[-1]*2]` which doubles the last dimension. This is more in line with what we see in practice when people are using 4bit types, and it allows us to represent any shape with an even dimension at the end, which is much more reasonable in my opinion.

Also clarified in https://github.com/pytorch/pytorch/pull/148791#discussion_r2155395647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156353
Approved by: https://github.com/titaiwangms
2025-06-18 22:28:02 +00:00
9afee0fa96 [inductor] Set num_workers to number of available cpu divided by number of available gpu (#156201)
internal: https://fb.workplace.com/groups/1075192433118967/posts/1689562705015267/?comment_id=1690284241609780&notif_id=1749770611538976&notif_t=work_group_comment&ref=notif

Right now it doesn't have the divided by 2 logic yet. Not sure how to tell if we are on a dev machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156201
Approved by: https://github.com/masnesral
2025-06-18 22:15:32 +00:00
e5a0b73ce9 [MTIA Aten Backend] Migrate logical_and.out (#156286)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logical_and.out to in-tree

Differential Revision: [D76874551](https://our.internmc.facebook.com/intern/diff/D76874551/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156286
Approved by: https://github.com/nautsimon, https://github.com/jingsh
ghstack dependencies: #155634, #156046, #156047, #156283, #156284, #156285
2025-06-18 21:57:05 +00:00
bfccfa0b31 Revert "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit cf90c9f8d1632777ec5f4b6ccaa14bc5bf259e9c.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/atalman due to break internal tests ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-2985785811))
2025-06-18 21:48:50 +00:00
f5eb42e4c0 [nativert] move layoutplanneralgorithm to libtorch (#156205)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76831634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156205
Approved by: https://github.com/zhxchen17
2025-06-18 21:46:38 +00:00
d1c924c68a [MTIA Aten Backend] Migrate lt.Tensor_out / lt.Scalar_out (#156285)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate t.Tensor_out / lt.Scalar_out to in-tree.

Differential Revision: [D76873997](https://our.internmc.facebook.com/intern/diff/D76873997/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156285
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047, #156283, #156284
2025-06-18 21:40:26 +00:00
5c7e1d39ab [MTIA Aten Backend] Migrate logit (#156284)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logit to in-tree.

Differential Revision: [D76871451](https://our.internmc.facebook.com/intern/diff/D76871451/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156284
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047, #156283
2025-06-18 21:36:27 +00:00
706e236b08 [MTIA Aten Backend] Migrate logical_or.out / log.out / log2.out (#156283)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logical_or.out / log.out / log2.out to in-tree.

Differential Revision: [D76857072](https://our.internmc.facebook.com/intern/diff/D76857072/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156283
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047
2025-06-18 21:27:58 +00:00
ab81fb846c [MTIA Aten Backend] Migrate remainder.Tensor_out / reciprocal.out / neg.out (#156047)
Migrate remainder.Tensor_out / reciprocal.out / neg.out

Differential Revision: [D76696710](https://our.internmc.facebook.com/intern/diff/D76696710/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156047
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046
2025-06-18 21:17:34 +00:00
c26ce593d8 [MTIA Aten Backend] Migrate nan_to_num.out (#156046)
Migrate nan_to_num.out

Differential Revision: [D76696155](https://our.internmc.facebook.com/intern/diff/D76696155/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156046
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634
2025-06-18 21:14:13 +00:00
2f1c5c4131 [MTIA Aten Backend] Achieve CPU fallback by overriding registration (#155634)
# Context

MTIA supports CPU fallback, and people can set it using env vars. By migrating aten backend to in-tree, we also need to provide this support.

# This diff

Suggested by Alban(pytorch core), instead of skipping registration, this diff achieves CPU fallback by doing additional registration and override.

The benefits of this approach:
1. The previous solution has problem handling ops that have default dispatch key(e.g. CompositeImplicitAutograd), and can't really achieve CPU fallback.
2. The CPU fallback related logic can be aggregated in aten_mtia_cpu_fallback.cpp.

----------------

p.s. D76314740 also tried reusing the yaml parsing logic in mtia's python script, but realized that the env vars are only available in runtime but not compile/codegen time

Differential Revision: [D76376644](https://our.internmc.facebook.com/intern/diff/D76376644/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155634
Approved by: https://github.com/nautsimon, https://github.com/albanD
2025-06-18 21:10:18 +00:00
e99cc126a4 [AOTInductor] Reuse input information instead of directly applying unbacked_symint_fallback (#156133)
Summary:
When we encounter unbacked symint during autotuning, we try to reuse existing
symbols from user provided inputs, then fallback.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_triton_dynamic_launcher_grid

Rollback Plan:

Differential Revision: D76769711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156133
Approved by: https://github.com/jingsh
2025-06-18 20:53:21 +00:00
728cf6721e Revert "[PT2]load dense delta by trimming prefixes (#155872)"
This reverts commit c74fd35050a7241f0c439501ef735aa6cdde751f.

Reverted https://github.com/pytorch/pytorch/pull/155872 on behalf of https://github.com/malfet due to Broke lint, internal has been backed out ([comment](https://github.com/pytorch/pytorch/pull/155872#issuecomment-2985542895))
2025-06-18 20:05:56 +00:00
c74fd35050 [PT2]load dense delta by trimming prefixes (#155872)
Summary:
In PT2 with GPU with AOTI, weight names are like
```merge.submod_0._run_on_acc_0.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```

but when publishing delta snapshots, lowering is skipped so weights are like
```merge.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```

so when loading delta weights in original model runner, we need to:
1. Redo tensorName -> weight idx look up, because the weight ordering may be different.
2. use trimmed tensorName to find the correct weight path.

Note that with this diff, delta snapshot loading still does NOT use xl weights. This should be fine for now as we are still publishing full model with non-xl weights.

Test Plan:
Merge only:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODULE=merge
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13

CUDA_VISIBLE_DEVICES=2,3 buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=DenseOnly --baseNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}  --moduleName=${MODULE} --predictor_hardware_type 1 --submodToDevice "" --deltaNetFile /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/delta_${DENSE_DELTA_SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}
```

Local replayer:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13

USE_SERVABLE=0 HARDWARE_TYPE=0 DENSE_DELTA_IDS=${DENSE_DELTA_SNAPSHOT_ID} ENABLE_REALTIME_UPDATE=1 CUDA_VISIBLE_DEVICES=6,7 sh ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 7455

USE_SERVABLE=0 sh sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 10 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_mtml_ctr_instagram_model_500 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} true 7455
```

Rollback Plan:

Differential Revision: D76520301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155872
Approved by: https://github.com/SherlockNoMad
2025-06-18 19:13:22 +00:00
48de3da253 fix: avoid flamegraph script setup conflicts (#156310)
Fixes #156309

Instead of any kind of locking and busy waits leaving room for multiple script downloads to happen, while only one `rename` will succeed and others will silently fail, removing any temporary files created during this process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156310
Approved by: https://github.com/malfet

Co-authored-by: Alexander Zhipa <azzhipa@amazon.com>
2025-06-18 19:06:22 +00:00
cbafba5794 Allow forcing FSDP2 to always use SUM reductions (#155915)
NCCL zero-copy support only works for SUM reductions. FSDP2, by default, was prefering AVG reductions or, when using `set_reduce_scatter_divide_factor`, PreMulSum reductions.

Moreover, PreMulSum reductions had a few bugs, such as #155903 and #155904.

This PR adds a flag to always use SUM reductions, potentially requiring separate pre-/post-scaling kernels, and reworks the `set_reduce_scatter_divide_factor` logic to make it safer (and renaming it to avoid confusion).

Differential Revision: [D76895058](https://our.internmc.facebook.com/intern/diff/D76895058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155915
Approved by: https://github.com/xunnanxu
2025-06-18 18:57:47 +00:00
9944cd0949 Convert to markdown: quantization-accuracy-debugging.rst, quantization-backend-configuration.rst, quantization-support.rst, random.rst (#155520)
Related to #155032

-   quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html)
-  quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html)
-  quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html)
-  random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-18 18:46:04 +00:00
30d3cf62fb support CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F (#154680)
Requires CUDA >= 12.9 and sm_90.

hipBLASLt has a similar enum but is not available until ROCm 7.0. Support the new enum early using a cmake test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154680
Approved by: https://github.com/malfet, https://github.com/atalman
2025-06-18 18:39:01 +00:00
aee2bfc5ba [Intel GPU] Update xpu triton commit pin for PyTorch release 2.8. (#154194)
As title.
Thanks @anmyachev  for the work on compatibility adaptation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154194
Approved by: https://github.com/jansel
2025-06-18 18:17:07 +00:00
2620361d19 Add batching rule for torch.matrix_exp (#155202)
## Summary

Adds the missing batching rule for `torch.matrix_exp` to enable efficient `vmap` support.
Previously, using `vmap` with `matrix_exp` would trigger a performance warning and fall back to a slow loop-based implementation, even though `matrix_exp` natively supports batched inputs.

Fixes #115992

## Details

`torch.matrix_exp` is an alias for `torch.linalg.matrix_exp`. This PR adds vmap support by registering `matrix_exp` with `OP_DECOMPOSE`, which reuses the existing CompositeImplicitAutograd decomposition to automatically generate batching behavior from the operation's simpler component operations.

## Testing

The existing test suite for vmap and matrix_exp should cover this change. The fix enables:
- No performance warning when using `vmap(torch.matrix_exp)`
- Efficient native batched execution instead of loop-based fallback

**Edit:** Updated Details section to accurately reflect the implementation approach (decomposition rather than batch rule registration)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155202
Approved by: https://github.com/zou3519
2025-06-18 17:35:35 +00:00
eqy
a5f59cc2ea [cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)
The native kernel doesn't support batch splitting so the previous check wasn't aggressive enough in dispatching to cuDNN

https://github.com/pytorch/pytorch/issues/155225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156140
Approved by: https://github.com/ngimel
2025-06-18 17:32:36 +00:00
94f8679019 Revert "[PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)"
This reverts commit 6d3a4356f61b28a14abd95f641e2615deb186365.

Reverted https://github.com/pytorch/pytorch/pull/155809 on behalf of https://github.com/laithsakka due to pr_time_benchmarks ([comment](https://github.com/pytorch/pytorch/pull/155809#issuecomment-2985022572))
2025-06-18 16:52:19 +00:00
36f7a027b5 [MPS] Implement upsample_trilinear as Metal shader (#156263)
But only forward for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156263
Approved by: https://github.com/dcci
ghstack dependencies: #156256, #156090
2025-06-18 16:10:02 +00:00
bf06190e21 Integrated AMD AWS runners into Pytorch CI (#153704)
Integrated AMD AWS runners into PyTorch CI, including the linux.24xl.amd for performance tests, the linux.8xl.amd with AVX512 support for unit and periodic tests, and the linux.12xl.amd with AVX2 support for unit and periodic tests.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153704
Approved by: https://github.com/malfet, https://github.com/jithunnair-amd

Co-authored-by: kiriti-pendyala <kiriti.pendyala@amd.com>
2025-06-18 15:58:22 +00:00
ce3406817d Revert "[dynamo] control one_graph behavior additionally through config (#154283)"
This reverts commit fe37db4f1270745d6c523623143332ddf263af55.

Reverted https://github.com/pytorch/pytorch/pull/154283 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2984795214))
2025-06-18 15:53:32 +00:00
c5d3e7a4ff Revert "[dynamo] add set_fullgraph decorator/context manager (#154289)"
This reverts commit 920f6e681ec70b664ed952255b8c1f97962f5de0.

Reverted https://github.com/pytorch/pytorch/pull/154289 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154289#issuecomment-2984774814))
2025-06-18 15:51:06 +00:00
408d9884b0 Revert "[dynamo] fix set_fullgraph for nested calls (#154782)"
This reverts commit 3c8c48f79344356c58e91b9c8588f85ff806e1c8.

Reverted https://github.com/pytorch/pytorch/pull/154782 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154782#issuecomment-2984764330))
2025-06-18 15:47:21 +00:00
6201981f48 Revert "[dynamo] handle fullgraph toggle using nested torch.compile (#155166)"
This reverts commit 614a41514545cbdd15757ef2586d433d7d34041c.

Reverted https://github.com/pytorch/pytorch/pull/155166 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/15726606697/job/44333233942) [HUD commit link](a6a3a44144) ([comment](https://github.com/pytorch/pytorch/pull/155166#issuecomment-2984751600))
2025-06-18 15:43:22 +00:00
d290fe7690 Remove legacy export testing path (#156093)
Summary: After this diff stack lands, we are pretty much done with the training IR migration. So there is no need to run extensive legacy export test.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76734378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156093
Approved by: https://github.com/desertfire
2025-06-18 15:36:44 +00:00
7531bd6491 [ROCm] upgrade to 6.4.1 patch release (#156112)
Fixes #155292.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156112
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-18 15:21:44 +00:00
830a335a7d Refine alignment check along dynamic dimension for grouped MMs (#155466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155466
Approved by: https://github.com/ngimel
2025-06-18 15:15:05 +00:00
6d3a4356f6 [PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move and `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)` such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```

**Results:**
For instance, for the `res2net101_26w_4s` model in pytorch benchmark, when running with `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory are
* baseline: 7.73GiB
* with the chage: 6.45GiB

As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.

cc and credit to @ShatianWang for noticing this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
ghstack dependencies: #155943
2025-06-18 14:38:55 +00:00
c177abd217 Disable pinning check when loading sparse tensors (#154638)
Disables pinning check as unnecessary and to fix https://github.com/pytorch/pytorch/issues/153143 when loading sparse tensor from external storage with sparse tensor invariants check enabled.

Fixes https://github.com/pytorch/pytorch/issues/153143 .

For FC, to be landed two weeks after https://github.com/pytorch/pytorch/pull/154617, see https://github.com/pytorch/pytorch/pull/154617#issuecomment-2919643612.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154638
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-06-18 14:33:36 +00:00
8f02161d10 Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)"
This reverts commit a6a3a441442a96f38d0771c985f753223cea2ba0.

Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/15726606697/job/44333233942) [HUD commit link](a6a3a44144) ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2984409088))
2025-06-18 14:19:39 +00:00
b30e04b3c8 Make the NCCL PG Options and Config copyable and safe to init standalone (#155700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155700
Approved by: https://github.com/kwen2501
2025-06-18 13:36:27 +00:00
1bb9b1858b [CPU][Inductor] Improve A16W4 GEMM template performance by using block_n=32 (#156174)
**Summary**
We found that using `block_n=32` brings better performance for A16W4 than `block_n=64` because cache locality is better and parallelism is better if N is small and more cores are used.
For example, when running Llama-3.1-8B with A16W4 and batch size = 16 on 43 cores, `block_n=32` is faster by >10% E2E for both first and next token.

**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156174
Approved by: https://github.com/leslie-fang-intel
2025-06-18 13:17:46 +00:00
d99cac2816 [Kineto][submodule] Update kineto pin for XPU toggle feature (#155488)
Part of #154898
Update kineto submodule

Summary: We add the toggleCollectionDynamic functionality to XPUPTI in Kineto, so profiler can be enabled/disabled dynamically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155488
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-06-18 12:39:58 +00:00
c11888e7a6 Skip more tests on s390x (#155210)
Make CI for s390x green before fixing and restoring tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155210
Approved by: https://github.com/seemethere
2025-06-18 12:07:17 +00:00
402ae09e41 [BE] fix typos in c10/ (#156078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156078
Approved by: https://github.com/malfet, https://github.com/cyyever
2025-06-18 10:24:44 +00:00
f45f483884 [user triton] AOT Inductor support for new host-side TMA api (#155879)
This adds support for the host-side TMA api (TensorDescriptor.from_tensor) for AOTI. Note: this should support all the same features as the old (experimental) TMA api, but not some new features of the new TMA, like mxfp4 support.

Note: one complexity with the new TMA api is that a single TMA descriptor passed to the python kernel turns into 1 + 2 * N args in the cubin function signature, for a rank-N tensor.

What this PR contains:
1) device_op_overrides.py: add a rough copy of fillTMADescriptor from https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L283. However, the fillTMADescriptor implementation in Triton is significantly modified, so that much of the computation (about swizzling and data types) is done before the time of the TMA construction. For simplicity, I've moved the computation into the cuda helper kernel (as was the previous strategy with fill2DTMADescriptor); but long term we might want to unify our implementation with the upstream implementation
2) device_op_overrides.py: introduces a struct "StableTMADescriptor" which stores some of the 1 + 2 * N args for the cubin signature (along with the global shape, which is not strictly needed, but this cleans up the call to the triton kernel
3) plumbing through cpp_wrapper_gpu.py. The main thing to note is: the code generated by cpp_wrapper_gpu.py generally refers to the StableTMADescriptor object when it passes around a "tma descriptor" variable. At the very end (in generate_args_decl), the StableTMADescriptor is unwrapped and the individual arguments are passed into the cubin.

Tests: test_aot_inductor.py's test_triton_kernel_tma_descriptor_{N}d_dynamic_{D}_tma_version_{V}_cuda: for N in {1, 2}  and D in {True, False}, and V = {new, old}, this test passes (or is skipped, if the appropriate TMA API is not available). Tested on H100 for Triton 3.3 and Triton 3.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155879
Approved by: https://github.com/desertfire
2025-06-18 09:35:11 +00:00
577baa4116 [c10d] Add a logger for all nccl collectives with its time duration when completed (#156008)
Summary: We want to build a logging table for tracking the collective time spent on GPU for all internal workloads. Since we have a cudaEventQuery for both the start and end of a collective (We rolled out ECudaEventStart (enableTiming) fully already), we plan to add this logging table inside the watchdog of PyTorch ProcessGroupNCCL so that we get to know the duration of collectives.

Test Plan:
CI + dry run.

Rollback Plan:

Differential Revision: D76552340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156008
Approved by: https://github.com/fegin, https://github.com/eqy
2025-06-18 09:08:42 +00:00
c5a4fe9c17 [CI] fix the ci image name for public copy in ghcr (#156169)
After the PR #152209 landed, the name of ci image public copy in ghcr is not correct. For example, https://github.com/pytorch/pytorch/actions/runs/15698468716/job/44228133522#step:10:8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156169
Approved by: https://github.com/malfet
2025-06-18 08:16:56 +00:00
a6a3a44144 [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-18 07:27:20 +00:00
614a415145 [dynamo] handle fullgraph toggle using nested torch.compile (#155166)
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings was strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.

Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
2025-06-18 07:27:20 +00:00
3c8c48f793 [dynamo] fix set_fullgraph for nested calls (#154782)
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
2025-06-18 07:27:09 +00:00
920f6e681e [dynamo] add set_fullgraph decorator/context manager (#154289)
Implements https://github.com/pytorch/pytorch/issues/144908.

Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
2025-06-18 07:27:00 +00:00
fe37db4f12 [dynamo] control one_graph behavior additionally through config (#154283)
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-06-18 07:26:52 +00:00
ccc6279b40 flex attention: fix dispatch order for tensor subclasses, avoid hardcoding call to faketensor impl in dynamo (#151719)
This is enough to get @XilunWu 's stack in a state where his flex_attention DTensor implementations worked E2E for me. It also required these changes on the DTensor side, to properly add a DTensor rule for flex backward: P1789852198

There are two problems:

(1) in the normal dispatcher, we have a precedence ordering between modes and subclasses. Modes are dispatched to first, but modes are allowed to return NotImplemented, giving subclasses a chance to run.

This normally happens automatically in `FakeTensorMode.__torch_dispatch__` and `FunctionalTensorMode.__torch_dispatch__`. However, since HOPs implement these two modes themselves, HOPs do not get this benefit. For now, I ended up hardcoding this `NotImplemented` logic directly into the functional/fake rules for flex attention.

Having to do this for every HOP seems a bit painful. If we could plumb every HOP through `Fake[|Functional]TensorMode.__torch_dispatch__` then we would get this support. Another option could be to just assume that most HOP <> mode implementations want the same treatment by default, and hardcode this `NotImplemented` logic into `torch/_ops.py`. I'm not sure if we'd need a way for the HOP to opt out of this though.

(2) We were hardcoding a call to flex attention's fake implementation in dynamo to run fake prop. This is technically wrong for subclasses, because it doesn't give subclasses the chance to interpose on the op and desugar it before fake prop runs. I tweaked dynamo's logic to call the op, and let the dispatcher handle invoking the fake implementation.

**Testing** Xilun is adding some DTensor tests in his PR that will end up testing this logic. If folks would prefer, though, I can try to add a test that uses another subclass instead that is maybe more basic.

This is the tlparse that his DTensor test gnerated for me: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/0196c1d3-a9a2-46ea-a46d-aa21618aa060/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151719
Approved by: https://github.com/ydwu4

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-06-18 07:02:04 +00:00
bdb1553b77 [inductor][cutlass] binary remote cache (#156248)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally

this is the OSS only part of the change, to facilitate integration

Test Plan:
## prove that we can upload successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Reviewed By:
henrylhtsang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248
Approved by: https://github.com/henrylhtsang
2025-06-18 06:51:22 +00:00
96df866410 [audio hash update] update the pinned audio hash (#156259)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156259
Approved by: https://github.com/pytorchbot
2025-06-18 06:02:46 +00:00
a5df6ffbc2 Improve IPC for Expandable Segments to use fabric handle when possible (#156074)
Improve upon https://github.com/pytorch/pytorch/pull/130890 , inspired by https://github.com/pytorch/pytorch/pull/130890#issuecomment-2278882984 , we can automatically use the fabric handle for IPC when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156074
Approved by: https://github.com/ngimel, https://github.com/malfet
2025-06-18 05:22:06 +00:00
29867b211a [cutlass backend] Add __init__.py to cutlass_lib_extensions (#156234)
When using docker with cutlass backend, we can get
```
No module named 'torch._inductor.codegen.cuda.cutlass_lib_extensions'
```
First reported by @nWEIdia in https://github.com/pytorch/pytorch/issues/155888

Evidence that this fixes: https://github.com/pytorch/pytorch/pull/156136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156234
Approved by: https://github.com/mlazos, https://github.com/Skylion007
2025-06-18 05:03:43 +00:00
c28e74e457 [MPS] Add nearest_3d forward and backward (#156090)
Introduce generalizable `UpsampleParams` structure in `UpSample.h`, which could be shared between CPU and MPS
Delete `upsample_nearest3d` MPS fallback and replace it with proper shader
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156090
Approved by: https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #156256
2025-06-18 04:48:15 +00:00
a82c171bb2 remove skipifrocm from composability tests (#156036)
Porting over DTensor training codebase to rocm atm and was reading through a 2D unit tests and noticed a couple of the unit tests already work on rocm even though it is being skipped. pipeline parallel tests pass too

tested locally
<img width="561" alt="image" src="https://github.com/user-attachments/assets/7c40c0f2-2de8-4cf1-8e36-0ba2bba46baa" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156036
Approved by: https://github.com/jeffdaily
2025-06-18 04:24:42 +00:00
9ed0060225 Provide access to the cudaGraph_t underlying a CUDAGraph. (#155164)
There are a few considerations here:

1. A user might want to modify the cudaGraph_t either during the stream capture or after the stream capture (but before instantiation). This draft implements modification after stream capture only, though support could be added for modification during stream capture by applying
https://github.com/pytorch/pytorch/pull/140979/files#diff-d7302d133bb5e0890fc94de9aeea4d9d442555a3b40772c9db10edb5cf36a35cR391-R404

2. Previously, the cudaGraph_t would be destroyed before the end of capture_end() unless the user had previously called enable_debug_mode(). There is no way to implement this correctly without removing this restriction, or forcing the user to always call enable_debug_mode(). However, enable_debug_mode() is a confusing API (despite being an instance method, it would modify a static global variable; thus, putting one CUDAGraph object into debug mode puts all of them into debug mode, which is not acceptable in my opinion). Therefore, I made enable_debug_mode() into a no-op. This means that the CPU memory usage will increase after this change. I think this is likely to be fine.

3. No python bindings yet. These should be easy to add. It is probably worthwhile to take some time to make sure that the returned cudaGraph_t can be converted into the cuda-python cudaGraph_t in a reasonable, hopefully type-safe, manner (but without making cuda-python a dependency of pytorch), since I imagine most users will use the pip cuda-python package to make modifications.

4. There are two foot guns:

   a. The cudaGraph_t returned by raw_cuda_graph() is not owned by the user, so it will be destroyed once the owning CUDAGraph is destroyed (or calls reset()).

   b. The following seuquence won't work as intended:

```
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.replay()
```

This won't work because the user must call instantiate() again after modifying cudaGraph_t. You could add a "safety" mechanism by traversing the cudaGraph_t to create a hash and seeing if the hash changes between calls to replay(), but this is likely way too expensive.

I think these two foot guns are probably okay given that this a bit of an experts' API.

Fixes #155106

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155164
Approved by: https://github.com/ngimel
2025-06-18 03:39:28 +00:00
17b38b850e [ca] Allow using compiled autograd context managers during backward runtime (#156120)
Added an invariant that nested compiled autograd context managers must exit before their parent context manager. This allows us to defer the thread check.

FIXES https://github.com/pytorch/pytorch/issues/152219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156120
Approved by: https://github.com/jansel
ghstack dependencies: #155521, #155480
2025-06-18 03:01:15 +00:00
10d41c7d20 Add SDPA patterns for T5 models (#155455)
* Add SDPA patterns for T5 models.
* Remove the stride check of mask, and do contiguous for mask in flash attention when the stride of last dim != 1 & != 0. This allows more SDPAs with complex mask to be accelerated using flash attention, such as the T5 model, where the generated masks may be not continuous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155455
Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-06-18 02:09:55 +00:00
4851863e3f fix hack to check if register_buffer has been overridden (#155963)
Followup on https://github.com/pytorch/pytorch/pull/125971

`self.register_buffer` will always be a a bound method on the instance (`self`) while `torch.nn.Module.register_buffer` is an unbound class method. `is`-ing these two things will never yield `True`. Instead, lets check the [original function object](https://docs.python.org/3/reference/datamodel.html#method.__func__). Note that the current logic doesn't break anything because the `else` branch will still do the "right thing" in the case `register_buffer` hasn't been overrridden, but it does mean we do less work!

Example demonstration:

```python
class Base:
    def register_buffer(self, buffer):
        pass

class InheritedOk(Base):
    pass

class InheritedOverride(Base):
    def register_buffer(self, buffer):
        pass

b = Base()
ok = InheritedOk()
override = InheritedOverride()

print(f"b.register_buffer is Base.register_buffer: {b.register_buffer is Base.register_buffer}") # False
print(f"ok.register_buffer is Base.register_buffer: {ok.register_buffer is Base.register_buffer}") # False
print(f"override.register_buffer is Base.register_buffer: {override.register_buffer is Base.register_buffer}") # False

print(f"b.register_buffer.__func__ is Base.register_buffer: {b.register_buffer.__func__ is Base.register_buffer}") # True
print(f"ok.register_buffer.__func__ is Base.register_buffer: {ok.register_buffer.__func__ is Base.register_buffer}") # True
print(f"override.register_buffer.__func__ is Base.register_buffer: {override.register_buffer.__func__ is Base.register_buffer}") # False
```

(I can make an associated issue if needed, but didnt see it required [in the contributing guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#merging-your-change))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155963
Approved by: https://github.com/mikaylagawarecki
2025-06-18 01:50:30 +00:00
202d2ae53a Convert rst to md: rpc.rst, signal.rst, size.rst, special.rst (#155430)
Fixes #155033

- [x] [rpc.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/rpc.rst)
- [x] [signal.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/signal.rst)
- [x] [size.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/size.rst)
- [sparse.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/sparse.rst) fixed in #155438 due to large size.
- [x] [special.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/special.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155430
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-18 01:27:04 +00:00
68996dc183 [BE][2/X] Phase out usage of use_max_autotune() (#155848)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155848
Approved by: https://github.com/masnesral
2025-06-18 01:18:09 +00:00
e8bfce9a43 Document how to use stack-based APIs with StableIValue (#155984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155984
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-06-18 01:10:23 +00:00
541297daae [Build] Allow metal shaders to include ATen headers (#156256)
No-op change that will be used later to share structs between CPU and Metal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156256
Approved by: https://github.com/dcci
2025-06-18 01:03:25 +00:00
3dabc351bb [Break XPU] Fix XPU UT failures introduced by community. (#156091)
Fixes #15089, Fixes #156063, Fixes #155689, Fixes #155692, Fixes #156146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156091
Approved by: https://github.com/jansel
2025-06-17 23:43:37 +00:00
38e1e5d54c Add get_pipeline_order() for Gpipe and 1F1B (#155935)
The [schedule visualizer](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/_schedule_visualizer.py) relies on `self.pipeline_order` to be populated. The `_PipelineScheduleRuntime` also depends on this to run the IR.

The single stage schedules do not implement this so this PR adds that. Also fixes a bug in the schedule visualizer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155935
Approved by: https://github.com/wconstab
2025-06-17 23:39:17 +00:00
5435e75399 [ez] rename choice_timings -> choice_timings_fn (#156099)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156099
Approved by: https://github.com/mlazos
ghstack dependencies: #155982, #155996, #156053
2025-06-17 23:30:27 +00:00
12b02137af [MPS] Add benchmark for scan operations (#156241)
Comparison of cumsum performance before and after Metal implementaton:

Previous performance (using torch==2.7.1):
```[-------------------------------  -------------------------------]
                                              |  eager  |  compile
1 threads: -------------------------------------------------------
      cumsum-dim0-32x32 (torch.float16)       |  131.0  |   136.9
      cumsum-dim0-128x128 (torch.float16)     |  116.9  |   121.2
      cumsum-dim0-512x512 (torch.float16)     |  132.5  |   151.9
      cumsum-dim0-1024x1024 (torch.float16)   |  150.0  |   163.0
      cumsum-dim1-32x32 (torch.float16)       |  125.9  |   140.9
      cumsum-dim1-128x128 (torch.float16)     |  116.4  |   129.4
      cumsum-dim1-512x512 (torch.float16)     |  135.9  |   150.1
      cumsum-dim1-1024x1024 (torch.float16)   |  139.5  |   154.2
      cumsum-1d-100 (torch.float16)           |  119.5  |   127.1
      cumsum-1d-10000 (torch.float16)         |  128.9  |   142.5
      cumsum-1d-1000000 (torch.float16)       |  140.6  |   145.6
      cumsum-dim0-32x32 (torch.float32)       |  115.7  |   132.5
      cumsum-dim0-128x128 (torch.float32)     |  118.0  |   131.5
      cumsum-dim0-512x512 (torch.float32)     |  138.8  |   151.6
      cumsum-dim0-1024x1024 (torch.float32)   |  155.5  |   164.2
      cumsum-dim1-32x32 (torch.float32)       |  127.2  |   141.7
      cumsum-dim1-128x128 (torch.float32)     |  117.7  |   130.5
      cumsum-dim1-512x512 (torch.float32)     |  138.2  |   152.3
      cumsum-dim1-1024x1024 (torch.float32)   |  144.4  |   158.6
      cumsum-1d-100 (torch.float32)           |  118.6  |   128.0
      cumsum-1d-10000 (torch.float32)         |  125.5  |   141.5
      cumsum-1d-1000000 (torch.float32)       |  143.9  |   158.4
      cumsum-dim0-32x32 (torch.bfloat16)      |  106.6  |   137.6
      cumsum-dim0-128x128 (torch.bfloat16)    |  118.1  |   131.0
      cumsum-dim0-512x512 (torch.bfloat16)    |  140.0  |   154.3
      cumsum-dim0-1024x1024 (torch.bfloat16)  |  153.2  |   164.4
      cumsum-dim1-32x32 (torch.bfloat16)      |  127.9  |   132.6
      cumsum-dim1-128x128 (torch.bfloat16)    |  116.5  |   129.6
      cumsum-dim1-512x512 (torch.bfloat16)    |  136.5  |   151.2
      cumsum-dim1-1024x1024 (torch.bfloat16)  |  139.8  |   144.8
      cumsum-1d-100 (torch.bfloat16)          |  115.7  |   129.4
      cumsum-1d-10000 (torch.bfloat16)        |  125.0  |   143.3
      cumsum-1d-1000000 (torch.bfloat16)      |  127.8  |   143.4

Times are in microseconds (us).
```

Current performance:
```
[--------------------------------  --------------------------------]
                                              |   eager   |  compile
1 threads: ---------------------------------------------------------
      cumsum-dim0-32x32 (torch.float16)       |    107.4  |    123.8
      cumsum-dim0-128x128 (torch.float16)     |    134.2  |    145.8
      cumsum-dim0-512x512 (torch.float16)     |    207.3  |    231.6
      cumsum-dim0-1024x1024 (torch.float16)   |    318.9  |    355.3
      cumsum-dim1-32x32 (torch.float16)       |     98.0  |    114.3
      cumsum-dim1-128x128 (torch.float16)     |    110.8  |    121.6
      cumsum-dim1-512x512 (torch.float16)     |    193.0  |    209.1
      cumsum-dim1-1024x1024 (torch.float16)   |    844.7  |    870.8
      cumsum-1d-100 (torch.float16)           |    108.4  |    125.0
      cumsum-1d-10000 (torch.float16)         |    784.7  |    852.3
      cumsum-1d-1000000 (torch.float16)       |  65855.2  |  66725.9
      cumsum-dim0-32x32 (torch.float32)       |    114.7  |    115.7
      cumsum-dim0-128x128 (torch.float32)     |    139.0  |    151.6
      cumsum-dim0-512x512 (torch.float32)     |    197.3  |    208.0
      cumsum-dim0-1024x1024 (torch.float32)   |    312.7  |    332.9
      cumsum-dim1-32x32 (torch.float32)       |     92.0  |    110.8
      cumsum-dim1-128x128 (torch.float32)     |    114.2  |    125.0
      cumsum-dim1-512x512 (torch.float32)     |    186.2  |    196.1
      cumsum-dim1-1024x1024 (torch.float32)   |    752.0  |    825.0
      cumsum-1d-100 (torch.float32)           |    112.4  |    122.0
      cumsum-1d-10000 (torch.float32)         |    793.5  |    863.5
      cumsum-1d-1000000 (torch.float32)       |  66431.8  |  66040.0
      cumsum-dim0-32x32 (torch.bfloat16)      |    111.6  |    121.6
      cumsum-dim0-128x128 (torch.bfloat16)    |    139.0  |    138.4
      cumsum-dim0-512x512 (torch.bfloat16)    |    217.6  |    230.1
      cumsum-dim0-1024x1024 (torch.bfloat16)  |    305.2  |    325.6
      cumsum-dim1-32x32 (torch.bfloat16)      |    100.5  |    110.9
      cumsum-dim1-128x128 (torch.bfloat16)    |    112.8  |    125.0
      cumsum-dim1-512x512 (torch.bfloat16)    |    187.8  |    208.9
      cumsum-dim1-1024x1024 (torch.bfloat16)  |    790.9  |    864.7
      cumsum-1d-100 (torch.bfloat16)          |    111.6  |    124.6
      cumsum-1d-10000 (torch.bfloat16)        |    778.1  |    844.9
      cumsum-1d-1000000 (torch.bfloat16)      |  64654.3  |  64082.5

Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156241
Approved by: https://github.com/malfet
2025-06-17 22:30:22 +00:00
fa4f07b5b8 Revert "[Docs] Convert to markdown to fix 155032 (#155520)"
This reverts commit cd66ff80307862ef8e75520054ecd19a5eff9f7e.

Reverted https://github.com/pytorch/pytorch/pull/155520 on behalf of https://github.com/atalman due to breaks multiple test_quantization.py::TestQuantizationDocs::test_quantization_ ([comment](https://github.com/pytorch/pytorch/pull/155520#issuecomment-2981996091))
2025-06-17 22:22:50 +00:00
54998c2daa Document padding size limitations in nn.modules.padding (#134840) (#155618)
Fixes #134840

Added documentation to clarify padding size constraints for all padding modes in nn.modules.padding:

- Circular padding: size must be less than or equal to the corresponding input dimension
- Reflection padding: size must be less than the corresponding input dimension
- Replication padding: output dimensions must remain positive

These changes help prevent runtime errors when users attempt to use large padding values.

## PR Checklist
- [x] The PR title and message follow our [commit guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#commit-message-format)
- [x] The PR is made against the correct branch
- [x] The PR is labeled with `docathon`
- [x] The PR is labeled with `module: nn`
- [x] The PR is labeled with `documentation`
- [x] The PR description includes a reference to the issue being fixed
- [x] The PR includes tests if applicable
- [x] The PR includes documentation changes
- [x] The PR has been tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155618
Approved by: https://github.com/AlannaBurke, https://github.com/malfet
2025-06-17 22:16:48 +00:00
937529f0b3 Pass by const ref instead of by value in StableIValue from (#156126)
I realize I was passing stable::Tensors by value (thus making a copy every time) which is not what I want from the `from` function that converts Ts to StableIValues. `from` should not mutate the input and should be read-only.

I asked an LLM whether this is API BC breaking (with an intuition that it shouldn't be), and it said no, cuz:
1. "Passing by const reference is more permissive than passing by value. e.g., if T is a type that has a deleted or inaccessible copy constructor (e.g., std::unique_ptr), the original code would have been invalid, while the new code would be valid." Nice. We are good with additive.
2. We didn't modify the original input before (cuz we took a copy) and we don't now (cuz we promise const).

Update: The LLM failed to mention primitives, with which we should not pass references around, so we are only changing the signatures of std::optional<T> and stable::Tensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156126
Approved by: https://github.com/swolchok
ghstack dependencies: #155367, #155977
2025-06-17 22:11:30 +00:00
4c0aa37dda Support stream capture of event record and wait nodes in cuda graphs (#155372)
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.

We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().

If external=False, the cudaEventRecord and cudaStreamWaitEvent API's
have a different meaning described here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events

In short, they will be used to experess fork and join operations in
the graph if external=False.

External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.

Finishes #146145

I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
2025-06-17 21:44:51 +00:00
8e02cd9c5a Skip cache related configs for cache config serialization (#156195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156195
Approved by: https://github.com/masnesral
2025-06-17 21:24:07 +00:00
3106a33e41 [fr] Fix one error in analysis script when subPG world size is smaller than global size (#156156)
Summary: We run into an interesting case when we see so many mismatches while lot of mismatch turns out to be a fully match. The reason is that we use the dump ranks (which is from 0 to 79) to compare against the local pg ranks (0 to 7) this leads to false positive of mismatches. We can just check whether dump ranks contain all expected ranks or not, that should be sufficient.

Test Plan:
Test with the failed case with the script and we now see the correct behavior + new unit test case.

Rollback Plan:

Differential Revision: D76775373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156156
Approved by: https://github.com/VieEeEw
2025-06-17 21:17:58 +00:00
bb462a6237 [cutlass backend] Fix prescreening non-deterministic problem (#156144)
Differential Revision: [D76642615](https://our.internmc.facebook.com/intern/diff/D76642615/)

What do we expect to see when we run two identical matmul back to back? We expect to see the second one spending no time in precompilation, autotuning and prescreening.

However, the introduction of prescreening bring some non-deterministics-ness. Basically, we have
1. prescreening of first matmul chooses a set of kernels to advance to autotuning
2. autotuning re-does the autotuning of the winners, potentially changing their timings a bit
3. second prescreening results in a slightly different set of kernels
4. since not all timings are present, an autotune is re-done.

With this diff:
```
SingleProcess AUTOTUNE benchmarking takes 3.8633 seconds and 134.7364 seconds precompiling for 32 choices and 24.4472 seconds prescreening
SingleProcess AUTOTUNE benchmarking takes 0.0003 seconds and 0.0027 seconds precompiling for 32 choices and 0.0006 seconds prescreening
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156144
Approved by: https://github.com/mlazos
2025-06-17 20:39:06 +00:00
cd66ff8030 [Docs] Convert to markdown to fix 155032 (#155520)
Fix #155032

-   quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html)
-  quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html)
-  quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html)
-  quantization.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization.html) vs [main](https://docs.pytorch.org/docs/main/quantization.html)
-  random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-17 20:29:45 +00:00
50940270ae [BE][3/X] Phase out usage of use_max_autotune() (#155849)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155849
Approved by: https://github.com/masnesral
2025-06-17 20:26:29 +00:00
b020971e78 [BE] fix typos in torchgen/ (#156083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156083
Approved by: https://github.com/jingsh
ghstack dependencies: #156079, #156082
2025-06-17 19:25:50 +00:00
a69785b3ec [BE] fix typos in tools/ (#156082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156082
Approved by: https://github.com/soulitzer
ghstack dependencies: #156079
2025-06-17 19:25:50 +00:00
ccea6ddac3 [BE] fix typos in cmake/ (#156079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156079
Approved by: https://github.com/Skylion007
2025-06-17 19:25:43 +00:00
5eb5c3700b [ROCm] enable batched eigen decomposition (syevD_batched) on ROCm (#154525)
This PR implements `Batched Eigen Decomposition` (syevD_batched) on ROCm by calling rocSolver directly.
cuSolver doesn't support syevD_batched and neither does hipSolver. Direct call to rocSolver is required.

`syevD_batched` will be used on ROCm if all the following conditions are met:
- `rocSolver version >= 3.26`
- input data type is `float` or `double`
- batch size >= 2

Otherwise, non-batched `syevD` will be used on ROCm (complex data types, batch size==1,  rocSolver <3.26)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154525
Approved by: https://github.com/Mellonta
2025-06-17 19:20:36 +00:00
ec08eb8ba2 Revert "[inductor][cutlass] binary remote cache (#156106)"
This reverts commit 9a2c669425379eb264f896390b8fcd8d3f2ce959.

Reverted https://github.com/pytorch/pytorch/pull/156106 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156106#issuecomment-2981533904))
2025-06-17 19:07:49 +00:00
4a26bb8a12 [C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)
Fixes #155668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155900
Approved by: https://github.com/ngimel
2025-06-17 18:59:44 +00:00
fc177801af Enable FP8 row-wise scaled-mm for sm12x (#155991)
## Update using Cutlass 3.x (2025/06/15)

Following @alexsamardzic's advice, I tried out Cutlass 3.x API and it's impressive (rated specs is 419 TFLOPS)

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|17.56
64|4096|4096|69.63
256|4096|4096|266.57
1024|4096|4096|339.28
4096|4096|4096|388.91

This uses the same SM100 template. The only difference is
- Cluster size is fixed to `<1,1,1>` since sm120 does not have multicast feature
- ~~Tile size is fixed to `<128,128,128>` due to default kernel schedule does not support `<64,128,128>`. I will work a bit on improve perf for small M.~~ Fixed. Use `KernelTmaWarpSpecializedPingpong` when TileShape.M == 64

Perf for small M is still bad since it seems like Cutlass does not support TileShape.M < 64 for this kernel. It's possible to boost perf a bit by using TileShape `<64,64,128>`.

## Original using SM89

I tried using cutlass FP8 row-wise scaled-mm for sm89 on sm120 (5090) and it works. I guess it makes sense because sm120 matmul uses the standard sm80 PTX instructions (`cp.async`+`mma` and friends).

Simple benchmark script

```python
import torch
from torch._inductor.utils import do_bench_using_profiling

N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
    A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
    B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
    scale_A = torch.ones(M, 1).cuda()
    scale_B = torch.ones(1, N).cuda()

    out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
    out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
    torch.testing.assert_close(out, out_ref)

    latency_us = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
    tflops = (2 * M * N * K) / latency_us / 1e9
    print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```

M | N | K | TFLOPS
---|---|---|---
16 | 4096 | 4096 | 25.73 TFLOPS
64 | 4096 | 4096 | 71.84 TFLOPS
256 | 4096 | 4096 | 86.40 TFLOPS
1024 | 4096 | 4096 | 112.12 TFLOPS
4096 | 4096 | 4096 | 121.24 TFLOPS

Accodring to [RTX Blackwell Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf), FP8 MMA with FP32 accumulate is 419 TFLOPS. So the result is quite bad here...

However, if I change `ThreadblockSwizzle` to `cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>`

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|27.13 TFLOPS
64|4096|4096|84.84 TFLOPS
256|4096|4096|96.75 TFLOPS
1024|4096|4096|110.21 TFLOPS
4096|4096|4096|122.98 TFLOPS

Small M slightly improves, but large M is still bad.

If I further change `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3` for M>256, which is taken from [cutlass example 58](https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu), I get the following results

 M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|313.28
4096|4096|4096|376.73

Which is much closer to hardware limit. And it also means this kernel is sufficient to get the most perf out of sm120. Only need better tuned configs.

To make sure this high perf is only obtainable with `GemmIdentityThreadblockSwizzle<1>` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`, I also try using `ThreadblockSwizzleStreamK` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`

 M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|144.03
4096|4096|4096|156.86

A bit better than current configs, but still very far away from hardware limit.

@alexsamardzic I noticed you chose this configs in #149978. Do you have any numbers how the current configs perform on sm89?

Update: Using triton codegen-ed from inductor `compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs")`

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|25.60
64|4096|4096|71.74
256|4096|4096|161.64
1024|4096|4096|185.89
4096|4096|4096|215.53

Better than default configs, but still far away from the config above for compute-bound

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155991
Approved by: https://github.com/drisspg, https://github.com/eqy
2025-06-17 18:52:44 +00:00
e323d46b61 ELU: compute ELU(0) with the cheaper definition (#155765)
Both halves of the ELU definition yield 0 when evaluated at 0. Let's choose the half that doesn't require expm1. (I have no particular evidence that the input is often 0 in any case, but this seems like a free win.)

Differential Revision: [D76481038](https://our.internmc.facebook.com/intern/diff/D76481038/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155765
Approved by: https://github.com/ezyang
2025-06-17 18:20:22 +00:00
8b0e0e4f23 [dynamo] Support tracing of functools.lru_cached method (#156125)
Fixes https://github.com/pytorch/pytorch/issues/155841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156125
Approved by: https://github.com/williamwen42
2025-06-17 18:11:32 +00:00
fc5ae12293 Fix issue with right-nav (#156119)
Enable on page right nav. For autosummary, we need to set `"show_toc_level": 2` so that navigation is enabled. Example:
* Main: https://docs.pytorch.org/docs/main/special.html - right nav (under On this page) is empty.
* Preview: https://docs-preview.pytorch.org/pytorch/pytorch/156119/special.html - right nav (under On this page) has a all the object listed
<img width="1125" alt="Screenshot 2025-06-16 at 2 48 16 PM" src="https://github.com/user-attachments/assets/0790bb72-5997-4542-9847-0a89be4598c0" />
vs
<img width="1030" alt="Screenshot 2025-06-16 at 2 48 55 PM" src="https://github.com/user-attachments/assets/4897c49c-044d-4bea-a8cd-490c90cca2b0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156119
Approved by: https://github.com/albanD
2025-06-17 18:09:51 +00:00
32c1611263 [CI][run_test] Fix rerun logic for failing at exit (#155853)
Sometimes a test file reports success according to pytest, but fails afterwards, and the rerun logic doesn't handle that correctly.

The name of the last run test is saved in order to do more efficient reruns (target the last run test for a rerun without rerunning the entire file).  This usually correct, ex test fails and pytest catches it -> lastrun = the test that failed, test segfaults (pytest doesn't catch) -> lastrun is the test that segfaulted.  But sometimes pytest reports a success, but the process has non zero exit code.  The two cases I know of are hangs and double freeing at exit.  In this case, its unclear which test caused the failure, so lastrun is set to be the first test that ran in that session, so that during the next session it will start from the beginning in an attempt to replicate the error (an alternate solution would be to just fail and not rerun, which might be the better option).  But then it reruns with runsingle, which prevents lastrun from being reset (not sure why, I'm pretty sure there's no difference between resetting and not normally), so lastrun becomes the last test that ran, and its not always true that lastrun is the one that caused it. Then on the next run, it starts from the last test and the process now exits cleanly

Short term solution here: ensure the lastrun is always set to the initial value if the session succeeds.  This is correct even in the normal path because initial value shouldn't change in that case

Things that still need to be fixed:
* log says "running single test" which is not true
* no xml reports get generated here
* also no xml reports get generated on segfault
* docs for this

I think I have a PR that fixes the above but its old so I need to take another look

Testing:
This from when I was based on a commit that had a hang for macs, and before I added the skips in inductor array ref:
cc862d2c14

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155853
Approved by: https://github.com/malfet
2025-06-17 17:51:40 +00:00
6629eaf0c6 [CMAKE] Fix torch_cpu relink logic if metal shaders are recompiled (#156193)
Beforehand, shader recompilation updated `caffe2/aten/src/ATen/metallib_dummy.cpp` but `torch_cpu` were dependent on `aten/src/ATen/metallib_dummy.cpp`

Test plan: Run `python3 ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal` and observe that torch_cpu is being relinked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156193
Approved by: https://github.com/manuelcandales
2025-06-17 17:49:33 +00:00
a4ea242edc [MPS] Implement scan metal kernels (#156100)
Implements metal kernels for scan operations:
- Migrates cumsum and cumprod from MPSGraph implementation to Metal.
- Fixes #154881
- Adds MPS backend support for cummin and cummax

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156100
Approved by: https://github.com/malfet
2025-06-17 17:44:22 +00:00
9a5c59368d Replace all RAIIATH with Tensor in libtorch_agnostic test, test some APIs (#155977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155977
Approved by: https://github.com/albanD
ghstack dependencies: #155367
2025-06-17 17:36:31 +00:00
b115a4c03a torch::stable::Tensor beginnings, mainly mem mgmt (#155367)
```
// The torch::stable::Tensor class is a highlevel C++ header-only wrapper around
// the C shim Tensor APIs. We've modeled this class after TensorBase, as custom
// op kernels only really need to interact with Tensor metadata (think sizes,
// strides, device, dtype). Other functions on Tensor (like empty_like) should
// live like the ATen op that they are and exist outside of this struct.
//
// There are several goals of this class over AtenTensorHandle and
// RAIIAtenTensorHandle:
// 1. torch::stable::Tensor is a nicer UX much closer to torch::Tensor than the
//    C APIs with AtenTensorHandle. Under the hood we still call to these C shim
//    APIs to preserve stability.
// 2. RAIIAtenTensorHandle boils down to a uniq_ptr that forces the user to pass
//    around ownership. This makes it difficult to pass one input into 2
//    different functions, e.g., doing something like c = a(t) + b(t) for
//    stable::Tensor t. Thus, we use a shared_ptr here.
```

This PR:
- exemplifies the above
- adds test cases in libtorch_agnostic to make sure the file actually works
- includes the results of a battle with template specialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155367
Approved by: https://github.com/albanD
2025-06-17 17:36:31 +00:00
2625c70aec Update CODEOWNERS (#156182)
as title says. removing me as codeowner for cpp extensions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156182
Approved by: https://github.com/albanD
2025-06-17 17:15:41 +00:00
a24afbff3f Support torch.cuda.*Tensor in Dynamo (#156107)
Summary:
This PR adds support for torch.cuda.FloatTensor and friends in Dynamo.
These are indeed legacy APIs, but that doesn't stop us from adding
support for them in torch.compile.

I add support for these in the same way that we support torch.Tensor:
these APIs can be safely put into the Dynamo graph.

Fixes #130722

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156107
Approved by: https://github.com/williamwen42
2025-06-17 16:31:10 +00:00
9a2c669425 [inductor][cutlass] binary remote cache (#156106)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally

Test Plan:
## prove that we can upload successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Rollback Plan:

Reviewed By: henrylhtsang

Differential Revision: D76454741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156106
Approved by: https://github.com/henrylhtsang

Co-authored-by: atalman <atalman@fb.com>
2025-06-17 16:24:10 +00:00
d66b4bcc3f [inductor][triton pin] Support triton builtins after #7054 (#156031)
Triton's PR 7054 modifies the builtins to take _semantic as a kwarg instead of _builder.

To handle this, this PR checks the signature of tl.core.view (to see if it takes _builder or _semantic), and adds a wrapper converting _semantic to _builder if the new _semantic kwarg is being used.

(Previously-)failing test: `python test/inductor/test_cooperative_reductions.py -k test_welford_non_power_of_2_rsplit_persistent_True_x_9_r_8000_rsplit_37`

Differential Revision: [D76801240](https://our.internmc.facebook.com/intern/diff/D76801240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156031
Approved by: https://github.com/NikhilAPatel
2025-06-17 16:09:55 +00:00
d083841c0e Fix a small sphinx markup error (#156061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156061
Approved by: https://github.com/colesbury
2025-06-17 15:36:02 +00:00
0079c80b35 [CI] Do not constrain memory for ROCm testing in CI (#156115)
Fixes ROCm OOMs introduced by https://github.com/pytorch/pytorch/pull/155631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156115
Approved by: https://github.com/jeffdaily
2025-06-17 15:30:36 +00:00
7fcad0231c [Docs] Convert to markdown to fix 155025 (#155789)
Related to #155025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155789
Approved by: https://github.com/svekars
2025-06-17 15:08:14 +00:00
4886ba64dc [BE] Refactor functions from optional_submodules (#155954)
And use `pathlib.Path` instead of `os.path`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155954
Approved by: https://github.com/Skylion007
ghstack dependencies: #155947
2025-06-17 14:41:52 +00:00
cf90c9f8d1 [Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/ngimel, https://github.com/cyyever

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-06-17 14:15:49 +00:00
42015db6a9 [BE] fix typos in benchmarks/ (#156077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156077
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #156069
2025-06-17 13:12:18 +00:00
0a0023d984 Enable NCCL zero-copy (user buffer registration) for FSDP2 (#150564)
In recent versions NCCL introduced support for "user buffer registration", i.e., allowing user-owned memory (such as regular PyTorch tensors) to be "registered" (pinned, page-locked, etc.) with all the various hardware (NVLink, InfiniBand, ...) in order to support zero-copy transfers and thus accelerate communication and reduce resource footprint of NCCL's kernels (which reduces contention).

This was already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses this, via a memory pool to allow caching and reusing.

FSDP2 is also particularly suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself, since it anyways needs to (de)interleave the parameters to/from these private buffers.

This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150564
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/syed-ahmed
2025-06-17 12:54:58 +00:00
11bb1ece50 [CI] Fix triton version split issue (#155670)
Fix a bug caused by #155313, refer https://github.com/pytorch/pytorch/actions/runs/15576592378/job/43862613039?pr=154194#step:7:652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155670
Approved by: https://github.com/atalman, https://github.com/EikanWang
2025-06-17 12:42:40 +00:00
1cce73b5f4 [build] Change --cmake{,-only} arguments to envvars to support modern Python build frontend (#156045)
See also:

- #156029
- #156027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156045
Approved by: https://github.com/ezyang
ghstack dependencies: #156040, #156041
2025-06-17 11:40:24 +00:00
57084ca846 [BE][setup] allow passing pytorch-specific setup.py options from envvars (#156041)
See also:

- #156029
- #156027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156041
Approved by: https://github.com/ezyang
ghstack dependencies: #156040
2025-06-17 11:40:24 +00:00
092aed1b18 [Intel GPU] Enable GQA and different head_dim of value for SDPA (#150992)
In OneDNN v3.7, SDPA doesn't support num_head_q != num_head_kv (aka GQA) and head_dim_qk != head_dim_v.
In OneDNN v3.8, SDPA supports these two scenarios. Enable them in this PR.   SDPA UTs pass in local test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150992
Approved by: https://github.com/guangyey, https://github.com/drisspg, https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-17 11:09:51 +00:00
4a8f5e752b [FSDP2] explain user contract for fully_shard (#156070)
<img width="896" alt="Screenshot 2025-06-16 at 1 36 00 AM" src="https://github.com/user-attachments/assets/7cdea256-2454-49c7-8b32-24549a13134d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156070
Approved by: https://github.com/mori360
2025-06-17 10:03:19 +00:00
8d7ee0f4fb [BE] fix typos in .ci/, .circleci/, .github/ (#156069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156069
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-17 09:43:14 +00:00
2e0e08588e [BE][PYFMT] migrate PYFMT for torch/[e-n]*/ to ruff format (#144553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144553
Approved by: https://github.com/ezyang
ghstack dependencies: #144551
2025-06-17 08:18:47 +00:00
cyy
95cb42c45d Use CMAKE_COLOR_DIAGNOSTICS (#154583)
`CMAKE_COLOR_DIAGNOSTICS` was introduced in CMake 2.24. Use it to simplify CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154583
Approved by: https://github.com/ezyang
2025-06-17 04:52:31 +00:00
cyy
d43c0bdf46 [CI] Move ASAN jobs to clang-18 (#149099)
Use clang-18 for ASAN jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149099
Approved by: https://github.com/ezyang
2025-06-17 04:51:07 +00:00
7b0118884e [invoke_subgraph][inductor] Dont fallback on complex dtype (#155885)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155885
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #155828
2025-06-17 04:47:12 +00:00
ffcc6fea7b [invoke_subgraph] Ignore redundantly nested invoke_subgraph (#155828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155828
Approved by: https://github.com/zou3519
2025-06-17 04:47:12 +00:00
b1713c6655 [MPS][Testing][BE] Fix samples for full_like (#156026)
Now that device is known, one can avoid creating tensors of `torch.double` type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156026
Approved by: https://github.com/dcci
ghstack dependencies: #156121
2025-06-17 04:46:26 +00:00
82672206b7 [SymmMem] Make get_rank_to_global_rank return const ref (#156117)
Avoiding a copy, not expecting a caller to change its value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156117
Approved by: https://github.com/fegin
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116
2025-06-17 04:13:18 +00:00
eea3bcb3d1 [SymmMem] Cache rank_to_global_rank exchange (#156116)
The rank-to-global-rank exchange is a major overhead in `NVSHMEMSymmetricMemory` creation.
We should cache its result on per-group basis.

Before this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 18
```

After this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156116
Approved by: https://github.com/fegin, https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971, #155975
2025-06-17 04:12:37 +00:00
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
45382b284d [cutlass backend] changes how gpu_kernels_o are handled for cutlass (#155875)
Currently, we do it a bit hacky: Look at all the .o we have from this session, add them all to AOTI. This for example doesn't work if we do multiple AOTI compilation in one session, without clearing the inductor cache.

Also I want to change how cutlass .so are compiled. Hence this change.

This change is broken down since @coconutruben is trying to make a change to the same files too.

Differential Revision: [D76563003](https://our.internmc.facebook.com/intern/diff/D76563003/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155875
Approved by: https://github.com/ColinPeppler
2025-06-17 02:06:54 +00:00
cyy
64bb6317a5 [Accelerator] Fix Python typing in accelerator (#152394)
There are some changes:
1. Use keywords for arguments if possible.
2. `__exit__ ` of `device_index` is changed to return None.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152394
Approved by: https://github.com/XuehaiPan, https://github.com/guangyey, https://github.com/ezyang

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-17 01:27:40 +00:00
1f0eb79e3e [dynamo] fix KeyError in LOAD_FAST_CHECK (#155763)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155763
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #155761
2025-06-17 00:54:16 +00:00
4e833c2005 [dynamo] support tracing weakref callback (#155761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155761
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-17 00:54:16 +00:00
e6252f62ef [ONNX] Implements converter for higher order ops scan (#154513)
Fixes #151327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154513
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-06-17 00:54:07 +00:00
b618817479 [PGO] include ints/floats in suggested whitelist (#155980)
Made the mistake of dropping these

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155980
Approved by: https://github.com/bobrenjc93
2025-06-17 00:41:38 +00:00
4311aea5e7 [AOTInductor] Add class declarations to torch._C._aoti interface file (#155128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155128
Approved by: https://github.com/desertfire
ghstack dependencies: #155149
2025-06-17 00:10:57 +00:00
82fb904140 Add warning for incorrected grad results at world size 1 (#154928)
Add warning for the issue discussed at https://github.com/pytorch/pytorch/issues/144045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154928
Approved by: https://github.com/weifengpy
2025-06-17 00:08:04 +00:00
eb4cf59ecd Add FSDP2 logging (#155826)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155826
Approved by: https://github.com/weifengpy
2025-06-16 23:49:58 +00:00
6e2992a998 Remove unused Azure pipeline trigger script (#156134)
## Summary
- delete `.circleci/scripts/trigger_azure_pipeline.py`

## Testing
- `python3 -m pip install flake8`
- `python3 -m flake8 .circleci/scripts`

------
https://chatgpt.com/codex/tasks/task_e_6850a55f530c83279036800308fb6871
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156134
Approved by: https://github.com/izaitsevfb
2025-06-16 23:42:52 +00:00
4781b0ee60 [SymmMem] Add NVSHMEM GET support to Triton (#155890)
Adds NVSHMEM GET operation support for Triton kernels:

- Add `getmem_block` core.extern wrapper for nvshmemx_getmem_block
- Add basic `test_triton_get` for 2-rank GET operation
- Add `test_triton_get_ring` for ring topology GET across arbitrary ranks

**Tests:**
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py`

`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_get`

```python
@skipIfRocm
@requires_triton()
def test_triton_get(self) -> None:
   @triton.jit
   def get_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
       nvshmem.getmem_block(dst_ptr, src_ptr, numel, peer)

   # ... setup code ...

   val = 7
   inp = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(
       val if rank == 0 else -1
   )
   out = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(-1)

   peer = 1 - rank
   if rank == 1:
       # Rank 1 gets data from rank 0
       get_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] got data from peer {peer}")
```

```

[Rank 0] inp buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:0', dtype=torch.int8)
[Rank 1] inp buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:1', dtype=torch.int8)
...
[Rank 1] out buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:1', dtype=torch.int8)
...
[Rank 1] got data from peer 0

----------------------------------------------------------------------
Ran 2 tests in 17.046s

OK
```

```python
@skipIfRocm
@requires_triton()
def test_triton_get_ring(self) -> None:
   @triton.jit
   def get_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
       nvshmem.getmem_block(dst_ptr, src_ptr, numel, peer)

   # ... setup code ...

   # Ring topology: each rank gets data from the rank to its left
   peer = (rank - 1) % world_size

   # All ranks execute the get operation
   get_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] got data from peer {peer}")

```

```
Output (8 GPUs):

[Rank 0] inp buffer: tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0', dtype=torch.int8)
[Rank 2] inp buffer: tensor([2, 2, 2, 2, 2, 2, 2, 2], device='cuda:2', dtype=torch.int8)
[Rank 5] inp buffer: tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:5', dtype=torch.int8)
[Rank 6] inp buffer: tensor([6, 6, 6, 6, 6, 6, 6, 6], device='cuda:6', dtype=torch.int8)
[Rank 3] inp buffer: tensor([3, 3, 3, 3, 3, 3, 3, 3], device='cuda:3', dtype=torch.int8)
[Rank 1] inp buffer: tensor([1, 1, 1, 1, 1, 1, 1, 1], device='cuda:1', dtype=torch.int8)
[Rank 2] out buffer: tensor([1, 1, 1, 1, 1, 1, 1, 1], device='cuda:2', dtype=torch.int8)
[Rank 5] out buffer: tensor([4, 4, 4, 4, 4, 4, 4, 4], device='cuda:5', dtype=torch.int8)
[Rank 0] out buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:0', dtype=torch.int8)
[Rank 2] got data from peer 1
[Rank 5] got data from peer 4
[Rank 0] got data from peer 7
[Rank 7] inp buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:7', dtype=torch.int8)
[Rank 6] out buffer: tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:6', dtype=torch.int8)
[Rank 3] out buffer: tensor([2, 2, 2, 2, 2, 2, 2, 2], device='cuda:3', dtype=torch.int8)
[Rank 6] got data from peer 5
[Rank 3] got data from peer 2
[Rank 1] out buffer: tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 4] inp buffer: tensor([4, 4, 4, 4, 4, 4, 4, 4], device='cuda:4', dtype=torch.int8)
[Rank 7] out buffer: tensor([6, 6, 6, 6, 6, 6, 6, 6], device='cuda:7', dtype=torch.int8)
[Rank 7] got data from peer 6
[Rank 4] out buffer: tensor([3, 3, 3, 3, 3, 3, 3, 3], device='cuda:4', dtype=torch.int8)
[Rank 4] got data from peer 3

----------------------------------------------------------------------
Ran 1 test in 41.045s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155890
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
2025-06-16 23:18:15 +00:00
bb1f3d1a55 [MPSInductor] Improve _default dtype inference (#156121)
By just adding 'mps' as one of the backend options and fixing reduction op to actually return tuple of CSEVariable's rather than tuple of strings

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156121
Approved by: https://github.com/dcci
2025-06-16 23:11:53 +00:00
508cdc4fc9 [BE][4/X] Phase out usage of use_max_autotune() (#155850)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155850
Approved by: https://github.com/masnesral
2025-06-16 23:10:26 +00:00
f2d70898c6 [nativert] Move OpKernel to PyTorch core (#156011)
Summary:
Moves OpKernel base class to PyTorch core. It is an abstract interface representing a kernel, which is responsible for executing a single Node in the graph.

Torch Native Runtime RFC: pytorch/rfcs#72

Test Plan:
buck2 run mode/dev-nosan caffe2/test/cpp/nativert:op_kernel_test

Rollback Plan:

Differential Revision: D76525939

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156011
Approved by: https://github.com/zhxchen17
2025-06-16 22:53:10 +00:00
35ecd7c2d4 Revert "[Cutlass] Fix buffer missing issues (#155897)"
This reverts commit 9bd42c15707a4b410ee005d5916e882a7db432bb.

Reverted https://github.com/pytorch/pytorch/pull/155897 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/155897#issuecomment-2978391416))
2025-06-16 22:44:39 +00:00
190f76fa31 Revert "Implement guard collectives (#155558)"
This reverts commit 5a5a05a6a3be376130848e235df73b752eef0230.

Reverted https://github.com/pytorch/pytorch/pull/155558 on behalf of https://github.com/malfet due to Hmm, may be I'm looking at the wrong metric, but c92f1075aa/1 shows that test started to pass after PR were reverted ([comment](https://github.com/pytorch/pytorch/pull/155558#issuecomment-2978337152))
2025-06-16 22:26:52 +00:00
c92f1075aa Fix if condition for CUDA 12.9 Win build (#156108)
follow-up for https://github.com/pytorch/pytorch/pull/155799/files
Currently the last if condition will be executed for CUDA 12.9, overriding previous CUDA_ARCH_LIST. We should exclude 12.9 from the last if condition to fix this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156108
Approved by: https://github.com/atalman
2025-06-16 21:57:34 +00:00
cce4d213a6 Remove non-header-only dep from c10_headers target (#155858)
It depends on /c10/util:base which is not header-only.

Differential Revision: [D76552750](https://our.internmc.facebook.com/intern/diff/D76552750/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D76552750/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155858
Approved by: https://github.com/ezyang
2025-06-16 21:41:25 +00:00
a24ce67dee [ez] fix grammar error in comment (#156053)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156053
Approved by: https://github.com/jingsh
ghstack dependencies: #155982, #155996
2025-06-16 20:53:07 +00:00
247113e03e Add size_hint_or_throw (#155615)
## Summary
`TypeError("Cannot convert symbols to int")` is coming up more recently since more unbacked symints are making its way into Inductor. See https://github.com/pytorch/pytorch/issues/155484
- One way to deal with this is to add `size_hint_or_throw` to throw if we try to pull a hint from an unbacked expr.
- Then, repurpose `size_hint` to accommodate unbacked symints by setting a default fallback or adding an appropriate fallback for each callsite.

This PR adds `size_hint_or_throw` which will throw if unbacked symints exist
- use `size_hint_or_throw` -- usually when the callee can try/catch the exception or guards against unbacked symints

------
with Codex
https://chatgpt.com/codex/tasks/task_e_684869dfc740832882c88d05534cc8f9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155615
Approved by: https://github.com/ezyang, https://github.com/laithsakka, https://github.com/jingsh

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-16 20:46:51 +00:00
008345be9d Fix #155018 (convert distributed rst to markdown) (#155528)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155018

Docs comparison (check out the 'new' whenever docs build)

1. distributed.checkpoint ([old](https://docs.pytorch.org/docs/main/distributed.checkpoint.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.checkpoint.html))
2. distributed.elastic ([old](https://docs.pytorch.org/docs/main/distributed.elastic.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.elastic.html))
3. distributed.fsdp.fully_shard ([old](https://docs.pytorch.org/docs/main/distributed.fsdp.fully_shard.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.fsdp.fully_shard.html))
4. distributed.optim ([old](https://docs.pytorch.org/docs/main/distributed.optim.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.optim.html))
5. distributed.pipelining ([old](https://docs.pytorch.org/docs/main/distributed.pipelining.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.pipelining.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155528
Approved by: https://github.com/wz337, https://github.com/svekars
2025-06-16 20:46:09 +00:00
eb2af14f8e [PT2][partitioners] Add aten.split to view_ops list [relanding #155424] (#155943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155943
Approved by: https://github.com/ShatianWang
2025-06-16 20:42:54 +00:00
03488d820c Revert "[MPS][Testing][BE] Fix samples for full_like (#156026)"
This reverts commit 2d832c9587fd99db295b62d0c9b459d509c19d06.

Reverted https://github.com/pytorch/pytorch/pull/156026 on behalf of https://github.com/atalman due to Sorry breaks MPS tests: test_ops.py::TestMathBitsCPU::test_neg_view_full_like_cpu_float64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15683608879/job/44182730620) [HUD commit link](2d832c9587) ([comment](https://github.com/pytorch/pytorch/pull/156026#issuecomment-2977903074))
2025-06-16 19:50:26 +00:00
6d2155db49 [PGO] no code state update on dynamic=False (#155961)
Summary:
When tensor size changes are detected on `dynamic=False`, overwrites the PGO state with the newest static shapes to reflect the latest frame state, instead of updating automatic dynamic.

A longer term solution, if we move to shared PGO state between multiple jobs, would be to update automatic dynamic, but avoid suggesting/logging the whitelist (compiling with `dynamic=False` should already override any dynamic PGO that's read, so we're fine there). This way if any particular job runs with `dynamic=False`, it won't statically overwrite the entire PGO state if it's shared with many other jobs.

Test Plan:
test/dynamo/test_pgo.py

Rollback Plan:

Differential Revisi,on: D76630499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155961
Approved by: https://github.com/bobrenjc93
2025-06-16 19:47:55 +00:00
5a5a05a6a3 Implement guard collectives (#155558)
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:

1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires recompile, force everyone to recompile

One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.

I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
2025-06-16 19:46:16 +00:00
61b271e0f3 Revert "Implement guard collectives (#155558)"
This reverts commit 38e5e81e55fc5d85d6cf8a83c96c88578995e3fe.

Reverted https://github.com/pytorch/pytorch/pull/155558 on behalf of https://github.com/atalman due to Breaks CI, sorry: [GH job link](https://github.com/pytorch/pytorch/actions/runs/15683161593/job/44181274826) [HUD commit link](38e5e81e55) ([comment](https://github.com/pytorch/pytorch/pull/155558#issuecomment-2977871178))
2025-06-16 19:40:46 +00:00
7cf38d2a05 Make benchmark by op for TS model work with sample inputs (#155988)
Summary: Add pickle input type to allow for running ptvsc2_predictor_bench to get individual node benchmarks for SR

Test Plan:
```
buck2 run mode/opt caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- --scripted_model=/data/users/georgiaphillips/models/742055223/1/742055223_1.predictor.local --pt_inputs=/data/users/georgiaphillips/models/742055223/0/mix.pt --pt_enable_static_runtime=1 --compare_results=0 --iters=1000 --warmup_iters=100 --num_threads=1 --do_profile=1 --method_name=${MODULE_NAME}.forward --set_compatibility --do_benchmark=1 --pytorch_predictor_default_model_id=${MODEL_ENTITY_ID}_${SNAPSHOT_ID} --input_type=pickle
```

Rollback Plan:

Reviewed By: dolpm

Differential Revision: D76554920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155988
Approved by: https://github.com/dolpm
2025-06-16 19:15:07 +00:00
2dc1627451 [doc] Add documentation for division by zero behavior in autograd (#155987)
Fixes #128796

This PR adds documentation about the behavior of division by zero operations in PyTorch's autograd system. The documentation explains:

1. How division by zero produces `inf` values following IEEE-754 floating point arithmetic
2. How autograd handles these cases and why masking after division can lead to `nan` gradients
3. Provides concrete examples showing the issue
4. Recommends two solutions:
   - Masking before division
   - Using MaskedTensor (experimental API)

The documentation is added to the autograd notes section, making it easily discoverable for users who encounter this common issue.

This addresses the original issue #128796 which requested better documentation of this behavior to help users avoid common pitfalls when dealing with division by zero in their models.

dditional changes:
- Fixed formatting consistency by replacing curly apostrophes with straight apostrophes in the existing documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155987
Approved by: https://github.com/soulitzer

Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
2025-06-16 19:02:12 +00:00
907d0931cc [ca] default on in CI, with fallback for tests in test/compiled_autograd_skips/ (#155480)
For every test that is ran with PYTORCH_TEST_WITH_DYNAMO=1, turn on compiled autograd via config if it is not skipped
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155480
Approved by: https://github.com/jansel
ghstack dependencies: #155521
2025-06-16 18:45:03 +00:00
9ff9c28fe8 [ca] Functionalize AccumulateGrad (#155521)
This PR changes compiled autograd's handling of gradient accumulation, by proxying it as a `call_accumulate_grad`, which does the .grad mutation in python bytecode for dynamo to see. For eager, the only change is the leaf invariant check was moved up.

Before:
- Compiled Autograd Engine: proxies call to inductor accumulate_grad op
- Dynamo: polyfills the inductor accumulate_grad op (not respecting all of the accumulateGrad implementation e.g. sparse, gradient layout contract)
```python
        new_grad_strided: "f32[s21]" = torch.empty_like(getitem_1);  getitem_1 = None
        copy_: "f32[s21]" = new_grad_strided.copy_(aot3_tangents_1);  copy_ = None
```
- AOTAutograd: functionalizes the copy_

After:
- Compiled Autograd Engine: proxies call to `call_accumulate_grad`, which calls `torch._dynamo.compiled_autograd.ops.AccumulateGrad`/`AccumulateGrad_apply_functional_no_hooks_ivalue`, similar to other functional autograd implementations, but also sets .grad from python. Hooks are still handled separately from this call.
- Dynamo: `torch._dynamo.compiled_autograd.ops.AccumulateGrad` was allow_in_graph'd
- AOTAutograd: traces into the op, with FunctionalTensors.

While functionalizing the tensors, we insert an autograd Error node to ensure that we don't use the autograd meta from tracing. This clashes with the "leaf variable has been moved into the graph interior" error check, I could not find a way to identify a FunctionalTensor subclass from C++, so I bypass that for Error nodes in the compiled case.

In the CI PR, this fixes 19 tests relating to sparse tensors, and more are hidden by an earlier failure in dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155521
Approved by: https://github.com/jansel
2025-06-16 18:45:02 +00:00
42ff6a4a5c [Inductor] Delay codegen for fallback arguments and improve typing (#154371)
Delays code generation for arguments to fallback ops.  This is inspired by #155642, and likely fixes similar memory leaks.

Additionally, prepare for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:

1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
2025-06-16 18:00:04 +00:00
4162c0f702 [BE][setup] gracefully handle envvars representing a boolean in setup.py (#156040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156040
Approved by: https://github.com/malfet
2025-06-16 17:56:31 +00:00
f48a157660 [aoti] Add more to error message (#155974)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155974
Approved by: https://github.com/yushangdi
2025-06-16 17:49:52 +00:00
fbd88ae2b5 Convert to markdown: checkpoint.rst (#156009)
Related to #155014

Use two commits to have a try.
```bash
 1800  git mv docs/source/checkpoint.rst docs/source/checkpoint.md
 1802  git commit -m "[Docs] Rename checkpoint.rst"
 1803  git push origin ckpoint

# update the markdown file
 1805  git add .
 1806  git commit -m "modify checkpoint.md"
 1807  git push origin ckpoint
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156009
Approved by: https://github.com/svekars
2025-06-16 17:48:23 +00:00
a10024d7de Convert complex_numbers.rst to markdown (#156039)
Related to #155014

Have a try by following https://github.com/pytorch/pytorch/pull/155899#issuecomment-2974715750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156039
Approved by: https://github.com/svekars
2025-06-16 17:24:37 +00:00
e9fdaf8701 Revert "[Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)"
This reverts commit e375d21bb9b0ef6fefe7a8af5a054a17de8c63c9.

Reverted https://github.com/pytorch/pytorch/pull/155109 on behalf of https://github.com/malfet due to Looks like it broke ROCM tests ([comment](https://github.com/pytorch/pytorch/pull/155109#issuecomment-2977428354))
2025-06-16 17:22:55 +00:00
45596ec58f Delete tools/onnx/update_default_opset_version.py (#156055)
The tool is no longer relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156055
Approved by: https://github.com/titaiwangms
2025-06-16 17:21:36 +00:00
365ce465f3 Revert "[C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)"
This reverts commit 8142a0286016e63a0e91b5667e1fb1a5e868ffd7.

Reverted https://github.com/pytorch/pytorch/pull/155900 on behalf of https://github.com/clee2000 due to causing some sort of hang? in test_distributed_spawn [GH job link](https://github.com/pytorch/pytorch/actions/runs/15678895788/job/44168117193) [HUD commit link](8142a02860) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/155900#issuecomment-2977365699))
2025-06-16 16:59:25 +00:00
2a4e357192 Fix compilation warning with gcc14 (#155934)
Note that nccl still doesn't work so you have to build with `USE_NCCL=0` @eqy is that something being tracked there?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155934
Approved by: https://github.com/malfet, https://github.com/janeyx99
2025-06-16 16:43:15 +00:00
503362d019 Revert "Unify dynamic shapes APIs naming 2 (expect_true and check) (#155776)"
This reverts commit 603a54a9b33e1aabe1407721d7935b881a160968.

Reverted https://github.com/pytorch/pytorch/pull/155776 on behalf of https://github.com/atalman due to failing internal build ([comment](https://github.com/pytorch/pytorch/pull/155776#issuecomment-2977041192))
2025-06-16 15:13:53 +00:00
b8d96c3f78 Revert "[cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)"
This reverts commit 47c8810b5275179833d6b33ca3d70922f485272c.

Reverted https://github.com/pytorch/pytorch/pull/154170 on behalf of https://github.com/atalman due to failing torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/154170#issuecomment-2976990461))
2025-06-16 14:59:01 +00:00
013dfeabb4 [BE] fix typos in top-level files (#156067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156067
Approved by: https://github.com/malfet
ghstack dependencies: #156066
2025-06-16 14:56:07 +00:00
6c493e2b14 [BE] add codespell linter (#156066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156066
Approved by: https://github.com/malfet
2025-06-16 14:56:07 +00:00
2d832c9587 [MPS][Testing][BE] Fix samples for full_like (#156026)
Now that device is known, one can avoid creating tensors of `torch.double` type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156026
Approved by: https://github.com/dcci
2025-06-16 14:27:42 +00:00
831c9010c7 [BE] Remove non-existing operator from unimplemented list (#156025)
Never heard of torch.login :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156025
Approved by: https://github.com/dcci
2025-06-16 14:14:58 +00:00
38e5e81e55 Implement guard collectives (#155558)
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:

1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires recompile, force everyone to recompile

One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.

I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
2025-06-16 14:09:14 +00:00
05faba4028 Bump requests from 2.32.2 to 2.32.4 in /.github (#155491)
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.32.2...v2.32.4)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.32.4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-16 06:48:08 -07:00
d6ee5144ca [xla hash update] update the pinned xla hash (#156064)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156064
Approved by: https://github.com/pytorchbot
2025-06-16 11:11:10 +00:00
8142a02860 [C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)
Fixes #155668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155900
Approved by: https://github.com/ngimel
2025-06-16 10:55:47 +00:00
bf7e290854 Add __main__ guards to jit tests (#154725)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

In jit tests:

- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154725
Approved by: https://github.com/clee2000
2025-06-16 10:28:45 +00:00
f810e98143 [ONNX] Update default opset to 18 (#156023)
Update default opset for the torchscript exporter to 18 to match the dynamo exporter, because support was actaully added and tested in https://github.com/pytorch/pytorch/pull/118828. In the next version we should plan to update to opset 21 or higher. This change also removes the hard limit on the torchscript exporter for more flexibility.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156023
Approved by: https://github.com/Skylion007
2025-06-16 08:40:49 +00:00
39c605e8b3 remove allow-untyped-defs from context.py (#155622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155622
Approved by: https://github.com/Skylion007
2025-06-16 07:38:34 +00:00
d9799a2ee7 Support boolean tensor for torch.fused_moving_avg_obs_fake_quant on CUDA (#153699)
Fixes #153310

As the title

**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fused_obs_fake_quant_moving_avg
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153699
Approved by: https://github.com/mingfeima, https://github.com/jerryzh168
2025-06-16 07:10:06 +00:00
156b28e62a [audio hash update] update the pinned audio hash (#155648)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155648
Approved by: https://github.com/pytorchbot
2025-06-16 03:57:28 +00:00
c620d0b5c7 convert: rst to myst pr2/2 (#155911)
Fixes #155038
parent [PR](https://github.com/pytorch/pytorch/pull/155375) (made two PRs to pass sanity check)
this PR converts the following three .rst files with the mentioned referenced in each file

- [torch.compiler_faq](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_faq.rst)
  - torch.compiler_troubleshooting
  - nonsupported_numpy_feats
  - torchdynamo_fine_grain_tracing

- [torch.compiler_fine_grain_apis](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_fine_grain_apis.rst)
  - None

- [torch.compiler_get_started](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_get_started.rst)
  - torch.compiler_overview
  - torch.compiler_api
  - torchdynamo_fine_grain_tracing

I made the suggested edits by the maintainers as commented in the parent PR
(used git mv on all files, yet it still appeared as delete-create action)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155911
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-16 00:44:44 +00:00
c83041cac2 [test][triton pin] add device-side TMA tests (AOTI + test_triton_kernels) (#155827)
Tests added:
```
python test/inductor/test_triton_kernels.py -k test_on_device_tma
python test/inductor/test_triton_kernels.py -k test_add_kernel_on_device_tma
python test/inductor/test_aot_inductor.py -k test_triton_kernel_on_device_tma
```

These pass on Triton 3.3 but not yet on Triton 3.4 (note: to support tests for both Triton versions, there's two triton kernels - one for old api and one for new api - and a given version of the test will only run if that version of the API is available).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155827
Approved by: https://github.com/FindHao
ghstack dependencies: #155777, #155814
2025-06-15 20:24:19 +00:00
bc9b8ea230 [user triton] JIT inductor support for new host-side TMA api (#155814)
This PR adds JIT inductor support for user-defined triton kernels using the new host-side TMA api.

* handle TensorDescriptor.from_tensor in ir.py
* codegen TensorDescriptor.from_tensor in wrapper.py
* generate the right signature for functions that take TensorDescriptor arguments (i.e. in the @triton_heuristics.user_autotune decorator)

AOTI support is not implemented yet.

Tests: ran test_triton_kernels.py w/ both Triton 3.3 and 3.4 and there were no failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155814
Approved by: https://github.com/aakhundov
ghstack dependencies: #155777
2025-06-15 20:24:19 +00:00
b7c95acc6c [user triton] triton_kernel_wrap support for new host-side TMA API (#155777)
This adds support for user-defined triton kernels using TensorDescriptor.from_tensor into triton_kernel_wrap: i.e. storing metadata about the TMA descriptors and doing mutation analysis.

Major changes:
* TMADescriptorMetadata has changed: previously it was a dict[str, tuple[list[int], list[int], int]]. But now there are two metadata formats: one for experimental API and one for stable API. Now the metadata format is dict[str, tuple[str, tuple[...]]], where tuple[...] is tuple[list[int], list[int], int] for experimental and tuple[list[int],] for stable API. And then most handling of the metadata has to be branched based on whether the metadata represents a stable or experimental TMA descriptor
* mutation analysis: unlike experimental TMA (where the mutation analysis / ttir analysis pretends that the TMA descriptor is actually just a tensor), we need to construct an actual TMA descriptor before getting the Triton frontend to create the TTIR (otherwise assertions fail). A TensorDescriptor (i.e. stable TMA API descriptor) passed into a python triton kernel actually turns into 1 + 2*N parameters in the TTIR (for a rank-N tensor), so the arg list also needs to be patched for this reason (in generate_ttir)
* mutation analysis: now we also need to pass tma_descriptor_metadata into the mutation analysis, in order to create the TMA descriptors that are passed into the frontend code (ie. the previous point). This is why all the mutation tests are modified with an extra return value (the tma_descriptor_metadata)

Inductor is not modified (Inductor just errors out if you use a stable API tma descriptor). This will be the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155777
Approved by: https://github.com/aakhundov
2025-06-15 20:24:19 +00:00
54976bca10 [dynamo] Provide helper functions for guard filter hook (#155083)
Collection of ready-made guard filters. One issue is that they are not composable - `filter1(filter2(guard))`. On the other hand, they are easy to use.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155083
Approved by: https://github.com/zhxchen17, https://github.com/jansel
2025-06-15 17:49:36 +00:00
0935a97d95 [Dynamo] Add torch.accelerator API to trace_rules (#155884)
# Motivation
- Add binding API and non-binding API in torch.accelerator to trace rules.
- Add some function in torch.accelerator to const fold functon list for Dynamo capature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155884
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #155787, #155788
2025-06-15 17:09:57 +00:00
b51d803785 [Dynamo] Add XPU API to trace_rules (#155788)
# Motivation
- Add binding API and non-bindling API to trace rules for XPU;
- Add some XPU API to the const fold function for Dynamo capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155788
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #155787
2025-06-15 17:09:57 +00:00
69acba2b19 [Dynamo] Add generic and XPU-specific Stream&Event in UserDefineClass (#155787)
# Motivation
- Add XPU-specific Stream and Event to in graph calss list for Dynamo capture.
- Add generic Stream and Event to i graph class list for Dynamo capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155787
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-06-15 17:09:57 +00:00
53cd18f6b3 Update gradient behavior note in torch.amin and torch.amax (#155071)
Fixes #155048

The behavior of `min` and `max` were changed in #43519. The note about gradient behavior in torch.amin and torch.amax docs are updated to reflect this change:

New note:
`amax, amin, max(dim), min(dim) evenly distributes gradient between equal values
        when there are multiple input elements with the same minimum or maximum value.`

cc - @spzala @svekars @soulitzer @sekyondaMeta @AlannaBurke @ezyang @gqchen @nikitaved @Varal7 @xmfan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155071
Approved by: https://github.com/soulitzer
2025-06-15 16:09:31 +00:00
655b3b14ff [executorch hash update] update the pinned executorch hash (#156007)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156007
Approved by: https://github.com/pytorchbot
2025-06-15 04:51:37 +00:00
517d2995e0 Add__int__ and __float__ methods to _sympy.functions.Identity (#155873)
Fixes #155688

Root Cause:
in [`torch/_inductor/index_propagation.py`](f151b20123/torch/_inductor/index_propagation.py (L57-L68))
When creating a `TypedExpr` from an `Identity` (a `torch.utils._sympy.functions.Identity`, not a `sympy.matrices.expressions.Identity `) and the inner value of the identity, `Identity.args[0]`, is any torch int type, the `TypedExpr.__post_init__` method tries to cast the Identity object to a python `int`.  This is where to `TypeError` from the issue was raised, because Identity does not know how to cast to an `int`.

Fix:
Define `__int__` method for `torch.utils._sympy.functions.Identity`.
wlog for `float`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155873
Approved by: https://github.com/williamwen42
2025-06-15 04:24:40 +00:00
6ebe9a4f47 [BE][Ez]: Optimize nvshmem alloc with missing move (#156000)
Saw this in another PR where there was a missing move on this potentially very hot path with

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156000
Approved by: https://github.com/kwen2501, https://github.com/cyyever
2025-06-15 03:04:08 +00:00
32eee8ed22 [SymmMem] Add nvshmem_free (#155975)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Calling `nvshmem_free` when an `NVSHMEMAllocation` is being destructed.

Use a `is_finalizing()` as a guard as done in `CUDASymmetricMemory.cu` to avoid "driver shutting down" error (destruction fiasco).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155975
Approved by: https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971
2025-06-15 01:23:49 +00:00
b8aee84fb9 [c10d][fr] Shrink the range of mutex lock to avoid deadlock (#155949)
While looking into a case when FR dump (actual dump not monitoring thread) takes 30 mins, I realized that our global write lock is grabbed too early so the second effort to dump FR without stack trace will fail because of a deadlock because the global write lock is still hold. So we should only grab the lock when we are ready to write so that we are less likely to keep the lock forever. Also I did an audit to the lock within FR as well and found that there is one place we can shrink as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155949
Approved by: https://github.com/Skylion007
2025-06-15 00:37:42 +00:00
3159ee2ad3 Update test_schedule_multiproc to use world_size=2 (#155921)
The multiproc schedule tests previously ran with world_size=2, and PP tests became flakier due to the longer pipeline execution, this is restoring previously behavior. This will fix the tests (https://github.com/pytorch/pytorch/issues/154373, https://github.com/pytorch/pytorch/issues/154391, https://github.com/pytorch/pytorch/issues/154408, https://github.com/pytorch/pytorch/issues/154443, https://github.com/pytorch/pytorch/issues/154481

In follow up PRs I will refactor the tests and move some tests to use large world sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155921
Approved by: https://github.com/fduwjj, https://github.com/Skylion007
ghstack dependencies: #155920
2025-06-15 00:24:18 +00:00
8e1471bdc9 Allow MultiProcContinuousTest to set world_size (#155920)
`MultiProcContinuousTest` will automatically set world_size to number of devices. This change allows this attribute to be modified by the derived test class

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155920
Approved by: https://github.com/fduwjj
2025-06-15 00:24:17 +00:00
9bd42c1570 [Cutlass] Fix buffer missing issues (#155897)
Handles constants and constant folding with aoti.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155897
Approved by: https://github.com/henrylhtsang
2025-06-15 00:08:50 +00:00
a35b3a9b95 [cutlass backend][forward fix] use _cuda_compiler path to check if nvcc exists (#155939)
Differential Revision: D76571828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155939
Approved by: https://github.com/Skylion007, https://github.com/masnesral
2025-06-15 00:01:57 +00:00
eqy
47c8810b52 [cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)
Fixes https://github.com/pytorch/pytorch/issues/153590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154170
Approved by: https://github.com/malfet
2025-06-14 23:34:31 +00:00
0fa361e429 [ez] fix typo in _inductor/scheduler.py (#155996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155996
Approved by: https://github.com/Skylion007
ghstack dependencies: #155982
2025-06-14 21:21:35 +00:00
77ac3a0965 [SymmMem] Remove wrappers around nvshmem APIs (#155971)
`NVSHMEMSymmetricMemory.cu` and `nvshmem_extension.cu` are under the same compilation condition now (i.e. only when `USE_NVSHMEM=True`), see https://github.com/pytorch/pytorch/blob/main/caffe2/CMakeLists.txt#L1013-L1018.

Therefore there is no need to build an extra layer to hide dependency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155971
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968
2025-06-14 19:58:09 +00:00
2c0d94a7de [SymmMem] Remove unused ptr_to_symm_mem_ (#155968)
No code enqueues entries to `ptr_to_symm_mem_`, thus it is always empty.
This PR removes it and supports relying functionalities via the `allocations_` map.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155968
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835
2025-06-14 19:57:06 +00:00
a317c63d1b [BE]: Update NCCL to 2.27.3 (#155233)
Fixes: https://github.com/pytorch/pytorch/issues/155052 and https://github.com/pytorch/pytorch/issues/153517

This upgrade is needed to effectively use those symmetric memory kernels anyway. Also fixes some nasty NCCL bugs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155233
Approved by: https://github.com/nWEIdia, https://github.com/kwen2501, https://github.com/atalman, https://github.com/eqy
2025-06-14 19:20:31 +00:00
794ef6c9b8 Enable manywheel build and smoke test on main branch for ROCm (#153287)
Fixes issue of not discovering breakage of ROCm wheel builds until the nightly job runs e.g. https://github.com/pytorch/pytorch/pull/153253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153287
Approved by: https://github.com/jeffdaily
2025-06-14 19:14:31 +00:00
5285d10243 remove duplicated pybind flag in mps code (#155936)
gcc14 (at least) warns that this is already defined
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155936
Approved by: https://github.com/Skylion007
2025-06-14 18:41:12 +00:00
e95e8eed0a mypy 1.16.0 (#155821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155821
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-06-14 18:18:43 +00:00
ce79056471 Custom FX pass for inductor's backend registration (#154841)
This PR is related to RFC #153532. It is an extension to Inductor's backend registration interface to allow to register custom FX passes by the backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-14 17:29:54 +00:00
603a54a9b3 Unify dynamic shapes APIs naming 2 (expect_true and check) (#155776)
The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function.

This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically,
-  it introduces size_vars.expect_true and size_vars.check.
- guard_lt becomes check_lt
- guard_leq becomes check_leq
- guard_equals becomes check_equals

I am also seeing a couple of wrong usages !! that i will fix  in the next PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155776
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154774
2025-06-14 17:13:53 +00:00
c219dbd2fc avoid gso in has_internal_overlap (#155870)
existing comment already explains it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155870
Approved by: https://github.com/bobrenjc93
2025-06-14 17:13:20 +00:00
279cae52e7 [BE][PYFMT] migrate PYFMT for torch/ao/ to ruff format (#148185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148185
Approved by: https://github.com/ezyang
2025-06-14 16:47:04 +00:00
cyy
c2beeadeb4 [Reland] Use 3.27 as the minimum CMake version (#154783)
Reland of #153153, which was incidentally closed.
Update the minimum CMake version to 3.27 because of it provides more CUDA targets such as CUDA::nvperf_host so that it is possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It's also possible to facilitate future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
2025-06-14 16:37:51 +00:00
370fc49dde Handle aten.to at submodule boundaries (#153972)
Summary: #buildall

Test Plan: CI

Differential Revision: D74582970

When we decompose to inference IR, aten.to can sometimes disappear. As a result, export module call graph tree will start containing dead nodes because previous provenance tracking is insufficient. This PR fixes that. The caveat is that this won't work in general for tensor subclass inputs to submodule that user wants to preserve signature because we always desugar the tensor subclass into constituent tensors in inference IR making it impossible to preserve the original calling convention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153972
Approved by: https://github.com/avikchaudhuri
2025-06-14 16:13:29 +00:00
d42c11819f [executorch hash update] update the pinned executorch hash (#153436)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153436
Approved by: https://github.com/pytorchbot
2025-06-14 16:09:41 +00:00
70b68caf58 Fix logging of failed tensorified ops (#155982)
Tested via

```
TORCH_LOGS="torch.fx.passes._tensorify_python_scalars" tlp python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unspecialized_float_fallback_symint_specialization
I0613 21:50:38.247000 4163366 torch/fx/passes/_tensorify_python_scalars.py:314] [0/1] Failed to tensorify <built-in function pow>
I0613 21:50:38.247000 4163366 torch/fx/passes/_tensorify_python_scalars.py:314] [0/1] Failed to tensorify <built-in function floor>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155982
Approved by: https://github.com/flaviotruzzi
2025-06-14 14:23:54 +00:00
e375d21bb9 [Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)
Fixes #154328

**Summary**
Fail reason:
The input value is infinity in float and it has undefined behavior to convert it to int64_t. On X86, it will be converted to the min value of int64_t, which is not expected.

Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.

**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-06-14 14:12:38 +00:00
1a568f4e5d [BE][Easy] bump isort to 6.0.1 (#155919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155919
Approved by: https://github.com/Skylion007
ghstack dependencies: #155909, #155914
2025-06-14 12:29:01 +00:00
5467765990 [BE][Easy] bump ruff to 0.11.13 (#155914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155914
Approved by: https://github.com/Skylion007
ghstack dependencies: #155909
2025-06-14 12:29:01 +00:00
736a15a81a [torchgen] Fix ruff format for # fmt: skip comment for function signature (#155909)
See also:

- astral-sh/ruff#18658

This fix follows the suggestion from:

- https://github.com/astral-sh/ruff/issues/18658#issuecomment-2970130276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155909
Approved by: https://github.com/ezyang
2025-06-14 12:28:55 +00:00
d859e65826 [DCP][Ez]: Fix broadcast_object bug in DCP utils (#155912)
Fixes #152310. Broadcast_object is now symmetric with gather_object and scatter_object. It was likely a typo that wasn't fixed in https://github.com/pytorch/pytorch/pull/147675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155912
Approved by: https://github.com/ezyang
2025-06-14 12:14:14 +00:00
596b418391 [BE][PYFMT] migrate PYFMT for {torch,test}/{nn,optim}/** to ruff format (#144548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144548
Approved by: https://github.com/ezyang
2025-06-14 11:27:04 +00:00
3e38feb05f [inductor] Add configuration control for CUTLASS operation selection. (#155770)
Added a new configuration option `cutlass_enabled_ops` that allows users to control which operations use CUTLASS lowerings. By default, CUTLASS is enabled for all operations (maintaining backward compatibility), but users can now selectively enable it only for specific operations to optimize compilation time.

**Fixes #155718**

## Usage Examples

```bash
# Enable CUTLASS for all operations (default behavior)
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="ALL"

# Enable CUTLASS only for matrix multiplication operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="mm,addmm"

# Enable CUTLASS only for batch operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="bmm,baddbmm"

# Disable CUTLASS for all operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS=""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155770
Approved by: https://github.com/henrylhtsang
2025-06-14 08:19:54 +00:00
1982ec2d22 Add api info for torch._C._nn.pyi (#148405)
APis involved are as followed:

- adaptive_avg_pool2d
- adaptive_avg_pool3d
- binary_cross_entropy
- col2im

ISSUE Related:
https://github.com/pytorch/pytorch/issues/148404
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148405
Approved by: https://github.com/ezyang
2025-06-14 07:57:07 +00:00
7070ab3180 use guard_or_false in checkInBoundsForStorage (#155874)
this was added in https://github.com/pytorch/pytorch/pull/147354, the comment already justify guard_or_false
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155874
Approved by: https://github.com/bobrenjc93
2025-06-14 07:21:26 +00:00
d79651571f assume sparse tensor not coalesced_ gsv -> guard_or_false. (#155869)
preserve current behavior. Generalize it such that no need for torch._check_is_size to opt into this,
and make it work for more complex unbacked sizes with ranges [-inf, inf]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155869
Approved by: https://github.com/bobrenjc93
2025-06-14 07:19:56 +00:00
e7da21806f [Easy][BE] update recommanded VS Code settings (#152760)
Changes:

- Remove old invalid settings and replace with new settings.
- Add commonly used VS Code extensions to support `cmake`, `ruff`, `mypy`, `flake8`, `editorconfig`, and spell checker. Also, add corresponding settings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152760
Approved by: https://github.com/drisspg
2025-06-14 07:11:10 +00:00
cyy
1393f71e07 Use CUDA language in generated CMakeLists.txt from cpp_builder.py (#155979)
The CMake CUDA module has been deprecated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155979
Approved by: https://github.com/ezyang
2025-06-14 06:52:51 +00:00
c843909d9e [flex attention][triton pin] use new TMA API (#155771)
Triton 3.4 will remove the experimental TMA APIs: https://github.com/triton-lang/triton/pull/6488. Ahead of this, we are **replacing the experimental TMA API usage with the stable TMA API** in flex attention. This means that **flex attention TMA will stop working with Triton 3.2 or Triton 3.3/3.3.1** for now (but it should work for Triton 3.4 in the PyTorch 2.8 release, and Meta-internal triton 3.3.1fb, which have the new TMA API).

This PR does the following:
* replace the experimental TMA APIs with the stable TMA APIs
* remove the workspace args.

Testing: I ran test/inductor/test_flex_attention.py on a H100 with @mandroid6's PR #153662 patched in to turn on TMA [TODO: confirm results once all the local tests pass, but from the first 100 tests I ran locally, all the failing tests were also failing on #153662 alone]

Note: When #153662 lands, turning on TMA support by default, it should be checking specifically for stable TMA API support (commented on PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155771
Approved by: https://github.com/mandroid6, https://github.com/nmacchioni
2025-06-14 06:34:16 +00:00
92b7ed6d07 Add Helion softmax test (#155976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155976
Approved by: https://github.com/jansel
2025-06-14 05:53:21 +00:00
9338d85d45 [ProcessGroupNCCL] Added log when fr dump triggered from pipe (#155754)
Summary:
TSIA

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
eyes

Sandcastle run

Differential Revision: D76472617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155754
Approved by: https://github.com/fduwjj, https://github.com/Skylion007
2025-06-14 04:34:29 +00:00
bf897b4cea [ONNX] Support 0/1 on dynamic dimension (#155717)
Previous to this PR, the exporter does not support dynamic dim with traced inputs containing 0/1. But after https://github.com/pytorch/pytorch/pull/148696, this is supported by torch.export.export. This PR adds the patch to torch.onnx.export.

However, there is still known pitfall existing because the difference between eager and export. Compiler needs to decide the exported shape ahead, and whether the "hidden broadcasting" being applied results in different export.

For example,

```python
import torch

class Model(torch.nn.Module):
    def forward(self, x, y, z):
        return torch.cat((x, y), axis=1) + z

model = Model()
x = torch.randn(2, 3)
y = torch.randn(2, 5)
z = torch.randn(1, 8)
model(x, y, z)

DYN = torch.export.Dim.DYNAMIC
ds = {0: DYN, 1: DYN}

with torch.fx.experimental._config.patch(backed_size_oblivious=True):
    ep = torch.export.export(model, (x, y, z), dynamic_shapes=(ds, ds, ds))

print(ep)
"""
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[s7, s16]", y: "f32[s7, s43]", z: "f32[s7, s16 + s43]"):
             #
            sym_size_int: "Sym(s7)" = torch.ops.aten.sym_size.int(x, 0)
            sym_size_int_1: "Sym(s16)" = torch.ops.aten.sym_size.int(x, 1)
            sym_size_int_2: "Sym(s7)" = torch.ops.aten.sym_size.int(y, 0)
            sym_size_int_3: "Sym(s43)" = torch.ops.aten.sym_size.int(y, 1)
            sym_size_int_4: "Sym(s7)" = torch.ops.aten.sym_size.int(z, 0)
            sym_size_int_5: "Sym(s16 + s43)" = torch.ops.aten.sym_size.int(z, 1)

             # File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
            cat: "f32[s7, s16 + s43]" = torch.ops.aten.cat.default([x, y], 1);  x = y = None

             #
            eq: "Sym(True)" = sym_size_int_2 == sym_size_int;  sym_size_int_2 = None
            _assert_scalar_default = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s58, s35) on node 'eq'");  eq = _assert_scalar_default = None
            add_1: "Sym(s16 + s43)" = sym_size_int_1 + sym_size_int_3;  sym_size_int_1 = sym_size_int_3 = None
            eq_1: "Sym(True)" = add_1 == sym_size_int_5;  add_1 = sym_size_int_5 = None
            _assert_scalar_default_1 = torch.ops.aten._assert_scalar.default(eq_1, "Runtime assertion failed for expression Eq(s16 + s43, s23) on node 'eq_1'");  eq_1 = _assert_scalar_default_1 = None
            eq_2: "Sym(True)" = sym_size_int == sym_size_int_4;  sym_size_int = sym_size_int_4 = None
            _assert_scalar_default_2 = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(s35, s7) on node 'eq_2'");  eq_2 = _assert_scalar_default_2 = None

             # File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
            add: "f32[s7, s16 + s43]" = torch.ops.aten.add.Tensor(cat, z);  cat = z = None
            return (add,)

Graph signature:
    # inputs
    x: USER_INPUT
    y: USER_INPUT
    z: USER_INPUT

    # outputs
    add: USER_OUTPUT

Range constraints: {s7: VR[0, int_oo], s16: VR[0, int_oo], s43: VR[0, int_oo], s16 + s43: VR[0, int_oo]}
"""
ep.module()(x, y, z)
"""
Traceback (most recent call last):
  File "/home/titaiwang/pytorch/test_export.py", line 20, in <module>
    ep.module()(x, y, z)
  File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 840, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 416, in __call__
    raise e
  File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 403, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1873, in _call_impl
    return inner()
           ^^^^^^^
  File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1800, in inner
    args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/_dynamo/eval_frame.py", line 895, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/export/_unlift.py", line 83, in _check_input_constraints_pre_hook
    _check_input_constraints_for_graph(
  File "/home/titaiwang/pytorch/torch/_export/utils.py", line 426, in _check_input_constraints_for_graph
    _check_symint(
  File "/home/titaiwang/pytorch/torch/_export/utils.py", line 338, in _check_symint
    raise RuntimeError(
RuntimeError: Expected input at *args[2].shape[0] to be equal to 2, but got 1
"""
```

The explanation (from @pianpwk):

In the model we have `return torch.cat((x, y), axis=1) + z`.

Before this add is executed, the LHS has shape `[s7, s16 + s43]`, while the z has shape, say `[s8, s16 + s43]` (we don't know `s7 == s8` yet). When we execute this add, the compiler is making a decision: does broadcasting apply or not? The choices are:

1) Yes -> then we must specialize `s8` to 1
2) No -> then this element-wise op is only valid if the shapes match up, and we assume `s7 == s8`.

Unfortunately export can only follow one of these options, and in avoiding 0/1 specialization (because a dynamic dimension was requested), it assumed case 2).

For an operation like a + b, in eager semantics it's possible to have all options (either a == 1 OR b == 1 OR a == b), but with export we need to make a decision on what the output shape of this operation is, and keeping all branches alive requires expressing the output shape with a conditional (e.g. output shape = `a if b == 1 else b`), which is pretty hard for the compiler to reason about.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155717
Approved by: https://github.com/justinchuby
2025-06-14 04:04:47 +00:00
187828dcb4 [OpenReg][5/N] add set_.source_Storage for openreg (#155191)
**Changes**:
- add set_.source_Storage for openreg to support torch.load & torch.serialization
- uncomment some related tests in the test_openreg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155191
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019, #154106, #154181, #155101
2025-06-14 03:44:32 +00:00
e4fd0bf771 [OpenReg][4/N] Migrate cpp_extensions_open_device_registration to OpenReg (#155101)
As the title stated.

**Involved testcases**:
- test_open_device_storage_pin_memory
- test_open_device_serialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155101
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019, #154106, #154181
2025-06-14 03:44:32 +00:00
1e7989cad5 [OpenReg][3/N] Migrate cpp_extensions_open_device_registration to OpenReg (#154181)
As the title stated.

**Involved testcases**:
- test_open_device_quantized
- test_open_device_random
- test_open_device_tensor
- test_open_device_packed_sequence
- test_open_device_storage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154181
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019, #154106
2025-06-14 03:44:32 +00:00
7e5f29b2de [OpenReg][2/N] Migrate cpp_extensions_open_device_registration to OpenReg (#154106)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154106
Approved by: https://github.com/nareshrajkumar866, https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019
2025-06-14 03:44:32 +00:00
676abded4b [OpenReg][1/N] Migrate cpp_extensions_open_device_registration to OpenReg (#154019)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154019
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018
2025-06-14 03:44:32 +00:00
d3d469092f [Openreg] Split TestOpenReg into two parts (#154018)
----

- TestPrivateUse1: testing 3rd accelerator integration mechinasm itself
- TestOpenReg: testing openreg itself
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154018
Approved by: https://github.com/albanD
ghstack dependencies: #153947
2025-06-14 03:44:31 +00:00
cafd2344d6 [OpenReg] add manual_seed related capabilities (#153947)
**Changes**:
- Add manual_seed manual_seed_all initial_seed and so on
- Delay execution of self._lazy_init more deeply
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153947
Approved by: https://github.com/albanD
2025-06-14 03:44:31 +00:00
297805fd8f Typo fixes for "overridden" in comments and function names (#155944)
This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944
Approved by: https://github.com/Skylion007
2025-06-14 03:37:38 +00:00
ca3cabd24a Convert to markdown: named_tensor.rst, nested.rst, nn.attention.bias.rst, nn.attention.experimental.rst, nn.attention.flex_attention.rst #155028 (#155696)
Fixes #155028

This pull request updates the documentation  by transitioning from .rst to .md format. It introduces new Markdown files for the documentation of named_tensor, nested, nn.attention.bias, nn.attention.experimental, and nn.attention.flex_attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155696
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-14 03:32:00 +00:00
cdfa33a328 [nativert] move execution frame to torch (#155830)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76369008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155830
Approved by: https://github.com/zhxchen17
2025-06-14 03:28:55 +00:00
a6084b71ed [BE][1/X] Phase out usage of use_max_autotune() (#155847)
`use_max_autotune()` is likely not what people expect it to be;

Originally, `use_max_autotune()` was setup to decide when we should include Triton templates as choices in GEMM autotuning. As expected, `use_max_autotune()=True` if `max_autotune=True` or `max_autotune_gemm=True`. However, with the addition of the offline GEMM autotuning cache two years back `use_max_autotune()=True` also in the case that `search_autotune_cache=True`; in this case though, `search_autotune_cache=True` should never trigger autotuning.

Over time, people have used `use_max_autotune()` likely without realizing that this gives unexpected behavior if `search_autotune_cache=True`. We could rename the method to be more clear, but prefer to phase it out entirely for maximal clarity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155847
Approved by: https://github.com/jingsh, https://github.com/masnesral
2025-06-14 03:16:20 +00:00
7982b8c703 [BE][AOTI] Remove duplicate schema for ExternKernelNode (#155867)
Summary: The definition of `ExternKernelNode` and `ExternKernelNodes` schema in `torch/_export/serde/aoti_schema.py` is a complete duplicate of the ones in `torch/_export/serde/schema.py`.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76558294

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155867
Approved by: https://github.com/jingsh
2025-06-14 02:03:27 +00:00
8f5f01bf19 [BE][AOTI] Combine DynamicArgType in Proxy Executors (#155871)
Summary:
As title.

Move the duplicate definition to the base class header `proxy_executor.h`

Test Plan:
CI

Rollback Plan:

Differential Revision: D76559180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155871
Approved by: https://github.com/yushangdi
2025-06-14 01:52:43 +00:00
4574b39aa4 Revert "[BE]: Sync cusparselt 12.9 with static build and other cuda 12 (#155709)"
This reverts commit bbbced94a43cf764ddfe719e7d4c161a3992830c.

Reverted https://github.com/pytorch/pytorch/pull/155709 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/15645591737/job/44082402642) [HUD commit link](bbbced94a4) landrace with 155819? easy forward fix but its the end of the week so idk when id get a review ([comment](https://github.com/pytorch/pytorch/pull/155709#issuecomment-2972094849))
2025-06-14 01:43:16 +00:00
c10339559d [BE] Better uv detection in pip init (#155972)
If one has some UV and non-UV environments locally, one shoudl call `uv
pip install` only on the UV-enabled ones, which could be detected by
checking if `uv/python` path is present in `sys.base_prefix`

Fixes https://github.com/pytorch/pytorch/issues/152999
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155972
Approved by: https://github.com/janeyx99
2025-06-14 01:35:50 +00:00
d7e3c9ce82 Revert "Enable manywheel build and smoke test on main branch for ROCm (#153287)"
This reverts commit 3b6569b1ef4b9ff25f5b75fe0a216d6d084d573f.

Reverted https://github.com/pytorch/pytorch/pull/153287 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/15646152483/job/44083912145) [HUD commit link](3b6569b1ef) ([comment](https://github.com/pytorch/pytorch/pull/153287#issuecomment-2972088294))
2025-06-14 01:32:27 +00:00
c165b36a31 [MTIA Aten Backend] Migrate relu / relu_ (#155927)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate relu / relu_.

Note: Pytorch in-tree implementation delegates relu to clamp_min, so no more need to launch relu kernel.
https://www.internalfb.com/code/fbsource/[0c9eedb2fc8f99bcca00cb67a5738cfe07e39349]/fbcode/caffe2/aten/src/ATen/native/Activation.cpp?lines=512-520

Let me know if any concern about this

Differential Revision: [D75803582](https://our.internmc.facebook.com/intern/diff/D75803582/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155927
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632, #154659, #155925, #155926
2025-06-14 01:24:48 +00:00
50f6431e0a [MTIA Aten Backend] Migrate sqrt.out / rsqrt.out / sin.out / silu.out (#155926)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate sqrt.out / rsqrt.out / sin.out / silu.out

Differential Revision: [D75801847](https://our.internmc.facebook.com/intern/diff/D75801847/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155926
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632, #154659, #155925
2025-06-14 01:24:48 +00:00
7b11cb8c12 [MTIA Aten Backend] Migrate tanh.out and tanh_backward.grad_input (#155925)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate tanh.out and tanh_backward.grad_input

Differential Revision: [D75769242](https://our.internmc.facebook.com/intern/diff/D75769242/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155925
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632, #154659
2025-06-14 01:24:41 +00:00
0185d3a5ed [MTIA Aten Backend] Migrate bitwise_or.Tensor_out (#154659)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate bitwise_or.Tensor_out from out-of-tree to in-tree.

Differential Revision: [D75629937](https://our.internmc.facebook.com/intern/diff/D75629937/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154659
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632
2025-06-14 01:24:34 +00:00
163cdaaa3a [MTIA Aten Backend] Migrate bitwise_not.out (#154632)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate bitwise_not.out from out-of-tree to in-tree.

Differential Revision: [D75610643](https://our.internmc.facebook.com/intern/diff/D75610643/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154632
Approved by: https://github.com/egienvalue
2025-06-14 01:24:27 +00:00
04cf2c9d24 fix tensor print behavior for MAIA (#155609)
This pull request fixes the tensor print behavior for `MAIA` to account for the absence of double-precision support in its backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155609
Approved by: https://github.com/soulitzer
2025-06-14 01:04:12 +00:00
dabb55baff Add resolve in add decomp to enable view (#153945)
Fixes #148950.

During the construction of graph and running the node of add under [interpreter](/github.com/pytorch/pytorch/blob/d68d4d31f4824f1d1e0d1d6899e9879ad19b0754/torch/fx/interpreter.py#L301
), the functional argument of conj complex tensor gets cloned. This result in always having *.is_conj()* evaluted to false in decomposition function.

Propose a fix of calling resolve_conj() in the decomposition of complex tensor add.

Test as below
`python test/dynamo/test_repros.py ReproTests.test_add_complex_conj`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153945
Approved by: https://github.com/jansel
2025-06-14 00:41:50 +00:00
fec571cfd4 [BE][CI] Remove hardshrink integer exclusions (#155965)
As they are not called anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155965
Approved by: https://github.com/dcci
2025-06-14 00:32:57 +00:00
38410cf9b5 Fix DDPOptimizer issue on static tensor index (#155746)
We rely on `_try_get_metadata_from_dynamo()` to get static input indices. When the meta info is missing, it just returns an empty list of static input indices. This wrong list of static input indices lead to repeated cudagraph re-recording, which looks like a hang from the user perspective. bc3972b80a/torch/_functorch/aot_autograd.py (L1025-L1031)

The root cause is `split_module` in DDP Optimizer loses meta info and gm attributes. This PR fixes the issue by propagating these metadata from original module to submodules.
bc3972b80a/torch/_dynamo/backends/distributed.py (L515-L517)

Fixes #140395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155746
Approved by: https://github.com/xmfan, https://github.com/bdhirsh
2025-06-14 00:15:58 +00:00
3b6569b1ef Enable manywheel build and smoke test on main branch for ROCm (#153287)
Fixes issue of not discovering breakage of ROCm wheel builds until the nightly job runs e.g. https://github.com/pytorch/pytorch/pull/153253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153287
Approved by: https://github.com/jeffdaily
2025-06-14 00:05:57 +00:00
bbbced94a4 [BE]: Sync cusparselt 12.9 with static build and other cuda 12 (#155709)
followup for https://github.com/pytorch/pytorch/pull/154980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155709
Approved by: https://github.com/tinglvv, https://github.com/atalman, https://github.com/nWEIdia, https://github.com/cyyever
2025-06-13 23:10:01 +00:00
d512584718 [BE] Refactor clamp dtypes check (#155930)
By introducing `check_for_unsupported_clamp_dtypes` similar to `check_for_unsupported_isin_dtypes`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155930
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/clee2000
ghstack dependencies: #155470
2025-06-13 23:05:02 +00:00
0cb85c188f [BE] Move optional submodules checkout to its own module (#155947)
To expand it to optional eigen checkout later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155947
Approved by: https://github.com/Skylion007
2025-06-13 23:02:38 +00:00
3003c681ef Converting .rst files to .md files (#155377)
Fixes #155036
This pull request updates the documentation for several modules by transitioning from .rst to .md format, improving readability and usability. It introduces new Markdown files for the documentation of torch.ao.ns._numeric_suite, torch.ao.ns._numeric_suite_fx, AOTInductor, AOTInductor Minifier, and the torch.compiler API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155377
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 22:54:27 +00:00
799443605b Convert to markdown: distributed.tensor.parallel.rst, distributed.tensor.rst, distributions.rst, dlpack.rst (#155297)
Fixes #155019

## Description
Convert to markdown: distributed.tensor.parallel.rst, distributed.tensor.rst, distributions.rst, dlpack.rst

## Checklist
- [X] dlpack.rst converted to dlpack.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/dlpack.html)
- [X] distributions.rst converted to distributions.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributions.html)
- [X] distributed.tensor.rst converted to distributed.tensor.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributed.tensor.html)
- [X] distributed.tensor.parallel.rst converted to distributed.tensor.parallel.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributed.tensor.parallel.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155297
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 22:08:37 +00:00
764c02b78b [BE] Raise NotImplementedError (#155470)
When op is unimplemented for a specific dtype

Which makes more sense, than a RuntimeError

Example
```python
>>> import torch
>>> torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
NotImplementedError: "hardshrink_cpu" not implemented for 'Long'
```

release notes bc-breaking: After this release `NotImplementedError` exception will be raised when ATen operation is called on the combinaiton of input tensor dtypes it has not been implemented for

Mark few more unary ops as unimplemented to satisfy foreach testing error reporting consistency between CPU and CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155470
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-06-13 22:07:03 +00:00
d59ed21d0f [CI] Reuse old whl: track why failed to use the old whl (#155860)
As in title
Any other things I should track?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155860
Approved by: https://github.com/malfet
2025-06-13 22:01:31 +00:00
3596c0c77f Fix test after revert (#155946)
ex
test_dynamic_shapes.py::TestUbackedOps::test_unbacked_reshape2 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15642199583/job/44073674212) [HUD commit link](06408dae49)

started after 06408dae49d06b6146fdd9d7a37eb5dde4f5e78d

idk what the test does so maybe theres a better way to fix this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155946
Approved by: https://github.com/yangw-dev, https://github.com/huydhn, https://github.com/malfet
2025-06-13 21:52:07 +00:00
eef253d9f6 [CI] Keep going display on HUD: upload log when test fails (#155371)
I guess this is more of an RFC

Goal:
Enable keep going so that we can get information immediately for failures.  We want be aware of failures as soon as possible, especially on the main branch, this is so that reverts can happen quickly.

Proposal:
A job with `keep-going` will continue through errors in `python run_test.py`.  If a test fails, before it runs the next test, it will upload a fake log that should have enough information in it so that viewing the log will be able to tell you what failed and any stack traces/error logs, and should be able to be parsed by log classifier to get a line.

I am getting the log by concating the test logs in test/test-reports, which is all the text outputted by pytest (unless someone runs with `ci-verbose-test-logs` label).  There are obviously many things this won't catch, ex output outside of run_test.py, some output inside of run_test.py, but it should be enough.

After a log finishes, eventually its raw log is uploaded to ossci-raw-job-status s3 bucket and the log classifier will read it to do classification.  This means we will have to change log classifier to read from this bucket as well.
I'm thinking just add an input parameter to log classifier like https://github.com/pytorch/test-infra/pull/6723/files
Also upload the temp results to a temp attribute instead of the real one

To overwrite the conclusion on HUD, I'm thinking a lambda that is s3 put trigger on the fake log being put into s3, that does something similar to log classifier where it just mutates the entry 13a990b678/aws/lambda/log-classifier/src/network.rs (L85) to add a new field like "will_fail": true, and also triggers the log classifier to run

Then we change HUD/ClickHouse to point the raw log url to the alternate place, the new "will_fail" field as the conclusion, and the temp log classifier result if needed

Why always write to temp attribution/column? I am unsure about overwriting the real results with fake ones

Pros:
Not many changes outside of HUD/UI

Cons:
Lots of moving parts, lots of temp fields that will require adjustment for queries, temp fields never really get deleted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155371
Approved by: https://github.com/malfet
2025-06-13 21:21:55 +00:00
e5ed267f83 Update h100-distributed image (#155861)
Move non inductor workflows cuda 12.6->cuda 12.8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155861
Approved by: https://github.com/seemethere
2025-06-13 21:17:05 +00:00
20a74c370b Add error message with assert to topK if ndims() - dim > 4 (#155475)
Addressing #154890

Not really a proper fix but at least it's more informative than the current crash.

For a more long term solution I'm testing if we can use the TopK API released in MacOS14 as it does not have the same MPSScan op issue that the Sort and ArgSort are hitting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155475
Approved by: https://github.com/kulinseth
2025-06-13 21:10:06 +00:00
049dc48d1e fix code chunk indentation for jit_language_reference_v2.md (#155937)
Fixes https://github.com/pytorch/pytorch/issues/155023
Related PR: #155781

Description:
As discussed, this PR is a follow-up update for `jit_language_reference_v2.md` by deleting the code chunk indentation.

Checklist:

- [x]  The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x]  Only one issue is addressed in this pull request
- [x]  Labels from the issue that this PR is fixing are added to this pull request
- [x]  No unnecessary issues are included into this pull request.

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label "module: docs"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155937
Approved by: https://github.com/jingsh, https://github.com/svekars
2025-06-13 21:05:23 +00:00
731351bb4a Convert rst to markdown - optim.rst #155031 (#155813)
Fixes #155031
![image](https://github.com/user-attachments/assets/36507ca1-eb1e-4358-9e66-ce25ec8a2be1)

@pytorchbot label "docathon-h1-2025" "module: docs" "topic: not user facing" "topic: docs"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155813
Approved by: https://github.com/AlannaBurke
2025-06-13 21:03:39 +00:00
92388bb2ab [export] Remove broken check for multiple cpp files in PT2 package (#155149)
This check was recently added, but (when fixed to refer to CPP rather than library files) fails with the separate kernel and wrapper build of AOTInductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155149
Approved by: https://github.com/angelayi
2025-06-13 21:02:31 +00:00
7d1b3f599d [Docs] Convert to markdown cond.rst, config_mod.rst (#155653)
Related to #155014

Only included 2 files in this PR:

- cond.rst
- config_mod.rst

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155653
Approved by: https://github.com/svekars
2025-06-13 20:58:57 +00:00
fdf5d97fa8 [cutlass backend][ez] Log timings from prescreening (#155757)
Differential Revision: [D76474669](https://our.internmc.facebook.com/intern/diff/D76474669/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155757
Approved by: https://github.com/ColinPeppler
2025-06-13 20:44:04 +00:00
f3e6c8e834 Fix #155016 for Docathon - convert rst to markdown (#155198)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

One note is that "Created On" and "Last Updated On" banner doesn't show in the markdown files... I'm not sure if that's just an artifact of my local build though.

Fixes #155016

Docs comparison (check out the 'new' whenever docs build)

1. cuda ([old](https://docs.pytorch.org/docs/main/cuda.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155198/cuda.html))
2. cuda.tunable ([old](https://docs.pytorch.org/docs/main/cuda.tunable.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155198/cuda.tunable.html))
3. leave cudnn_persistent_rnn.rst as is because it's reused in docstrings
4. cudnn_rnn_determinism.rst as is because it's reused in docstrings.
5. data ([old](https://docs.pytorch.org/docs/main/data.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155198/data.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155198
Approved by: https://github.com/albanD, https://github.com/svekars
2025-06-13 20:24:34 +00:00
bf798a2f01 Change _hfstorage to hfstorage (#155837)
Summary: Change HF classes to not have an underscore, there-by making them public, we will add documentation to them following this

Test Plan:
ensure existing tests pass

Rollback Plan:

Differential Revision: D76364024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155837
Approved by: https://github.com/saumishr
2025-06-13 20:19:51 +00:00
77f884c2ec Optimize Tensor.backward type hints (#155656)
Fixes #81963

## Test Result

![image](https://github.com/user-attachments/assets/67380fdc-73c4-43d8-b2a5-5e16d63f4fd3)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155656
Approved by: https://github.com/soulitzer
2025-06-13 19:16:48 +00:00
06408dae49 Revert "Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757)"
This reverts commit 0029259bdfeee627181df2b9f5ff6979f65090ec.

Reverted https://github.com/pytorch/pytorch/pull/154757 on behalf of https://github.com/laithsakka due to post land issue ([comment](https://github.com/pytorch/pytorch/pull/154757#issuecomment-2971385787))
2025-06-13 19:11:43 +00:00
4628f1b7a9 [Hierarchical-Compile] Track mutations for setitem (#155880)
This fixes a bug in tensor variable where we would not do things like set the example value on setitem nodes (but these don't typically have users so it doesn't matter)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155880
Approved by: https://github.com/anijain2305
2025-06-13 18:59:31 +00:00
344731fb25 Add CUDA 12.9.1 sbsa nightly binaries (#155819)
https://github.com/pytorch/pytorch/issues/155196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155819
Approved by: https://github.com/atalman
2025-06-13 18:52:41 +00:00
ce44877961 [c10d][PGNCCL] Make watchdog thread a class (#155831)
By extracting both monitor thread and watchdog thread into a separate class this will help us learn what dependencies we have for each thread and it will kind of simplify the consolidation work for each thread (consolidating from thread per PG instance to per PG class)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155831
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2025-06-13 18:05:22 +00:00
c5d00e150a convert: rst to myst pr 1/2 (#155840)
Fixes #155038
parent [PR](https://github.com/pytorch/pytorch/pull/155375) (made two PRs to pass sanity check)
this PR converts the following two .rst files
- [torch.compiler_dynamo_overview](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_dynamo_overview.rst)
- [torch.compiler_fake_tensor](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_fake_tensor.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155840
Approved by: https://github.com/sekyondaMeta
2025-06-13 18:02:28 +00:00
36bf81e363 [BE] Fix minifier when one has multiple Python runtimes (#155918)
By using `sys.executable` instead of `"python"`

Otherwise, it fails on Ubuntu with `python not found` error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155918
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/zou3519
2025-06-13 17:55:04 +00:00
093aaccae2 convert jit_language_reference_v2.rst to jit_language_reference_v2.md (#155781)
Fixes https://github.com/pytorch/pytorch/issues/155023

Description:
converted `jit_language_reference_v2.rst` to `jit_language_reference_v2.md`
**I indented the code blocks to minimize the file difference to pass the sanity check for no more than 2000 lines of change. I will submit another PR to fix the indentation after this PR is merged.**

Checklist:

- [x]  The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x]  Only one issue is addressed in this pull request
- [x]  Labels from the issue that this PR is fixing are added to this pull request
- [x]  No unnecessary issues are included into this pull request.

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155781
Approved by: https://github.com/svekars
2025-06-13 17:33:10 +00:00
f0bee87eea [xla hash update] update the pinned xla hash (#155779)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155779
Approved by: https://github.com/pytorchbot
2025-06-13 17:13:37 +00:00
1f3cc4875c [ATen][CUDA][cuSOLVER] Add cusolverDnXsyevBatched for torch.linalg.eigh (#155695)
This PR add a new API for SYEV operation of cuSOLVER [`cusolverDnXsyevBatched`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdnxsyevbatched) which is a new alternative to [`cusolverDn<t>syevjBatched`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdn-t-syevjbatched). This API was introduced in cuSOLVER as part of 64-bit API in CUDA Tool Kit 12.6.2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155695
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-06-13 17:12:26 +00:00
b6add8c8ba [MPSInductor] Fix remainder implementation for int types (#155891)
Introduce `c10:🤘:remainder` and call it from both inductor and eager implementation, with integer specialization, which should make it much faster than before, while still compliant with Python way of rounding up negative numbers.

This allows one to remove complex type detection logic from mps codegen and rely on Metal(C++) type system to figure out input and output types.

This fixes compilation of something like
```python
@torch.compile
def f(x, y):
    return x[y % 5]
```

which beforehand failed to compile with
```
torch._inductor.exc.InductorError: SyntaxError: failed to compile
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant long* in_ptr0,
        constant float* in_ptr1,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp0 = in_ptr0[x0];
        auto tmp1 = 12;
        auto tmp2 = static_cast<float>(tmp0) - static_cast<float>(tmp1) * metal::floor(static_cast<float>(tmp0) / static_cast<float>(tmp1));
        auto tmp3 = 1024;
        auto tmp4 = static_cast<long>(tmp3);
        auto tmp5 = tmp2 + tmp4;
        auto tmp6 = tmp2 < 0;
        auto tmp7 = tmp6 ? tmp5 : tmp2;
        if ((tmp7 < 0) && (tmp7 > 1024)) return;
        auto tmp9 = in_ptr1[tmp7];
        out_ptr0[x0] = static_cast<float>(tmp9);
    }
 with program_source:372:28: error: array subscript is not an integer
        auto tmp9 = in_ptr1[tmp7];
                           ^~~~~
```

This fixes fail_to_compile for GPT2ForSequenceClassification Huggingface model using `transformers==4.44.2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155891
Approved by: https://github.com/manuelcandales
2025-06-13 16:42:56 +00:00
9462106b7e [nativert] Move graph_passes to nativert (#155411)
Summary: Move graph_passes to nativert

Test Plan:
CI

Rollback Plan:

Differential Revision: D76205048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155411
Approved by: https://github.com/zhxchen17
2025-06-13 16:41:01 +00:00
338a8c7853 fix slice w/ dynamic shapes (#153131)
Summary: guard_size_oblivious has side effects that'll result in invalid strides when slice nodes take negative index on dynamic input shapes.
Cause overflow error with a huge number “9223372036854776048”
Test Plan: CIs should pass.

Differential Revision: D74354663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153131
Approved by: https://github.com/laithsakka
2025-06-13 15:53:17 +00:00
a5938ff431 [BE][c10d/Store]add check in pyi (#155855) (#155865)
Summary:

"check" is already binded https://fburl.com/code/9lx1zf9o
which is also documented in https://docs.pytorch.org/docs/stable/distributed.html
add it to pyi for type checking

Test Plan:
skip

Rollback Plan:

Differential Revision: D76547457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155865
Approved by: https://github.com/fduwjj
2025-06-13 15:39:27 +00:00
bee93f9f0d Move glslc to cas to enable remote execution (#155832)
Meta:
`fbsource//xplat/caffe2:gen_torch_vulkan_spv_cpp` takes on average 2 min to build and is one of topmost slow targets in fbandroid.
See: https://fb.workplace.com/groups/2840058936242210/posts/4067730240141734

This target hat to run locally because it uses manifold backend for dotslash. This diff moves the `glslc` to cas backend so that it can run on RE.

Here are commands executed:
```
% manifold get dotslash_glslc/flat/glslc-linux-x86_64.tar.gz
% manifold get dotslash_glslc/flat/glslc-macos-v2024_4.tar.gz
% manifold get dotslash_glslc/flat/glslc-windows-v2024_3.tar

% ls
-rw-r--r--  1 navidq  staff   2.0M Jun 12 10:02 glslc-linux-x86_64.tar.gz
-rw-r--r--  1 navidq  staff   4.7M Jun 12 10:03 glslc-macos-v2024_4.tar.gz
-rw-r--r--  1 navidq  staff   4.4M Jun 12 10:03 glslc-windows-v2024_3.tar

% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-linux-x86_64.tar.gz
ea5d674e0e7e9782be3f5c309e3484732e5b3a331cbe3258f3e929002811627b:2072937

% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-macos-v2024_4.tar.gz
1331dc691835e4676832b7c21ef669083a3acc8856981583d0698192f466c51a:4898649

% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-windows-v2024_3.tar
76181fbb1ce5c62d0c905db26df3a64e999d0baff2e93270775921daa91e3a1a:4585984
```

Differential Revision: [D76513735](https://our.internmc.facebook.com/intern/diff/D76513735/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155832
Approved by: https://github.com/GregoryComer
2025-06-13 14:38:51 +00:00
ce6e0523f9 Revert "[BE] Raise NotImplementedError (#155470)"
This reverts commit 5ab6a3fb6fd37c542060c606edd4b95c7e3cae82.

Reverted https://github.com/pytorch/pytorch/pull/155470 on behalf of https://github.com/malfet due to foreach tests are failing on ROCm because we are not running the same on CUDA ([comment](https://github.com/pytorch/pytorch/pull/155470#issuecomment-2970592124))
2025-06-13 14:32:50 +00:00
3819584f12 [precompile] Implement PrecompileContext for recording precompile artifacts, integrate with CompilePackage (#154415)
This PR implements a basic interface and test for PrecompileContext, a special CacheArtifactManager specifically designed for precompile. The job of a PrecompileContext is to record things precompile needs as torch is compiling,  dump it all into bytes, and then stitch it back together into a cache of callables.

## Why use CacheArtifactManager?
Precompile needs a way to record various serializable data as torch is compiling. CacheArtifactManager already does this today pretty well, handling a lot of serialization and cache information. So we're reusing a bunch of that infrastructure directly.

## How is it different from CacheArtifactManager?
Unlike regular CacheArtifactManager, PrecompileContext needs to be able to take the recorded artifacts and stitch them together after deserialization, to create a single working callable.
Since PrecompileContext doesn't need the cache keys, the "key" field of PrecompileArtifacts can be used for metadata relating to how to stitch the individual functions being compiled together into a full callable. For example, on a given dynamo compile, if there are multiple functions (via graph breaks or recompiles) being compiled, MegaCache would organize it like so:

![image](https://github.com/user-attachments/assets/49a0a75b-1e7f-4d96-8d81-6769fe5a53ca)

Whereas we'd visualize PrecompileContext's result like so:

![image](https://github.com/user-attachments/assets/fcc0dd4e-dfbf-4b13-9c08-2e99b373180b)

For now, we just handle eager mode; in the diff above, I'll hook up the other backend artifacts from PrecompileContext.

After this PR, precompile consists of three main interfaces:

### CompilePackage
- Everything needed to run one torch.compile'd function (including graph breaks)
- `__init__(fn, cache_entry)` Initializes with a DynamoCacheEntry
- `install(backends)` load precompile artifacts into function's dynamo state with a dictionary of backends
- `cache_entry()` return a serializable cache entry to save

### DynamoStore
- Responsible for tracking CompilePackages on disk (and/or in memory)
- `load_package(path)`: load a package given a torch compiled function and a path to the cache artifact
- `save_package(package, path): Save a CompiledPackage to a path. Calls PrecompileContext to grab backend data
- `record_package(package)`: Record a package to PrecompileContext (for global serialization/deserialization)

### PrecompileContext
- Overarching context for serializing and deserializing precompile artifacts. Supports **global** and **local** setups.
- `serialize()`: (Global) serializes all artifacts in PrecompileContext into bytes
- `populate_caches(bytes)`: (Global) takes serialized bytes and puts them into DynamoStore (TODO)
- `serialize_artifact_by_key(key)`: (Local) serialize a single artifact by its cache key

<img width="1455" alt="image" src="https://github.com/user-attachments/assets/99b61330-7607-4763-bdbc-85b366e82cdd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154415
Approved by: https://github.com/zhxchen17
ghstack dependencies: #155118
2025-06-13 14:11:24 +00:00
b2fc9cfea1 [precompile] Add CompilePackage to serialize dynamo states. (#155118)
Adding a per torch.compile() object CompilePackage which tracks dynamo artifact. CompilePackage is considered a low level component and should not be directly exposed to end users. It has the following interface:

1. `CompilePackage.__init__()` which optionally takes previously serialized dynamo states.
     a. when `dynamo` argument is None, it will contruct a brand new CompilePackage object.
     b. when `dynamo` argument is not None, it will load a pre-compiled dynamo state.
2. `package.save()` which dumps the dynamo states into _DynamoCacheEntry.
3. `package.install(backends)` which will handle all the side-effectful global scope updates with compiled functions and resume functions.

This diff focus on making the low level mechanism for precompile. It will be left to upper level interface to use these API to build more user-facing frontend.

Differential Revision: [D75956538](https://our.internmc.facebook.com/intern/diff/D75956538/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155118
Approved by: https://github.com/jamesjwu

Co-authored-by: James Wu <jjwu@meta.com>
2025-06-13 13:54:10 +00:00
670dab6c63 [AOTI] Enable OP test__weight_int4pack_mm_with_scales_and_zeros in AOTI. (#155780)
The op test__weight_int4pack_mm_with_scales_and_zeros is for Intel GPU. It is functionally equivalent to the CUDA/CPU op test__weight_int4pack_mm (with the constraint that oneDNN only supports integer zero points, which is why we need this API). Since test__weight_int4pack_mm is already included in AOTI's fallback list, this PR adds support for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155780
Approved by: https://github.com/jansel
2025-06-13 11:12:13 +00:00
463fe36532 fix error message on specialization with Dim.DYNAMIC (#155738)
Previously specialization error messages would render sources that were pretty far from source-code names. E.g., given args named `x, y, zs`, the source for `y.size()[0]` would be rendered as `args[0][1].size()[0]`.

This is because we created artificial local names following `(args, kwargs)` structure instead of reusing signatures. This PR fixes that situation.

Basically we map prefixes of key paths that correspond to original arg names to root sources corresponding to those names; the rest of the key paths hang from these root sources.

Differential Revision: [D76461391](https://our.internmc.facebook.com/intern/diff/D76461391/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155738
Approved by: https://github.com/bobrenjc93
2025-06-13 10:33:46 +00:00
6abe450a6f [pytorch Aten] Delete unused duplicate clamp_stub, to avoid compile error (#154631)
I found the `clamp_stub` in `UnaryOps.h` is not used. And it's a duplicate of the `clamp_stub` in `TensorCompare.cpp`:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/TensorCompare.cpp#L313-L314

This diff/PR deletes it as this duplicate caused build failure for me:
```
ATen/native/UnaryOps.h:109:1: error: redefinition of 'clamp_stub_DECLARE_DISPATCH_type'
```

Differential Revision: [D75612521](https://our.internmc.facebook.com/intern/diff/D75612521/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154631
Approved by: https://github.com/Skylion007, https://github.com/cyyever, https://github.com/nautsimon
ghstack dependencies: #154589, #154591
2025-06-13 10:01:51 +00:00
1cc31b213d [MTIA Aten Backend] Migrate bitwise_and.Tensor_out (#154591)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

- Migrate where.self and where.self_out
- Add tests for dtype casting and shape broadcasting

Differential Revision: [D75578498](https://our.internmc.facebook.com/intern/diff/D75578498/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154591
Approved by: https://github.com/malfet
ghstack dependencies: #154589
2025-06-13 10:01:51 +00:00
65b9c13cce [Intel GPU] Enable safe softmax for XPU SDPA (#151999)
Fix https://github.com/intel/torch-xpu-ops/issues/1432#event-16899653975

When one row of Q*K attention score is masked with `-inf`, `softmax(score)` would output `NaN` for whole row which would cause model corruption.

With this new flag, it would output `0` for whole row which is aligned with Pytorch CPU/CUDA's behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151999
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-13 08:53:47 +00:00
56b03df6ac [MTIA Aten Backend] Migrate where.self and where.self_out (#154589)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

- Migrate where.self and where.self_out
- Add tests for dtype casting and shape broadcasting

Differential Revision: [D75577304](https://our.internmc.facebook.com/intern/diff/D75577304/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154589
Approved by: https://github.com/malfet
2025-06-13 08:25:13 +00:00
3d595fd559 update get start xpu (#151886)
update link and product name
add print to print ```torch.xpu.is_available()``` result in code snippet for user not using command python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151886
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-13 07:46:13 +00:00
53d06e18d9 [dynamo] add missing algorithm header (#154754)
Needed for `std::max(<initializer-list>)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154754
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2025-06-13 06:56:11 +00:00
6020440683 remove allow-untyped-defs from adaround_fake_quantize.py (#155621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155621
Approved by: https://github.com/Skylion007
2025-06-13 06:14:22 +00:00
99e99d5bfe [a2av] Test must allocate tensors symmetrically (#155835)
This is a requirement of most SHMEM backends. Otherwise, allocations may misalign across ranks.

In this PR, we make the (total) input size and output size a constant number, even though the split sizes are created random. (Previously we sum the splits up as input size, which creates misalignment in SHMEM heap across ranks).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155835
Approved by: https://github.com/fduwjj, https://github.com/fegin, https://github.com/Skylion007
ghstack dependencies: #155506
2025-06-13 06:05:38 +00:00
0860606729 [export] Add meta[val] to getattr nodes (#154934)
Fixes [P1830293318](https://www.internalfb.com/intern/paste/P1830293318/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154934
Approved by: https://github.com/yushangdi, https://github.com/muchulee8
2025-06-13 05:48:21 +00:00
25717da8c8 [BE] Don't run the same tests on 2xlarge and 4xlarge (#155859)
Also, speedup builds by moving them to 4xlarge instances

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155859
Approved by: https://github.com/ZainRizvi, https://github.com/wdvr
2025-06-13 05:40:20 +00:00
a87dfc7480 [symm_mem] Update CMakeList to reflect code moving a dedicated folder (#155823)
We moved all symm_mem code into a folder ([CudaDMAConnectivity](https://github.com/pytorch/pytorch/pull/155573)) but somehow forgot update for CudaDMAConnectivity in the CMakeList.

Users see errors: RuntimeError: DMA connectivity detector for cuda over nvlink is not available while torch.distributed.init_process_group(backend=backend). So this PR should fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155823
Approved by: https://github.com/Skylion007
2025-06-13 05:27:59 +00:00
70bb34929a Convert to .md: draft_export.rst, export.ir_spec.rst, fft.rst (#155567)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155020. This PR is split into 3 to pass sanity check.

Docs comparison (check out the 'new' whenever docs build)

1. draft_export ([old](https://docs.pytorch.org/docs/main/draft_export.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/draft_export.html))
2. export.ir_spec ([old](https://docs.pytorch.org/docs/main/export.ir_spec.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/export.ir_spec.html))
3. fft ([old](https://docs.pytorch.org/docs/main/fft.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/fft.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155567
Approved by: https://github.com/svekars
2025-06-13 05:19:43 +00:00
b878ca0c91 [cutlass backend] add fp8 to cutlass benchmark script (#155507)
Summary:
Add fp8.

Right now FP8 only allows fast_accum.

Test Plan:
```
Experiment group: _scaled_mm (8192x8192, 8192x8192) torch.float8_e4m3fn
+-----------------------+--------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | teraflops (TFLOPS) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
|         aten          | 967.1226739883423  | 1136.8895149998868 |  1.219131228979677   |         NA         |
|        triton         | 1764.6185159683228 |  623.08743664783   |  20.373826419003308  | 82.46067054670186  |
| triton_persistent_tma | 1769.0335512161255 | 621.5323768280928  |  20.48663099599071   | 82.91718297956578  |
|  cutlass_lvl_default  | 790.5075550079346  | 1390.8932568835019 |  13.788519630907103  | -18.26191482535096 |
|   cutlass_lvl_3332    | 803.7384748458862  | 1367.996757884245  |  226.81587297911756  | -16.89384434227684 |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
```

Rollback Plan:

Differential Revision: D76310809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155507
Approved by: https://github.com/ColinPeppler
2025-06-13 05:11:15 +00:00
2ba930d4ce Convert rst to markdown - profiler.rst #155031 (#155559)
Fixes https://github.com/pytorch/pytorch/issues/155031

* [profiler.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/profiler.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155559
Approved by: https://github.com/svekars
2025-06-13 05:02:54 +00:00
e8b3dfa7c0 convert jit_language_reference.rst to jit_language_reference.md (#155633)
Part of changes https://github.com/pytorch/pytorch/issues/155023 (parent PR https://github.com/pytorch/pytorch/pull/155429)

- converted jit_language_reference.rst to jit_language_reference.md

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155633
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 04:58:28 +00:00
3f65e38b73 Convert hub.rst to hub.md (#155483)
Part of changes https://github.com/pytorch/pytorch/issues/155023 (parent PR https://github.com/pytorch/pytorch/pull/155429)

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155483
Approved by: https://github.com/svekars
2025-06-13 04:39:55 +00:00
0a6b66c881 Inductor comms reorder logs to tlparse (#155737)
Hacked test_inductor_collectives test to demonstrate this works:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/whc/de50ff33-f460-406b-bfa9-457e6e17395b/custom/-_0_0_0/reorder_communication_preserving_peak_memory_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Follow up: it would be nice to move the logging out of this pass and
into the broader comms pass loop, where the before/after each pass
visualization could be logged into the same tlparse file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155737
Approved by: https://github.com/bdhirsh
2025-06-13 02:59:42 +00:00
f151b20123 [AOTI] Remove the emit_current_arch_binary option (#155768)
Summary: Remove the option as generating fatbin with PTX only doesn't work on H100, so switch to always include one PTX and one SASS for fatbin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155768
Approved by: https://github.com/angelayi
2025-06-13 02:06:07 +00:00
020da74437 [Easy] Remove empty file (#155796)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155796
Approved by: https://github.com/malfet
ghstack dependencies: #155772
2025-06-13 01:42:11 +00:00
905b194a2e Replace device check of TORCH_INTERNAL_ASSERT with TORCH_CHECK (#155318)
Fixes #136849

## Test Result

```python
>>> import torch
>>> device = torch.cuda.device_count() + 1
>>> torch.cuda.current_stream(device) #  INTERNAL ASSERT FAILED
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1083, in current_stream
    streamdata = torch._C._cuda_getCurrentStream(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)

>>> torch.cuda.default_stream(device) #  INTERNAL ASSERT FAILED
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1101, in default_stream
    streamdata = torch._C._cuda_getDefaultStream(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)

>>> torch.cuda.set_per_process_memory_fraction(0.5, device)  #  INTERNAL ASSERT FAILED
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/cuda/memory.py", line 193, in set_per_process_memory_fraction
    torch._C._cuda_setMemoryFraction(fraction, device)
RuntimeError: Allocator not initialized for device : did you call init?

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155318
Approved by: https://github.com/albanD
2025-06-13 01:20:19 +00:00
d7e657da35 pyfmt lint more torch/utils files (#155812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155812
Approved by: https://github.com/Skylion007
ghstack dependencies: #155782, #155783
2025-06-12 23:51:42 +00:00
4d3ecefda5 [aoti][mps] Use cpp sym-expr printer (#155646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155646
Approved by: https://github.com/desertfire
ghstack dependencies: #155752, #154287, #155582, #155583
2025-06-12 23:33:28 +00:00
2e65d72e1e [aoti][mps] Fix int/symint kernel args (#155583)
Integer arguments to mps kernels need to go through a different function, since `aoti_torch_mps_set_arg` only takes a Tensor. So I added a `aoti_torch_mps_set_arg_int`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155583
Approved by: https://github.com/desertfire
ghstack dependencies: #155752, #154287, #155582
2025-06-12 23:33:28 +00:00
ffbda61fbe [aoti][mps] Fix dynamic dispatch size (#155582)
In the case where we pass in a symint to the `dispatch` call, the compiler errors, so we need to cast the input to int64_t.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155582
Approved by: https://github.com/malfet
ghstack dependencies: #155752, #154287
2025-06-12 23:33:15 +00:00
a4ab392251 [aoti][mps] mps constants support (#154287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154287
Approved by: https://github.com/malfet
ghstack dependencies: #155752
2025-06-12 23:33:07 +00:00
8821a9dc4e [BE][aoti][mps] Fix tests to use common function (#155752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155752
Approved by: https://github.com/desertfire, https://github.com/malfet
2025-06-12 23:32:59 +00:00
5ab6a3fb6f [BE] Raise NotImplementedError (#155470)
When op is unimplemented for a specific dtype

Which makes more sense, than a RuntimeError

Example
```python
>>> import torch
>>> torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
NotImplementedError: "hardshrink_cpu" not implemented for 'Long'
```

release notes bc-breaking: After this release `NotImplementedError` exception will be raised when ATen operation is called on the combinaiton of input tensor dtypes it has not been implemented for

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155470
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-06-12 23:19:12 +00:00
d9b8369f39 fix warning spam for list indexing (#155815)
Per title, #154806 incorrectly placed a warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155815
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-06-12 23:07:24 +00:00
2903e5ad3c pyfmt lint more export files (#155783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155783
Approved by: https://github.com/Skylion007
ghstack dependencies: #155782
2025-06-12 23:04:11 +00:00
86b1116f22 pyfmt lint torch/_custom_op/* (#155782)
file torch/_custom_op/functional.py does not exisits
file torch/_custom_op/__init__.py is empty.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155782
Approved by: https://github.com/Skylion007
2025-06-12 23:04:11 +00:00
4cdbdcdbcf Switch to miniconda for ROCm CI (#155239)
Related to https://github.com/pytorch/pytorch/issues/148335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155239
Approved by: https://github.com/jeffdaily
2025-06-12 22:55:47 +00:00
f04fd4dc4e typing: allow integer in bitwise operations (#155704)
Fixes #155701 (false positives)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155704
Approved by: https://github.com/Skylion007, https://github.com/aorenste
2025-06-12 22:40:17 +00:00
938515fa75 [aoti] Update cshim for all backends (#155604)
Fixes https://github.com/pytorch/pytorch/issues/155349
`python torchgen/gen.py --update-aoti-c-shim` will now update all cpu/cuda/mps/xpu shims -- I verified this using `aten._print.default`, but didn't commit the changes since I'm not sure if we actually want to add this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155604
Approved by: https://github.com/desertfire, https://github.com/janeyx99
2025-06-12 22:10:58 +00:00
38bfd462b8 Use swap_tensors path in nn.Module.to for FakeTensor (#152539)
Fixes https://github.com/pytorch/pytorch/issues/148977

Differential Revision: [D76458023](https://our.internmc.facebook.com/intern/diff/D76458023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152539
Approved by: https://github.com/albanD
2025-06-12 22:08:21 +00:00
db01f1032f Support XPU in memory tracker (#150703)
This PR adds support for XPU devices to the distributed MemoryTracker tool, including unit test for XPU.

Specifically, this code adds tracking for a few alloc-related statistics for XPUCachingAllocator. It also adapts the existing memory tracker tool to be device agnostic, by getting the device module and recording the necessary memory stats. (I get the device module instead of using `torch.accelerator` methods, as that API is still in-progress.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150703
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/d4l3k
2025-06-12 21:33:52 +00:00
154a39bfbd basic compile support for grouped_mm (#153384)
grouped_mm is used in torchtitan, this adds just enough support in compile to allow inductor to lower it as a fallback kernel. I imagine that at some point in the future it may be valuable to get inductor to support templating grouped_mm, although this PR just provides basic support. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @ngimel @eellison

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153384
Approved by: https://github.com/eellison
2025-06-12 21:24:51 +00:00
f2b44424a1 [ROCm] Skip *_stress_cuda and test_ddp_apply_optim_in_backward* (#155724)
These tests are flaky on ROCm and have been skipped via Github issues, but the bot keeps closing the issues after not observing the failures for these tests in the rerun_disabled_tests runs (not sure why they don't fail there), and we have to keep reopening them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155724
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-06-12 21:18:04 +00:00
590fe4d2d7 Skip updating the default device distributed backend if already registered (#155320)
Motivation:

PyTorch maintain a `default_device_backend_map` https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L269 , which indicates the default distributed backend if no backend name is specified in user frontend (like `init_process_group`).

Currently, `"xpu": XCCL` is also in this `default_device_backend_map`. However,  if another process group name is registered as XPU distributed backend, it immediately replaces XCCL in this default map, which is not what we want.

Therefore, we would like to skip updating the default distributed backend if one is already registered in the map.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155320
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-06-12 21:17:06 +00:00
29391c7cf9 [ez] Mark linalg svd memory allocation test as serial b/c OOMing on cu128 (#155811)
9df2e8020f/1

8e8d4b13b0 (43980565863-box)

started OOMing after switching to cuda 12.8

Maybe b/c I made some changes fix the per process memory fraction so each proc has fewer memory
```
2025-06-12T15:29:50.4998758Z FAILED [0.0124s] test_linalg.py::TestLinalgCUDA::test_svd_memory_allocation_cuda_complex128 - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.10 GiB. GPU 0 has a total capacity of 7.43 GiB of which 6.85 GiB is free. Process 80272 has 68.75 MiB memory in use. Process 83346 has 68.75 MiB memory in use. Process 83365 has 374.75 MiB memory in use. Process 83384 has 70.75 MiB memory in use. 2.90 GiB allowed; Of the allocated memory 240.00 MiB is allocated by PyTorch, and 2.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155811
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman, https://github.com/eqy
2025-06-12 21:05:32 +00:00
093fd47dbe Add a Additional Example that showcases the usage of torch.autograd.functional.jacobian (#155683)
Fixes #132140

As described in the issue, I've added an example that showcases the use of higher-dimensional inputs and outputs, batched inputs, and the vectorize=True with `torch.autograd.functional.jacobian`.

Could you please review?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155683
Approved by: https://github.com/soulitzer
2025-06-12 19:46:55 +00:00
e6d71f3789 Support re-sharding for safetensors checkpoints (#154519)
This change will add the ability to support re-sharding for hf safetensors checkpoints.
This is done by adding more metadata when saving each file. This metadata captures the size and offset of the saved shard. This can be used to re-shard on load by using this information to create the chunks belonging to TensorStorageMetadata class.

Differential Revision: [D75226344](https://our.internmc.facebook.com/intern/diff/D75226344/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154519
Approved by: https://github.com/saumishr
2025-06-12 19:38:29 +00:00
d3da03d6fa [2/n]passing event log handler to record function calls (#155457)
Summary: This diff modifies the elastic agent's API to pass the event log handler to the record function calls. This change enables the elastic agent to log events to a specific destination, improving the monitoring and debugging capabilities of the distributed training process.

Test Plan:
unit tests

ran an e2e training job.

Differential Revision: D75194115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155457
Approved by: https://github.com/d4l3k
2025-06-12 19:35:08 +00:00
e085012335 Fix #155020 - rst2markdown for export.rst (split PR) (#155753)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155020. This PR is split into 3 to pass sanity check. This is the 3rd one.

Docs comparison (check out the 'new' whenever docs build)

1. export ([old](https://docs.pytorch.org/docs/main/export.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/export.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155753
Approved by: https://github.com/sekyondaMeta
2025-06-12 19:30:52 +00:00
4bb936d8b7 refresh expected results (#155817)
some changes landed when the test is recently unstable with out updating the results.
<img width="564" alt="Screenshot 2025-06-12 at 9 26 32 AM" src="https://github.com/user-attachments/assets/9a83f18b-f2a8-485d-a58e-67d8c161eb18" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155817
Approved by: https://github.com/yushangdi
2025-06-12 19:14:21 +00:00
7986c0dba6 rename distributed.rst to md (#155767)
Fixes #155019

For sanity checks, split PR to have this one only include distributed.rst -> distributed.md

Preview -> [distributed.md](https://docs-preview.pytorch.org/pytorch/pytorch/155767/distributed.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155767
Approved by: https://github.com/sekyondaMeta
2025-06-12 18:42:15 +00:00
bcad962550 [BE][Testing] Delete some unused code (#155760)
- Fix typo in class name `OpenRgistration`->`OpenRegistration`
- Use existing `common` alias of `torch.testing._internal.common_utils`, i.e. `s/torch.testing._internal.common_utils.markDynamoStrictTest/common.markDynamoStrictTest/`
- Remove unused `TEST_CUDA` and `TEST_ROCM` are unused in that file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155760
Approved by: https://github.com/albanD
2025-06-12 18:41:53 +00:00
fac0cc16ef [scan] fix doc of scan and list the restrctions. (#155577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155577
Approved by: https://github.com/zou3519
2025-06-12 18:22:28 +00:00
a1257446f8 [AOTInductor] Memory leak fix for Fallback Kernels (#155642)
Summary:
We generate AtenTensorHandles for Fallback kernels regardless of the arg
type. If we indeed "fallback", we will regenerate the AtenTensorHandles
that will cause the first handle being generated not recycled, thus a
memory leak would occur.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_fallback_mem_leak

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155642
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-06-12 17:42:56 +00:00
0d3d84d866 [CD] Windows Magma build 12.9 and cuda scripts (#155799)
Scripts needed to build Magma and CUDA on windows
Same as https://github.com/pytorch/pytorch/pull/146653
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155799
Approved by: https://github.com/jeanschmidt
2025-06-12 17:41:24 +00:00
430cc1c636 Run tests on Amazon EC2 M8g Instances (#153940)
Requires machines configured here: https://github.com/pytorch/test-infra/pull/6642

This adds additional test runs against AWS Graviton4 processors, alongside existing runs against AWS Graviton3 and AWS Graviton2 processors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153940
Approved by: https://github.com/fadara01, https://github.com/malfet
2025-06-12 17:33:08 +00:00
522a18bd6c Fix provenance unit test (#155747)
Summary: Fix the test to adapt added provenance tracking in D75837494

Test Plan:
```
 buck2 run @//mode/dev-nosan  fbcode//caffe2/test:fx -- -r test_graph_provenance
```

Rollback Plan:

Differential Revision: D76466778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155747
Approved by: https://github.com/YUNQIUGUO
2025-06-12 17:26:43 +00:00
50d8168c8b [DTensor] Support in gradient placement for local_map() (#155181)
Support `in_grad_placements` argument in torch.distributed.tensor.experimental.local_map().  The argument helps enforce placement of gradient of the input Dtensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155181
Approved by: https://github.com/wanchaol
2025-06-12 17:07:04 +00:00
6c0b42fd2f [inductor][cutlass backend] Log prescreening elpase (#155508)
Differential Revision: [D76311352](https://our.internmc.facebook.com/intern/diff/D76311352/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155508
Approved by: https://github.com/jingsh
2025-06-12 16:48:52 +00:00
c1ae768baa Basic MTIA ATen CMake (#155477)
Summary: Basic ATen CMake

Differential Revision: D75203592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155477
Approved by: https://github.com/andyanwang, https://github.com/cyyever
2025-06-12 16:29:32 +00:00
f4376cac54 unify symbolic_shapes and sizevars dynamic shapes APIs naming 1 (#154774)
Inductor have a set of APIs that allows performing symbolic evaluations similar to that of symbolic shapes
but it operates on sympy expressions instead of symnodes. Namings are not consistent making them consistent
in this stack.

Step 1 : unify statically_know_true naming! for consistent experience.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154774
Approved by: https://github.com/drisspg, https://github.com/bobrenjc93, https://github.com/eellison
2025-06-12 16:11:55 +00:00
9df2e8020f fix code indentation for fx.md (#155764)
Fixes https://github.com/pytorch/pytorch/issues/155023
Related PR: #155482

Description:
As discussed here https://github.com/pytorch/pytorch/pull/155482#pullrequestreview-2918032289, I removed indentation for python code blocks as a follow-up modification for fx.md

Checklist:

- [x] The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155764
Approved by: https://github.com/svekars
2025-06-12 16:02:33 +00:00
75824035d3 [dynamic shapes] skip fused linear path if not definitely contiguous (#155051)
Falls back to non-fused linear -> add bias path for non-contiguous tensors with unbacked sizes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155051
Approved by: https://github.com/laithsakka
2025-06-12 15:55:21 +00:00
51560797ce [CI] Reuse old whl: switch default to always (#155572)
Switch default to always reuse old whl

I have a few worries about API rate limits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155572
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman
2025-06-12 15:43:29 +00:00
62fa3f5aeb Support tuning of _grouped_mm (#153953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153953
Approved by: https://github.com/ngimel
2025-06-12 15:39:35 +00:00
6b3eef6d31 [cutlass backend] Only consider to use re worker if nvcc doesn't exist (#155745)
Differential Revision: D76463340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155745
Approved by: https://github.com/masnesral
2025-06-12 15:23:52 +00:00
851a6fa82d [MPS] Migrate softshrink (forward and backward) to Metal kernel (#155586)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155586
Approved by: https://github.com/malfet
ghstack dependencies: #155304, #155316, #155462, #155479, #155571
2025-06-12 15:02:43 +00:00
2a3b41cbd0 Revert "[CI] Use setup-python from for Mac tests (#155698)"
This reverts commit 2b9d638e3333e6e9ae324e1486774e83292e1883.

Reverted https://github.com/pytorch/pytorch/pull/155698 on behalf of https://github.com/malfet due to It causes weird flaky failures in MPS and do not upload usage logs anymore ([comment](https://github.com/pytorch/pytorch/pull/155698#issuecomment-2967120676))
2025-06-12 14:42:32 +00:00
0fd711df19 [export] Allow user frame to be None when symbolic shape tries to get stacktrace. (#155744)
Summary: Fixing https://github.com/pytorch/pytorch/issues/155605

Test Plan:
CI

Rollback Plan:

Differential Revision: D76463358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155744
Approved by: https://github.com/angelayi
2025-06-12 14:36:29 +00:00
dd1b6621bc Remove C10_DEPRECATED references in c10 (#151058)
Summary:
Revive https://github.com/pytorch/pytorch/pull/138406.  Only limit the scope to files in c10.

Summary from the original PR,
```
Looking in the code I see

// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
But looking at the MSVC C++ support table I see that the [[deprecated]] attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 or later.

Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support [[deprecated]].

Therefore, since we are finished deprecating old MSVCs we can deprecate C10_DEPRECATED.
```

Test Plan: CI

Differential Revision: D72762767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151058
Approved by: https://github.com/r-barnes
2025-06-12 13:38:03 +00:00
d632cf2cc9 [Easy][Code Clean] Remove the unused and undefined function in pickler (#155772)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155772
Approved by: https://github.com/malfet
2025-06-12 13:03:36 +00:00
8e8d4b13b0 [XPU] Simplify XPU make triton by install from PyTorch source (#155675)
Remove install from source code build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155675
Approved by: https://github.com/atalman
2025-06-12 13:02:23 +00:00
132babe7e0 [user triton] dynamo support for new host-side TMA API (#155662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155662
Approved by: https://github.com/aakhundov
ghstack dependencies: #155510
2025-06-12 12:56:23 +00:00
9cced33c7c [BE]: Update cudnn to 9.10.2.21 (#155576)
Update to CUDNN 9.10.2.21
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155576
Approved by: https://github.com/eqy, https://github.com/atalman
2025-06-12 12:50:36 +00:00
c199a4d0fd Move non inductor workflows cuda 12.6->cuda 12.8 (#155234)
Move non inductor workflows cuda 12.6->cuda 12.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155234
Approved by: https://github.com/Skylion007, https://github.com/zxiiro, https://github.com/cyyever, https://github.com/malfet
2025-06-12 12:42:34 +00:00
eecaa0bbc6 [Multiprocesing] Fix _release_ipc_counter missing in rebuilding cuda ipc tensor with UntypedStorage (#155312)
Fixes https://github.com/pytorch/pytorch/issues/155311

To avoid `torch.multiprocessing.reductions::rebuild_cuda_tensor` failed on untyped storage, this FIX PR adds the `_release_ipc_counter` into UntypedStorage like the previous legacy typed storage.

e2d141dbde/torch/storage.py (L1466-L1469)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155312
Approved by: https://github.com/mikaylagawarecki
2025-06-12 10:41:58 +00:00
0029259bdf Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757)
address https://github.com/pytorch/pytorch/issues/153303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757
Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel
2025-06-12 09:58:15 +00:00
d3d655ad14 [Hierarchical-Compile] Hash int args in addition to input shapes (#155655)
Fixes Swsl_resnext101_32x16d in TIMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155655
Approved by: https://github.com/anijain2305
2025-06-12 06:35:12 +00:00
c3ecabf059 [inductor][triton pin] add support for new TMA API for mm.py templates (#155723)
Triton 3.4 will remove the experimental TMA APIs: https://github.com/triton-lang/triton/pull/6488

For mm.py templates, this PR adds support for using the new APIs when they are available (and otherwise falls back to the experimental APIs).

For flex_attention, we'll remove TMA support for Triton 3.2 and 3.3 (versions of triton that don't have the new API).

For mm_scaled_grouped.py, https://github.com/pytorch/pytorch/pull/150944 will remove TMA support for Triton 3.2.

Note: we attempted this earlier with https://github.com/pytorch/pytorch/pull/154858, but this broke TMA usage in Triton 3.2.

Differential Revision: [D76444471](https://our.internmc.facebook.com/intern/diff/D76444471)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155723
Approved by: https://github.com/NikhilAPatel
2025-06-12 06:25:47 +00:00
2b9d638e33 [CI] Use setup-python from for Mac tests (#155698)
Instead of `setup-miniconda`
- Remove `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both python and python3 aliases first in the path (not sure what other action are messing with path environment variable)
- Skip `TestMultiprocessing.test_fs_sharing` as even though it completes, it hangs on the shutdown both in CI and in all local setups I have
- Skip `TestCppExtensionOpenRgistration.test_base_device_registration` as it hangs on the shutdown as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
ghstack dependencies: #155476, #155493, #155601, #155515, #155697
2025-06-12 04:58:00 +00:00
57e4d7b5cc [nativert] Move DelegateExecutor to PyTorch core (#155581)
Summary:
Moves DelegateExecutor base class to PyTorch core. It provides the extension point of backend delegation for NativeRT.
Torch Native Runtime RFC: pytorch/rfcs#72

Test Plan:
This is only a virtual base class. So relying on internal CI is sufficient.

Rollback Plan:

Differential Revision: D76351984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155581
Approved by: https://github.com/zhxchen17
2025-06-12 04:33:31 +00:00
a9d5157e25 [dynamo] Use BINARY_SUBSCR for pre-graph bytecode for regular dict accesses (#155727)
vLLM profiler sets with_stack=True that shows the dict_getitem on the profiler, both inflating the numbers and confusing compile users. This PR keeps BINARY_SUBSCR for regular dicts, while using `dict.__getitem__` only for dict subclasses.

Using binary_subscr is little bit faster, but not enough to make any major latency improvements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155727
Approved by: https://github.com/zou3519, https://github.com/StrongerXi, https://github.com/jansel
2025-06-12 04:02:29 +00:00
c9e9a0c823 [inductor][invoke_subgraph] Mark invoke_subgraph outputs as user_visible to constrain output strides (#155395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155395
Approved by: https://github.com/zou3519
2025-06-12 03:58:16 +00:00
9f5153b1a4 Preserve GrpahModule node stack trace after torch package deserializaion re-tracing (#155638)
Summary:
urrently the node.meta["stack_trace"] is not preserved when we torch package/load GraphModule, which means the original stack trace is lost. When we re-trace the packaged graph module, we just get a stack trace like fx-generated._0......

Adding the node.meta["stack_trace"] to torch packaged graph module

Test Plan:
```
buck2 run @//mode/dev-nosan fbcode//caffe2/test:package -- -r  TestPackageFX
```

Rollback Plan:

Differential Revision: D76379692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155638
Approved by: https://github.com/angelayi
2025-06-12 03:48:27 +00:00
ce9ba071fd [BE] Fix warning in open_registration_extension.cpp (#155755)
Namely
```
/Users/nshulga/git/pytorch/pytorch/test/cpp_extensions/open_registration_extension.cpp:306:33: warning: left operand of comma operator has no effect [-Wunused-value]
  306 |   at::Tensor first = at::empty((2,3)).to(at::DeviceType::PrivateUse1);

```

Or switching between Python and C++ is hard
In Python `(2, 3)` creates a tuple, in C/C++ it's just a integral literal 3

P.S. I could have vibe-coded the fix with Claude: https://claude.ai/share/82479e88-84cb-4299-aa2f-dafd28ee2d55

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155755
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-06-12 03:01:30 +00:00
d96dec8415 [export] Fix serialization for call_torchbind hop with as_none argument (#155647)
Summary:
As title.

D75251816 broke one internal test. This diff fixes it.

Test Plan: Internal CI

Differential Revision: D76383202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155647
Approved by: https://github.com/ydwu4
2025-06-12 02:59:03 +00:00
b00b641ff1 [Docs] Convert to markdown: accelerator.rst, amp.rst, autograd.rst, backends.rst, benchmark_utils.rst (#155762)
Fixes #155013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155762
Approved by: https://github.com/svekars
2025-06-12 02:55:06 +00:00
b6f84b3b0f [Inductor][CPU] Use AMX-based microkernels when M > 4 for GEMM template for INT4 weight (#155444)
**Summary**
GEMM templates for INT4 weights are used for lowering `aten._weight_int4pack_mm_for_cpu` with Inductor when max-autotune is on. Currently, AMX-based microkernels are used only when M >= 16 if input tensor has shape [M, K]. However, we find that AMX kernel brings performance benefit when 4 < M < 16. For example, on a 6th gen of Intel(R) Xeon(R) platform, E2E latency can be improved by up to > 20% when running Llama-3.1-8B on 32 cores for M = 8. So, this PR changes the threshold so that AMX is used when M > 4.

**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155444
Approved by: https://github.com/sanchitintel, https://github.com/leslie-fang-intel
2025-06-12 02:28:48 +00:00
212575f994 [ca] Annotate AccumulateGrad branching and add polyfill tests (#155289)
Annotates AccumulateGrad and tracks the semantics for AccumulateGrad's polyfill , except for Scenario 1.4: Cloning MKLDNN new_grad and Scenario 2.2: Vmap-incompatible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155289
Approved by: https://github.com/jansel, https://github.com/albanD
2025-06-12 02:10:52 +00:00
d84efde3f0 Move _storage_Use_Count to be gerneric (#155451)
# Motivation
`torch._C._storage_Use_Count` should be a generic API that is not aware of device type. It is also used in 337cd7c53d/torchtune/training/_activation_offloading.py (L323) to do some memory optimization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155451
Approved by: https://github.com/albanD
2025-06-12 01:39:04 +00:00
8372d0986a Revert "[PT2][partitioners] Add aten.split to view_ops list (#155424)"
This reverts commit e1db10e05aa720aef1989773adcf48f311bcf920.

Reverted https://github.com/pytorch/pytorch/pull/155424 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cpu_repro.py::CPUReproTests::test_transpose_with_norm [GH job link](https://github.com/pytorch/pytorch/actions/runs/15596830833/job/43931044625) [HUD commit link](e1db10e05a) but idk how, reverting to see if it fixes the problem ([comment](https://github.com/pytorch/pytorch/pull/155424#issuecomment-2964717706))
2025-06-12 01:38:34 +00:00
9b122aab5d Fix set per proc memory fraction when running tests (#155631)
env setting needs to happen before pool creation for it to take effect

In theory this should fix some OOMs and also cause some OOMs, but this PR is green so idk

alt options: use initializer?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155631
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman
2025-06-12 01:28:08 +00:00
8ad6197b46 [draft export] avoid storing intermediate real tensors in proxies (#154630)
Handles GC for non-strict draft export; GPU memory usage shouldn't be much more than eager mode + input tensors now.

While trying to do draft export CPU offloading, I found out GC is feasible, because in non-strict, there's 2 places holding references to a `.real_tensor` attribute:
1) the FakeTensors in fake tensor prop, but these are held by the actual variables in the model's forward call, and so the real tensor gets gc-ed along with the fake one when the variable goes out of scope.
2) A clone of the fake tensor in 1) stored in `proxy.node.meta["val"]`, which was added in https://github.com/pytorch/pytorch/pull/150948. But we didn't actually need to store them on intermediate values; the placeholders are enough for retracing/lowering.

Avoiding storing the intermediate values in 2), the values in 1) should be naturally GC-ed, and the real-tensor memory usage for non-strict should be pretty similar to eager computation?

Strict still OOMs; dynamo still holds these in variable tracking, and not sure how to GC those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154630
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-06-12 01:18:57 +00:00
4e19477196 [nativert] Move Pytree (#155136)
Summary: fbcode/sigmoid/core/common -> fbcode/caffe2/torch/nativert/common

Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:pytree_test
```

OSS CI

Rollback Plan:

Differential Revision: D75965059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155136
Approved by: https://github.com/zhxchen17, https://github.com/XuehaiPan, https://github.com/zou3519
2025-06-12 01:10:34 +00:00
ee5c2908cb [dtensor] refactor PlacementStrategy -> OpSpec, move utils to OpSchema (#155592)
as titled. It's sometimes confusing to use PlacementStrategy as a name,
as we also have OpStrategy and TupleStrategy, the latter two contain
the former, so it is better to make the naming clearer.

Renaming PlacementStrategy -> OpSpec as it is an operator spec that
contains output_spec + input_specs.

Also found some utils can be merged to OpSchema so included together in
this PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155592
Approved by: https://github.com/awgu
2025-06-12 00:51:36 +00:00
7485ef078f Run torch.compile benchmark more frequently on H100 (#155719)
We have more capacity now with 20+ `linux.aws.h100` runners, half of them are idle.  Running benchmark more frequently would utilize these runner better and provide early signals multiple times per day.  Running every 8 hours to start with.  The workflow usually finishes within 5 hours https://github.com/pytorch/pytorch/actions/runs/15578331612/job/43878878434
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155719
Approved by: https://github.com/atalman
2025-06-12 00:24:21 +00:00
9e9484d022 [SymmMem] Enable NVSHMEM for Triton (#155506)
(This is an **Experimental** feature)
Allow Triton kernels to invoke NVSHMEM device functions.

### Example Triton program
Key parts:
- Call `nvshmem.enable_triton()` to initialize;
- Call `nvshmem.putmem_block` in Triton kernel;
- Add `extern_libs` kwarg at kernel invocation.

```
import torch.distributed._symmetric_memory._nvshmem_triton as nvshmem

@triton.jit
def put_kernel(
    dst_ptr,
    src_ptr,
    numel: tl.constexpr,
    peer: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
):
    nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)

if __name__ == "__main__":
    # Enable NVSHMEM for Triton
    nvshmem_lib = nvshmem.enable_triton()

    # Use torch Symmetric Memory to allocate Symmetric tensors
    ...

    peer = 1 - rank
    if rank == 0:
        kernel = put_kernel[(1, 1, 1)](
            dst_ptr,
            src_ptr,
            numel=numel,
            peer=peer,
            BLOCK_SIZE=BLOCK_SIZE,
            extern_libs=nvshmem_lib,
        )

    dist.barrier()
    if rank == 1:
        print(f"Rank {rank}: received {out=}")
```

### Test output:
```
$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put
Rank 0: writing value 5 to Peer 1
Rank 1: received out=tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:1', dtype=torch.int8)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155506
Approved by: https://github.com/ngimel, https://github.com/fegin, https://github.com/fduwjj
2025-06-12 00:22:49 +00:00
cf9878d7a2 Fix #155022 rst to markdown conversion (#155540)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155022

Docs comparison (check out the 'new' whenever docs build)

1. func.ux_limitations ([old](https://docs.pytorch.org/docs/main/func.ux_limitations.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/func.ux_limitations.html))
2. func.whirlwind_tour ([old](https://docs.pytorch.org/docs/main/func.whirlwind_tour.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/func.whirlwind_tour.html))
3. future_mod ([old](https://docs.pytorch.org/docs/main/future_mod.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/future_mod.html))
4. futures ([old](https://docs.pytorch.org/docs/main/futures.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/futures.html))
5. fx.experimental ([old](https://docs.pytorch.org/docs/main/fx.experimental.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155540/fx.experimental.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155540
Approved by: https://github.com/AlannaBurke, https://github.com/svekars
2025-06-12 00:21:22 +00:00
7918978653 [dynamo] uploaded full json file of all unimplemented_v2() calls currently in repository (#155758)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155758
Approved by: https://github.com/williamwen42
2025-06-12 00:17:28 +00:00
a6210fd07b [c10d] Enhance get_process_group_ranks() to accept group=None (#154902)
Summary: This diff enhances the `get_process_group_ranks()` function to accept `group=None` as an optional argument. This allows the function to return all ranks associated with the default process group if no group is specified.

Test Plan:
contbuild & OSS CI

Rollback Plan:

Differential Revision: D75817800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154902
Approved by: https://github.com/wz337
2025-06-11 23:41:03 +00:00
eqy
bd3c32916c [cuDNN] Enabled dilation for deterministic convolutions in cuDNN (#154292)
Provides order-of-magnitude speedup over fallback impl.

https://github.com/pytorch/pytorch/issues/28777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154292
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-11 23:35:52 +00:00
c13e725edd Updates to HFStorageReader to use TensorStorageMetadata instead of BytesStorageMetadata (#154518)
As we prepare to support re-sharding, the current approach of using BytesStorageMetadata to read safetenstors won't work anymore. Before, we didn't need to read the metadata of the safetensors file from its header because we were just loading the contents of the file directly into tensors with safetensor.load() that would handle the metadata and deserialization. But now, in preparation of handling re-sharding, we need to read the metadata directly from the header of the safetensors file and store it directly in TensorStorageMetadata objects so that we can perform re-sharding. Re-sharding won't currently work, as we need extra metadata to be stored on each save, so that will be added in a subsequent PR.
In addition this PR adds an integration test in addition to the unit tests.
It also removes the HfFileSystem import because that's only needed if users are using HfFileSystem, but we want to support any backend.

Differential Revision: [D74891998](https://our.internmc.facebook.com/intern/diff/D74891998/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154518
Approved by: https://github.com/saumishr
2025-06-11 23:35:05 +00:00
1b032384b1 Convert rst files to md (#155369)
Fixes #155021
Fixes #155158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155369
Approved by: https://github.com/svekars, https://github.com/malfet
2025-06-11 23:00:52 +00:00
48921721d8 [MPS] Fix binary builds (#155733)
Introduced by https://github.com/pytorch/pytorch/pull/155611

All functions in those headers must be static and inline
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155733
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-06-11 22:55:33 +00:00
c1446e1e9d [easy] revert unintended changes from #152579 (#155614)
Summary:
I accidentally removed a test and a small change in my pr:
https://github.com/pytorch/pytorch/pull/152579
- `test_load_package_multiple_gpus` from https://github.com/pytorch/pytorch/pull/152093

Rollback Plan:

Differential Revision: D76370555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155614
Approved by: https://github.com/jingsh
2025-06-11 22:54:58 +00:00
4a954fc185 [refactor] make do_auto_functionalize_v2 take HopInstance (#154192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154192
Approved by: https://github.com/zou3519
ghstack dependencies: #155261, #154072, #154191
2025-06-11 22:52:37 +00:00
d6be87648f [hop schema] add schema.tree_spec to support pytree inputs (#154191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154191
Approved by: https://github.com/zou3519
ghstack dependencies: #155261, #154072
2025-06-11 22:52:37 +00:00
6ded656aee [hop] auto functionalize invoke_subgraph (#154072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154072
Approved by: https://github.com/zou3519
ghstack dependencies: #155261
2025-06-11 22:52:28 +00:00
20fb8f5d1f [refactor] make check input alias and mutation easier to use (#155261)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155261
Approved by: https://github.com/zou3519
2025-06-11 22:52:21 +00:00
61e13782dd [inductor] handle -1 for pointless view pairs (#155295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155295
Approved by: https://github.com/laithsakka, https://github.com/jansel
2025-06-11 22:20:36 +00:00
458cc7213b DOC: Convert to markdown: mobile_optimizer.rst, model_zoo.rst, module_tracker.rst, monitor.rst, mps_environment_variables.rst (#155702)
Fixes #155026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155702
Approved by: https://github.com/sekyondaMeta, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-11 22:16:04 +00:00
e1db10e05a [PT2][partitioners] Add aten.split to view_ops list (#155424)
Summary: Add `aten.split` to view_ops list in partitioners.py

Test Plan:
na

Rollback Plan:

Differential Revision: D76011951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155424
Approved by: https://github.com/xuanzhang816
2025-06-11 22:12:13 +00:00
f59c76b549 Revert "[BE]: Update cudnn to 9.10.2.21 (#155576)"
This reverts commit 2d3615f577894c7a117a55e85bb8371bb598ec50.

Reverted https://github.com/pytorch/pytorch/pull/155576 on behalf of https://github.com/malfet due to breaks the same test again (I remember there were a version that adjusted tolerances), see bc3972b80a/1 ([comment](https://github.com/pytorch/pytorch/pull/155576#issuecomment-2964404710))
2025-06-11 22:03:45 +00:00
bc3972b80a [reland] Add stack_trace on make_fx (#155486)
Summary:
Previosuly, we only add stack trace in class _ModuleStackTracer(PythonKeyTracer) for non-strict export. I moved this stack trace logic to the parent class PythonKeyTracer, this way the graph traced from Module using make_fx will have stack_trace as well.

Motivation: we've observed some uses cases where users first use make_fx on the Module, and then run export on the resulting graph. If the result of make_fx doesn't have stack trace, the stack trace information is lost.

**User needs to turn this on by passing in `stack_trace=True` to make_fx. We don't make this the default option since this might increase inductor compilation time (`make_fx` is used in inductor to trace graph patterns for pattern matching). It's also turned on if `_inductor.config.trace.enabled` is True.**

**preserving stack trace is on by default for ModuleStackTracer, which is used for non-strict export.**

Test Plan:
```
buck run test:test_export -- -r  test_stack_trace
buck run fbcode//caffe2/test/dynamo:test_dynamo -- -k test_autocast_ordering
```

Rollback Plan:

Differential Revision: D76298692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155486
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-06-11 21:27:43 +00:00
9bd0830ed8 [dynamic shapes] guard_or_false for cat, repeat (#155290)
Summary:
assumes:
- specified repeats are non-negative
- 1d cat arguments like [u0] aren't non-zero sized (replaces existing size-oblivious)

Test Plan:
test_export

Rollback Plan:

Differential Revision: D76092011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155290
Approved by: https://github.com/laithsakka
2025-06-11 21:03:32 +00:00
4609699bfd [MPS] Migrate leaky_relu (forward and backward) to Metal kernel (#155571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155571
Approved by: https://github.com/malfet
ghstack dependencies: #155304, #155316, #155462, #155479
2025-06-11 20:58:46 +00:00
f8d93b3783 [MPS] Migrate hardswish (forward and backward) to Metal kernel (#155479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155479
Approved by: https://github.com/kulinseth, https://github.com/malfet
ghstack dependencies: #155304, #155316, #155462
2025-06-11 20:58:46 +00:00
db5970c1a6 [coreml-backend-tool] fix pytorch-backended issue on new coremltools (#155543)
Summary:
the new coreml tool is export mlpakage instead mlmodel in default option.  when we use new 8.0 coreml tool to convert to backend, the error is

```
Exception: MLModel of type mlProgram cannot be loaded just from the model spec object. It also needs the path to the weights file. Please provide that as well, using the 'weights_dir' argument.
```

Test Plan:
tested with internal workflow

Rollback Plan:

Differential Revision: D76325462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155543
Approved by: https://github.com/shoumikhin
2025-06-11 20:52:26 +00:00
cec264c8c6 remove single remaining gso from compute_stride (#155635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155635
Approved by: https://github.com/ColinPeppler
2025-06-11 20:36:21 +00:00
cc09d3a5ba remove float args benchmark (#155674)
This benchmark very sensitive. removing it for now until we make it better .

<img width="755" alt="Screenshot 2025-06-11 at 12 01 25 AM" src="https://github.com/user-attachments/assets/01a45ae5-2028-42a2-b819-c30d4db3b5d4" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155674
Approved by: https://github.com/bdhirsh, https://github.com/bobrenjc93
2025-06-11 20:34:58 +00:00
2d3615f577 [BE]: Update cudnn to 9.10.2.21 (#155576)
Update to CUDNN 9.10.2.21
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155576
Approved by: https://github.com/eqy, https://github.com/atalman
2025-06-11 20:32:07 +00:00
94ae615337 [trymerge] Error on ghstack commit with multiple PRs (#154941)
see https://github.com/pytorch/pytorch/issues/154427#issuecomment-2932941343 for context

Errors if do not find 1 match in ghstack commit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154941
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman
2025-06-11 20:26:50 +00:00
b7a73a2cdb Convert to markdown: export.programming_model.rst (#155659)
Converts only export.programming_model.rst to markdown

Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155020, but split into a second PR to pass sanity check

Docs comparison (check out the 'new' whenever docs build)

1. export.programming_model ([old](https://docs.pytorch.org/docs/main/export.programming_model.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155659/export.programming_model.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155659
Approved by: https://github.com/sekyondaMeta
2025-06-11 20:23:46 +00:00
1b6772a90f A small fix in do_bench_using_profiling (#155500)
Summary: Results: https://docs.google.com/document/d/1B_4rtiDFPH_jV3VpnqLPnInwDMpF7yX29G82UoJTcu8/edit?tab=t.0

Test Plan:
```
buck2 run mode/opt  -c fbcode.enable_gpu_sections=true ai_acceleration/float8/benchmarks/bench:bench_fp8_shapes_eval 2>&1 | tee output44.txt
```

Rollback Plan:

Differential Revision: D76298690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155500
Approved by: https://github.com/yoyoyocmu, https://github.com/nmacchioni
2025-06-11 20:06:19 +00:00
1dd0b1d12b Unbreak torch.is_vulkan_available() on Mac (re-send of #154675, please stamp) (#155595)
This is a new PR duplicating #154675 due to merge issues with that PR coming from my old (now updated) version of ghstack.

I am a Vulkan noob, but this extension and flag seem to be necessary. See "Encounted VK_ERROR_INCOMPATIBLE_DRIVER" at https://vulkan-tutorial.com/Drawing_a_triangle/Setup/Instance .

(For anyone trying to repro at home, I have the following homebrew packages installed, not all of which may be necessary: molten-vk, vulkan-headers, vulkan-loader, vulkan-tools, vulkan-utility-libraries. I also have VK_ICD_FILENAMES set to /opt/homebrew/etc/vulkan/icd.d/MoltenVK_icd.json, and I built PyTorch with USE_VULKAN=1. Making sure vkcube works helped me debug this setup.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155595
Approved by: https://github.com/malfet
2025-06-11 19:51:35 +00:00
d1947a8707 Migrate from lru_cache to cache (#155613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613
Approved by: https://github.com/ezyang
ghstack dependencies: #155612
2025-06-11 19:44:18 +00:00
f80a61adf5 Revert "[dynamo] added github_cli to detect unimplemented_v2 calls (#155610)"
This reverts commit 5dd07c70e53a86b73f49711b8186d86dc4f1b32a.

Reverted https://github.com/pytorch/pytorch/pull/155610 on behalf of https://github.com/malfet due to Looks like it fails on every pull request, based on https://github.com/pytorch/pytorch/actions/workflows/check-unimplemented-calls.yml, but it does not run on trunk ([comment](https://github.com/pytorch/pytorch/pull/155610#issuecomment-2963929765))
2025-06-11 19:31:55 +00:00
1e373d02d5 [ONNX] Change deprecation message from 2.8 to 2.9 (#155580)
~~The PR: https://github.com/pytorch/pytorch/pull/152478 did not respect the release policy that the deprecation should happen after the deprecation message has been set for 2 releases. This PR postpone 2.8 to the rightful version 2.10.~~

~~NOTE: "as early as" 2.10 shall give ONNX users more time to adapt and provide feedback.~~

To follow the upcoming torchscript deprecation, `torch.onnx.export` expects to switch dynamo=True (also turn on fallback=True for bc) on torch 2.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155580
Approved by: https://github.com/justinchuby, https://github.com/tugsbayasgalan
2025-06-11 19:31:29 +00:00
3f29642ecf Update XLA pin (#155471)
Update pin after XLA PR https://github.com/pytorch/xla/pull/9312 landed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155471
Approved by: https://github.com/laithsakka
2025-06-11 19:16:52 +00:00
f8baec8984 Update auto-tuning support for _scaled_grouped_mm (#150944)
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not multiple of block size along K
6. Updated meta registration
7. Update synthetic offsets creation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944
Approved by: https://github.com/ngimel, https://github.com/davidberard98
2025-06-11 19:12:52 +00:00
6dfada220e [ca] better error message for subclasses not supported by FakeTensor (#155481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155481
Approved by: https://github.com/jansel
ghstack dependencies: #155473, #155570
2025-06-11 19:09:29 +00:00
5dcc718a77 [dynamo][ci] update PYTORCH_TEST_WITH_DYNAMO xfail/skips script for 3.13 (#155570)
No more 311 runners, tested by generating the files for the next PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155570
Approved by: https://github.com/zou3519
ghstack dependencies: #155473
2025-06-11 19:09:29 +00:00
87b002b6fb [ca] make torch.compile API respect ambient disable contexts (#155473)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155473
Approved by: https://github.com/jansel
2025-06-11 19:09:29 +00:00
be124a61a4 [MPS] Migrate hardsigmoid (forward and backward) to Metal kernel (#155462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155462
Approved by: https://github.com/malfet
ghstack dependencies: #155304, #155316
2025-06-11 19:09:23 +00:00
c04a4e7094 Add types to torch/utils/_triton.py (#155612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155612
Approved by: https://github.com/jamesjwu
2025-06-11 19:04:10 +00:00
2002e3a311 [Docs] Convert to markdown: torch.compiler_transformations.rst, torch.compiler.config.rst (#155347)
Part of changes #155040 (parent PR #155120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155347
Approved by: https://github.com/svekars
2025-06-11 18:55:30 +00:00
925fbfca27 Convert fx.rst to fx.md (#155482)
Part of changes #155023 (parent PR #155429)

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155482
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-11 18:46:35 +00:00
4d9d884c3f [NCCL] Expose new ncclConfig_t flags in 2.27 (#155379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155379
Approved by: https://github.com/Skylion007
2025-06-11 18:26:55 +00:00
247f83e0a4 [dynamic shapes] guard individual terms in sym_and; user-code-friendly sym_and/sym_or (#154737)
Previously when processing `sym_and(a, b, c)`, symbolic shapes wouldn't individually process a, b, and c and store their implications. This would lead us to data-dependent error on individual checks, e.g. we stored `u0 >= 0 & u0 <= 10`, but then couldn't figure out `u0 <= 10`.

This handles that, and also makes `sym_and/or` user-code friendly, for testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154737
Approved by: https://github.com/laithsakka
2025-06-11 18:08:06 +00:00
c1cbaca7fd [CI] Move setuptools requirements from conda to pip (#155697)
Needed for `import z5` to work without warning, otherwise
`LoggingTests.test_logs_out` will fail
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155697
Approved by: https://github.com/atalman
ghstack dependencies: #155476, #155493, #155601, #155515
2025-06-11 18:03:18 +00:00
3a43dba21f Revert "[cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)"
This reverts commit dc5e8f7999cccb51efcf0f5fe197a740a918c73d.

Reverted https://github.com/pytorch/pytorch/pull/154170 on behalf of https://github.com/malfet due to It broke ROCM, see c75c732481/1 ([comment](https://github.com/pytorch/pytorch/pull/154170#issuecomment-2963708109))
2025-06-11 18:01:08 +00:00
c75c732481 [CI] Disable ET tests (#155708)
I'm tired of seeing red on PRs and it has been consistently broken since May 30th per 59eb61b2d1/10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155708
Approved by: https://github.com/clee2000, https://github.com/atalman
2025-06-11 17:56:52 +00:00
59eb61b2d1 [inductor] Improve GEMM logging to display batch size for batched operations (#155544)
Improves the GEMM overview logging in PyTorch Inductor to properly display batch size information for batched matrix operations like `torch.bmm` and `torch.baddbmm`.

**Fixes #155307**

## Problem

The current GEMM logging for `torch.bmm` shows:
```python
# Repro
import os
os.environ["TORCH_LOGS"] = "inductor"
import torch

M, N, K = 1024, 1024, 1024
dtype = torch.bfloat16
A = torch.randn(10, M, K, device="cuda", dtype=dtype)
B = torch.randn(10, K, N, device="cuda", dtype=dtype)

compiled_model = torch.compile(torch.bmm, fullgraph=True)
_ = compiled_model(A, B)
```

**Before:**
```
Name                 | M                    | N                    | K                    | Count
----------------------------------------------------------------------------------------------------
aten.bmm             | 1024                 | 1024                 | 1024                 | 1
----------------------------------------------------------------------------------------------------
```

The batch size (10) is missing from the logs, making it unclear what the actual operation dimensions were.

## Solution

**After:**
```
Name                           | B                    | M                    | N                    | K                    | Count
----------------------------------------------------------------------------------------------------------------------------------
aten.bmm                      | 10                   | 1024                 | 1024                 | 1024                 | 1
aten.mm                       | -                    | 1024                 | 1024                 | 1024                 | 2
----------------------------------------------------------------------------------------------------------------------------------
```

## Changes Made

### 1. Enhanced Parsing Logic in compile_fx.py
- Detects batched operations by checking if operation name ends with `'bmm'` or `'baddbmm'`
- For batched operations: takes last 4 parts as `batch, m, n, k`
- For non-batched operations: takes last 3 parts as `m, n, k`
- **Dedicated "B" column**: Added separate column for batch size instead of embedding in operation name
- Shows batch size for batched operations, shows "-" for non-batched operations

### 2. Updated All MM Operations for Consistency
- **bmm.py**:
  - Extract batch size from `mat1.get_size()[0]` for both `tuned_bmm` and `tuned_baddbmm`
  - Use positional counter keys: `aten.bmm_{batch_size}_{m}_{n}_{k}`
  - Enhanced log messages to include batch size information

- **mm.py**: Updated counter keys for consistency:
  - `aten.mm_{m}_{n}_{k}` (no batch dimension)
  - `aten.addmm_{m}_{n}_{k}` (no batch dimension)
  - `aten._int_mm_{m}_{n}_{k}` (no batch dimension)
  - `aten._scaled_mm.default_{m}_{n}_{k}` (no batch dimension)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155544
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
2025-06-11 16:57:40 +00:00
7b7cd56f5e [export] support linear & layer_norm unbacked (#155260)
## What
- use `definitely_contiguous_for_memory_format` instead of `is_contiguous` when the non-contiguous case is fine if we encounter a DDE.
- use ref's contiguous over Aten's contiguous because Aten's version will DDE and stop tracing. ref's version will use `definitely_contiguous_for_memory_format` and clone if there's a DDE.

## Example DDEs

- Fixed with `definitely_contiguous_for_memory_format` in `fast_binary_impl`
```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq((u0//387), 0) (unhinted: Eq((u0//387), 0)).  (Size-like symbols: u0)

Caused by: layer_norm = self.layer_norm(linear)  # caffe2/test/export/test_export.py:4566 in forward (_subclasses/fake_impls.py:1022 in fast_binary_impl)
```

- Fixed with `refs.contiguous` instead of calling aten's contiguous (that'd require a bigger re-write in Aten)
```
  File "c10/core/TensorImpl.h", line 825, in torch::autograd::THPVariable_contiguous(_object*, _object*, _object*)
  File "c10/core/SymbolicShapeMeta.h", line 87, in c10::TensorImpl::is_contiguous_default(c10::MemoryFormat) const
  File "c10/core/SymbolicShapeMeta.cpp", line 250, in c10::SymbolicShapeMeta::init_is_contiguous() const

torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(128*((u0//387)), 0) (unhinted: Eq(128*((u0//387)), 0)).  (Size-like symbols: u0)

Caused by: (_refs/__init__.py:3302 in native_layer_norm)
```

- Fixed with `definitely_contiguous_for_memory_format` in ref's contiguous
```
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression 387*((u0//387)) < 2 (unhinted: 387*((u0//387)) < 2).  (Size-like symbols: u0)

Caused by: (_prims_common/__init__.py:279 in is_contiguous)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155260
Approved by: https://github.com/laithsakka
ghstack dependencies: #155499
2025-06-11 16:47:34 +00:00
b49edc0d6c [Export] Fix some typos in docstring (#155485)
Summary: nit change, fix the doc string

Test Plan:
CI

Rollback Plan:

Differential Revision: D76297740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155485
Approved by: https://github.com/ColinPeppler
2025-06-11 16:44:38 +00:00
18bf6addc4 set_grad_enabled add str and repr for prints (#155681)
Fixes #86718

## Test Result

```python
>>> import torch
>>> torch.set_grad_enabled(False)
torch.autograd.grad_mode.set_grad_enabled(mode=False)
>>> print(torch.set_grad_enabled(False))
torch.autograd.grad_mode.set_grad_enabled(mode=False)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155681
Approved by: https://github.com/soulitzer
2025-06-11 16:01:03 +00:00
dc5e8f7999 [cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)
Fixes https://github.com/pytorch/pytorch/issues/153590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154170
Approved by: https://github.com/malfet
2025-06-11 15:20:48 +00:00
45c5a23237 Revert "Add Intel GPU info collection to the collect env script (#137846)"
This reverts commit 5264f8cd8d08272003298cdefe6bd60b1b8c80b4.

Reverted https://github.com/pytorch/pytorch/pull/137846 on behalf of https://github.com/malfet due to Just testing if it will fix PR time benchmarks signal ([comment](https://github.com/pytorch/pytorch/pull/137846#issuecomment-2963232606))
2025-06-11 15:18:47 +00:00
359e8f5d69 [CI] Use setup-python from test-infra to do MacOS builds (#155515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155515
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/atalman
ghstack dependencies: #155476, #155493, #155601
2025-06-11 15:11:38 +00:00
9328a7fb58 [triton pin][tests] refactor test_triton_kernel.py tests to test new & old API (#155510)
This splits out the tests so we can independently test both the new and old API.

Note: the new API doesn't work yet - we still need to fix those tests.

Differential Revision: [D76318840](https://our.internmc.facebook.com/intern/diff/D76318840)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155510
Approved by: https://github.com/oulgen
2025-06-11 13:52:15 +00:00
4c3da611c2 Add CUDA 12.9.1 x86 nightly binaries (#154980)
Adding CUDA 12.9.1 to nightly binaries matrix for linux (x86) builds.
Add sbsa and libtorch build docker images, builds addition will be follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154980
Approved by: https://github.com/eqy, https://github.com/atalman
2025-06-11 13:43:17 +00:00
013cf1e330 [MPS] Move expm1 op to Metal (#155611)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155611
Approved by: https://github.com/malfet
2025-06-11 13:06:14 +00:00
44df7cf28d [AOTI] Fix embed_kernel_binary error when max_autotune is ON (#155569)
Summary: Stop removing cubin files so that it won't be missing when max_autotune is ON.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155569
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-06-11 12:27:36 +00:00
f34ab1628b [Graph Partition] move cpu scalar tensor to gpu (#154464)
cudagraph does not support cpu tensors. In this PR, we update the graph by explicitly moving cpu tensors to gpu when profitable, relying on graph partition to split off this data copy, and cudagraphifying the remaining gpu ops.

This PR unblocked the graph partition + cudagraph on speech_transformer, leading to 39.5% speedup on inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Close: #119241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154464
Approved by: https://github.com/eellison, https://github.com/mlazos
2025-06-11 10:22:45 +00:00
eaceb243df [BE] Update the XPU support package to 2025.1.3 (#154346)
Fixes #153632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154346
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-06-11 09:46:18 +00:00
2585960b47 remove redundent type_id (#155539)
Those were added in https://github.com/pytorch/pytorch/pull/92229 to prevent confusion of overloads.
but the variants that accepts SymBool are all removed in https://github.com/pytorch/pytorch/pull/112890
with the introduction of SymbolicShapeMeta.
Hence that dummy arg is not needed anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155539
Approved by: https://github.com/ezyang
2025-06-11 08:46:56 +00:00
717a099d42 Revert "[flex attention][triton pin] triton_helpers shim for TMA apis (#154858)" (#155640)
This reverts commit ea7b233015ff00098df687884be4e2efbf7a55fa.

It fails internal tests in fbcode, but they weren't running so we didn't get signal

Reverting w/ a PR/diff because the PR has been landed for ~1 week - too old to revert directly from internal.

Differential Revision: [D76380887](https://our.internmc.facebook.com/intern/diff/D76380887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155640
Approved by: https://github.com/nmacchioni, https://github.com/danzimm
2025-06-11 07:37:47 +00:00
0e2013a12d Add helion x pt2 test (#155513)
This kinda just worked out of the box, shocking. PT2 traced into helion and emitted it as a user defined triton kernel: P1836496774

In the long run, we do not actually want this, but rather to create a helion HOP so we can do fusions etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155513
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-06-11 07:08:06 +00:00
5b9db4335e Include c++ stack traces when we hit constraint violation (#155603)
Example new error message

```
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.

Framework stack:
  File "??", line 0, in _start
  File "", line 0, in __libc_start_main_alias_2
  File "??", line 0, in __libc_start_call_main
  File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 1094, in Py_BytesMain
  File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 357, in pymain_run_file_obj
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 90, in _PyRun_AnyFileObject
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 456, in _PyRun_SimpleFileObject
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1208, in pyrun_file
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1312, in run_mod
  File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1291, in run_eval_code_obj
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 1134, in PyEval_EvalCode
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/scratch/repro.py", line 9, in <module>
    foo(x)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/eval_frame.py", line 699, in compile_wrapper
    return fn(*args, **kwargs)
  File "offloadstuff.c", line 0, in dynamo__custom_eval_frame
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 305, in _PyObject_Call
  File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1469, in __call__
    return self._torchdynamo_orig_callable(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1248, in __call__
    result = self._inner_convert(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 625, in __call__
    return _compile(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1092, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_utils_internal.py", line 97, in wrapper_function
    return function(*args, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 779, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 818, in _compile_inner
    out_code = transform_code_object(code, transform)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1424, in transform_code_object
    transformations(instructions, code_options)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 265, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 743, in transform
    tracer.run()
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 3531, in run
    super().run()
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1359, in run
    while self.step():
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1263, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 422, in impl
    self.push(fn_var.call_function(self, self.popn(nargs), {}))
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function
    return handler(tx, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 792, in <lambda>
    return lambda tx, args, kwargs: obj.call_function(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function
    return handler(tx, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1120, in _handle_insert_op_in_graph
    return wrap_fx_proxy(tx, proxy)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2500, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2566, in wrap_fx_proxy_cls
    return _wrap_fx_proxy(
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2664, in _wrap_fx_proxy
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3205, in get_fake_value
    ret_val = wrap_fake_exception(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 2705, in wrap_fake_exception
    return fn()
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3206, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3373, in run_node
    return node.target(*args, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5917, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 430, in cfunction_vectorcall_FASTCALL
  File "/usr/local/src/conda/python-3.10.16/Objects/abstract.c", line 891, in binary_op1
  File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7284, in slot_nb_multiply
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/descrobject.c", line 344, in method_vectorcall_VARARGS_KEYWORDS
  File "python_variable_methods.cpp", line 0, in _object* torch::autograd::TypeError_to_NotImplemented_<&torch::autograd::THPVariable_mul>(_object*, _object*, _object*)
  File "python_variable_methods.cpp", line 0, in torch::autograd::THPVariable_mul(_object*, _object*, _object*)
  File "??", line 0, in at::_ops::mul_Tensor::call(at::Tensor const&, at::Tensor const&)
  File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
  File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
  File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
  File "PythonFallbackKernel.cpp", line 0, in void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::pythonTLSSnapshotFallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
  File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
  File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
  File "VariableType_0.cpp", line 0, in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::mul_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
  File "VariableType_0.cpp", line 0, in torch::autograd::VariableType::(anonymous namespace)::mul_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
  File "??", line 0, in at::_ops::mul_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
  File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
  File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
  File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
  File "PythonFallbackKernel.cpp", line 0, in (anonymous namespace)::pythonFallback(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
  File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::dispatch(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
  File "??", line 0, in torch::handle_torch_function_no_python_arg_parser(c10::ArrayRef<_object*>, _object*, _object*, char const*, _object*, char const*, torch::TorchFunctionName)
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 577, in PyObject_CallMethod
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1346, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2029, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1442, in _cached_dispatch_impl
    return self._dispatch_impl(func, types, args, kwargs)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2552, in _dispatch_impl
    return maybe_propagate_real_tensors(fast_impl(self, *args, **kwargs))
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 956, in fast_binary_impl
    final_shape = infer_size(final_shape, shape)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 916, in infer_size
    torch._check(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1669, in _check
    _check_with(RuntimeError, cond, message)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1632, in _check_with
    if expect_true(cond):
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1686, in expect_true
    return a.node.expect_true(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 552, in expect_true
    return self.guard_bool(file, line)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 536, in guard_bool
    r = self.evaluate()
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 510, in evaluate
    return self.shape_env.evaluate_sym_node(self, size_oblivious)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7113, in evaluate_sym_node
    return self.evaluate_expr(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper
    return retlog(fn(*args, **kwargs))
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7215, in evaluate_expr
    return self._inner_evaluate_expr(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper
    return retlog(fn(*args, **kwargs))
  File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7238, in _inner_evaluate_expr
    return self._evaluate_expr(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7505, in _evaluate_expr
    self._maybe_guard_rel(g)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6758, in _maybe_guard_rel
    self._refine_ranges(expr)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7709, in _refine_ranges
    self._set_replacement(
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6667, in _set_replacement
    self.framework_specialization_stacks[source] = CapturedTraceback.extract(cpp=True)
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
  File "/home/bobren/local/a/pytorch/torch/utils/_traceback.py", line 207, in extract
    torch._C._profiler.gather_traceback(python=True, script=script, cpp=cpp),
  File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
  File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
  File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 543, in cfunction_call
  File "offloadstuff.c", line 0, in pybind11::cpp_function::dispatcher(_object*, _object*, _object*)
  File "offloadstuff.c", line 0, in pybind11::cpp_function::initialize<std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback>, bool, bool, bool, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback> (*)(bool, bool, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&)
  File "??", line 0, in torch::CapturedTraceback::gather(bool, bool, bool)
  File "??", line 0, in torch::unwind::unwind()

User stack:
  File "/home/bobren/local/a/pytorch/scratch/repro.py", line 5, in foo
    return torch.randn(5) * x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155603
Approved by: https://github.com/zou3519, https://github.com/cyyever
ghstack dependencies: #155133
2025-06-11 05:00:36 +00:00
84c14361c2 [ez][AOTI] Add test for std::nullopt return in custom op (#155636)
Summary: As title. Follow up of https://github.com/pytorch/pytorch/pull/154286

Test Plan:
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_nullopt_output

Rollback Plan:

Differential Revision: D76378892

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155636
Approved by: https://github.com/zou3519, https://github.com/cyyever
2025-06-11 03:52:31 +00:00
1e690b6c41 Replace TORCH_INTERNAL_ASSERT with TORCH_CHECK in set_history (#155453)
Fixes #154357

## Test Result

```bash
>>> import torch
>>>
>>> x = torch.tensor(1, device=torch.device('cpu'))
>>> y = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
>>> z0 = (x.abs() * y).prod(dtype=torch.int16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Autograd not support dtype: Short
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155453
Approved by: https://github.com/albanD, https://github.com/soulitzer
2025-06-11 03:46:48 +00:00
110ae0f433 Custom Op handle 1-element tuples (#155447)
Fixes #150472

Modification of [PR 151408](https://github.com/pytorch/pytorch/pull/151408). This PR modifies the return parsing in `infer_schema` to handle the case of a Tuple with a single element.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155447
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2025-06-11 03:43:40 +00:00
a2b0b2698d inductor codecache: include private inductor configs in cache key (#153672)
Fixes https://github.com/pytorch/torchtitan/issues/1185

It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.

I'm not sure how worried we should be on the blast radius of this change. On the one hand:

(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)

(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen
2025-06-11 01:33:24 +00:00
5264f8cd8d Add Intel GPU info collection to the collect env script (#137846)
As title, add Intel GPU info collection to the collect env script

Output examples:
1. CPU on Windows
```
C:\Users\user\miniforge3\envs\py310\lib\site-packages\torch\_subclasses\functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise (10.0.22631 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Name: 12th Gen Intel(R) Core(TM) i7-1270P
Manufacturer: GenuineIntel
Family: 198
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 1711
MaxClockSpeed: 2200
L2CacheSize: 9216
L2CacheSpeed: None
Revision: None

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] torch                     2.8.0.dev20250528+cpu          pypi_0    pypi
```

2. XPU on Windows
```
Collecting environment information...
PyTorch version: 2.8.0a0+gitef6306e
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro (10.0.19045 64-bit)
GCC version: (GCC) 13.1.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* 32.0.101.6795 (20250520000000.******+***)
Intel GPU models onboard:
* Intel(R) Arc(TM) A770 Graphics
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.33184', total_memory=15915MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 2401
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU1
CurrentClockSpeed: 2200
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.8.10+gitb3ea3a1
[pip3] numpy==2.1.2
[pip3] optree==0.13.1
[pip3] pytorch-triton-xpu==3.3.1+gitb0e26b73
[pip3] torch==2.8.0a0+gitef6306e
[conda] intel-extension-for-pytorch 2.8.10+gitb3ea3a1          pypi_0    pypi
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] mkl-dpcpp                 2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-datafitting   2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-stats         2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-vm            2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.1+gitb0e26b73          pypi_0    pypi
[conda] torch                     2.8.0a0+gitef6306e          pypi_0    pypi
```

3. CPU on Linux
```
/opt/python/cp312-cp312/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: AlmaLinux 8.10 (Cerulean Leopard) (x86_64)
GCC version: (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.28                                                                                                                                                                                                                                                                                                Python version: 3.12.10 (main, Apr 19 2025, 05:03:56) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] (64-bit runtime)                                             Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              88
On-line CPU(s) list: 0-87
Thread(s) per core:  2
Core(s) per socket:  22
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz
Stepping:            7
CPU MHz:             1000.000
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-21,44-65
NUMA node1 CPU(s):   22-43,66-87
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] Could not collect
```

5. XPU on Linux
```
Collecting environment information...
PyTorch version: 2.8.0.dev20250516+xpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.50-051550-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* intel_opencl: 24.39.31294.21-1032~22.04
* level_zero:   1.17.44.0-1022~22.04
Intel GPU models onboard:
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [1] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [2] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [3] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [4] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [5] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [6] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [7] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          224
On-line CPU(s) list:             0-223
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8480+
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              56
Socket(s):                       2
Stepping:                        6
CPU max MHz:                     3800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       5.3 MiB (112 instances)
L1i cache:                       3.5 MiB (112 instances)
L2 cache:                        224 MiB (112 instances)
L3 cache:                        210 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-55,112-167
NUMA node1 CPU(s):               56-111,168-223
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] pytorch-triton-xpu==3.3.0+git0bcc8265
[pip3] torch==2.8.0.dev20250516+xpu
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] numpy                     2.2.5                    pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.0+git0bcc8265          pypi_0    pypi
[conda] torch                     2.8.0.dev20250516+xpu          pypi_0    pypi
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137846
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-11 01:22:06 +00:00
3040ca6d0f [Cutlass] Include fp8 headers in aoti cpp wrapper (#155173)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155173
Approved by: https://github.com/desertfire
ghstack dependencies: #154829, #154835, #155195
2025-06-11 01:21:16 +00:00
1ed243f01c Add missing attr access check for legacy autograd.Function (#155055)
Fixes https://github.com/pytorch/pytorch/issues/154981
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155055
Approved by: https://github.com/albanD
ghstack dependencies: #154509, #154852
2025-06-11 01:03:49 +00:00
5dd07c70e5 [dynamo] added github_cli to detect unimplemented_v2 calls (#155610)
This PR adds the workflow of whenever a dev makes a PR that contains files under torch/_dynamo, we check for any unimplemented_v2() callsites and if any of them have been modified in some sort of way, the workflow fails and lists them exactly which callsites and let's them know what the command lines are to update the registry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155610
Approved by: https://github.com/williamwen42
2025-06-11 00:40:56 +00:00
3580b8dde4 [BE] Mention debug=True in AC error messages (#155593)
See https://github.com/pytorch/pytorch/issues/155171#issuecomment-2949415407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155593
Approved by: https://github.com/janeyx99
2025-06-11 00:32:41 +00:00
dbec08bc1c Changes to HFStorageWriter to support saving shards of tensors (#154742) (#155566)
Summary:

As we move towards supporting saving partial tensors natively with HFStorageWriter, there are some simple changes that need to be made to make this happen.
- The current approach for distributed writes is that every rank has full tensors, but we split up the writing of these full tensors across all available ranks. We're removing this logic that was in the HFSavePlanner and instead assuming that every rank has a shard and saving every rank's local state
    -  as a result we can probably remove the HFSavePlanner, but keeping it as a placeholder for now

- the current naming of files doesn't support shards as its in the format "model-00001-of-00004.safetensors", but if every rank is writing the same file names they will overwrite eachother, so this adds a shard-00001 prefix, so that the rank files don't overwrite eachother
- don't save the metadata file models.safetensors.index.json if sharding is enabled. This file expects a 1 to 1 ratio between tensor and filename, but this doesn't make sense in the sharded saving approach, so we can just get rid of this file
- make the "fqn_to_file_index" map optional. This is to describe which files to save which tensors in, but if users don't want to provide this, we can just save all the tensors to one file. If they run into issues, they can choose how to split up their tensors to be more friendly with 5GB HF remote storage file size soft limit.

Test Plan: test_hf_storage.py

Reviewed By: saumishr

Differential Revision: D75099862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155566
Approved by: https://github.com/saumishr
2025-06-10 23:37:47 +00:00
2161be8497 Move unary trig ops to metal kernels (#154465)
Move inverse trig unary ops, sinh, & cosh to metal kernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154465
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-10 22:56:59 +00:00
c4b93e6579 Replace frame_traced_fn hook with get_traced_code() util (#155249)
#153622 introduced a hook for getting the relevant code objects after frame tracing. The idea is to have vLLM use this instead of monkey-patching `inline_call_()` to determine the source code files to hash. Unfortunately, the hook runs too late; the vLLM backend needs access to the set of source code filenames while it's running.

This PR replaces the newly-added hook with a utility function that a backend can call to get this information. I've made the change in vLLM and can verify that this allows the information to be queried at the right time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155249
Approved by: https://github.com/zou3519
2025-06-10 22:40:58 +00:00
8892b782a8 [nativert] move execution planner to torch (#155374)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revidsion: D76167093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155374
Approved by: https://github.com/zhxchen17
2025-06-10 22:36:06 +00:00
ffc6cbfaf7 [symm_mem] Move all symm mem code into a dedicated folder (#155573)
We arrive at a point when so many files are related to symmetric memory and files are scattered around in the cpp side. Let's first put all related code (symmetric memory related) into a separate folder. We can do further refactoring later if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155573
Approved by: https://github.com/fegin, https://github.com/d4l3k
2025-06-10 22:30:11 +00:00
3e131f7779 [CI] Move tlparse to requirements files (#155601)
Not sure why we had it that way to begin with
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155601
Approved by: https://github.com/seemethere
ghstack dependencies: #155476, #155493
2025-06-10 22:25:47 +00:00
94da4523ec Disable foreach tests that depend on profiler for CUDA 12.6 (#155596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155596
Approved by: https://github.com/clee2000, https://github.com/malfet
2025-06-10 22:21:06 +00:00
672ac2ec86 Reapply "Cleanup VS 2019 refs in pytorch (#145863)" (#152613) (#155478)
This reverts commit e4f2282.
I believe fix PR was landed https://github.com/pytorch/pytorch/pull/153480 that triggered the revert.
Hence this is reland.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155478
Approved by: https://github.com/malfet
2025-06-10 22:20:14 +00:00
3b7c5e6fa5 Revert "[inductor][triton pin] TMA shim refactor & mm, mm_scaled_grouped support (#155182)"
This reverts commit b07725a9516028a485153c4b5356b3e33b990f81.

Reverted https://github.com/pytorch/pytorch/pull/155182 on behalf of https://github.com/davidberard98 due to fails on triton 3.2 (internally) ([comment](https://github.com/pytorch/pytorch/pull/155182#issuecomment-2960664845))
2025-06-10 21:53:01 +00:00
d2f06d2b06 [logs] Change autotune data into separate items (#155525)
Summary: Split the autotune data into multiple keys and items : this is better for storage of the data and easier querying.

Test Plan:
```
 TORCHINDUCTOR_MAX_AUTOTUNE=1 tlp buck run (sample)
```

Rollback Plan:

Differential Revision: D76303514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155525
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2025-06-10 21:47:07 +00:00
14f3639e09 Convert to .md: onnx_verification.rst, onnx.rst, package.rst, (#155556)
Fixes https://github.com/pytorch/pytorch/issues/155031

* [onnx_verification.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_verification.rst)
* [onnx.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx.rst)

* [package.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/package.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155556
Approved by: https://github.com/AlannaBurke, https://github.com/sekyondaMeta
2025-06-10 21:40:40 +00:00
ae0f1f8984 Convert to markdown onnx rst (#155228)
Fixes #155030

Converts the following files to MyST markdown and ensure that the doc tests are green:

- [x] [onnx_dynamo_onnxruntime_backend.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_dynamo_onnxruntime_backend.rst)
- [x] [onnx_dynamo.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_dynamo.rst)
- [x] [onnx_ops.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_ops.rst)
- [onnx_torchscript_supported_aten_ops.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_torchscript_supported_aten_ops.rst) - not changed as it is autogenerated
- [onnx_torchscript.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/onnx_torchscript.rst) - fixed in #155390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155228
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-10 21:33:07 +00:00
7a03b0d2ca [BE] Remove CUDA 11 artifacts. Fix Check Binary workflow (#155555)
Please see: https://github.com/pytorch/pytorch/issues/147383

1. Remove CUDA 11 build and test artifacts. One place CUDA 12.4
2. Fix Check Binary Workflow to use Stable Cuda version variable rather then hardcoded one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155555
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-06-10 21:32:08 +00:00
40fefe2871 Revert "[BE] Update cudnn to 9.10.1.4 (#155122)"
This reverts commit 73220d52fd67b5f4f5b15e0e0433e09733c93f31.

Reverted https://github.com/pytorch/pytorch/pull/155122 on behalf of https://github.com/atalman due to wrong pr description ([comment](https://github.com/pytorch/pytorch/pull/155122#issuecomment-2960592004))
2025-06-10 21:13:18 +00:00
8a396c5635 DOC: Convert to markdown: torch.compiler_best_practices_for_backends.rst, torch.compiler_cudagraph_trees.rst, torch.compiler_custom_backends.rst, torch.compiler_dynamic_shapes.rst, torch.compiler_dynamo_deepdive.rst (#155137)
Fixes #155037

[torch.compiler_best_practices_for_backends.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/torch.compiler_best_practices_for_backends.rst) shows error 404

cc  @svekars @sekyondaMeta @AlannaBurke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155137
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-10 20:51:05 +00:00
01b8f5e685 Convert to markdown: testing.rst, threading_environment_variables.rst, torch_cuda_memory.rst, torch_environment_variables.rst, torch_nccl_environment_variables.rst (#155523)
Fixes #155035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155523
Approved by: https://github.com/AlannaBurke, https://github.com/svekars
2025-06-10 20:38:36 +00:00
545fbd58dc [export] inline jit.scripted function in export (#155180)
When we export a scripted function, we inline the original callable stored in "_torchdynamo_inline", this is the same strategy as torch.compile path.

We do the same thing for script method, where a "\_\_wrapped\_\_" attribute points to the original callable in most cases. There are some corner cases we identified: top-level jit.scripted modules' method doesn't have a \_\_wrapped\_\_. In this case, we fall back to the original scripted approach. Maybe there're more such cases but need verification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155180
Approved by: https://github.com/zou3519
2025-06-10 20:34:12 +00:00
a666cf3b38 [xla hash update] update the pinned xla hash (#154348)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154348
Approved by: https://github.com/pytorchbot
2025-06-10 20:33:31 +00:00
c9404faacb [refactor] is_known_channels_last_contiguous* -> definitely_channels_last_contiguous* (#155499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155499
Approved by: https://github.com/laithsakka
2025-06-10 20:29:46 +00:00
94763f5ca7 [ROCm][Inductor][CK] add kBatch as runtime parameter to CK-tile gemms (#155389)
Similar to old-CK gemms

### Testing

Rely on existing coverage in `test/inductor/test_ck_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155389
Approved by: https://github.com/chenyang78
2025-06-10 20:25:02 +00:00
ab51a93737 [CI] Set PATH during build to include location of sccache wrapped nvcc (#155464)
Sccache wasn't working for nvcc on jammy, so manually set the path to include where nvcc is

I had problems with always making nvcc a wrapper in some inductor tests where I got
```
sccache: encountered fatal error
sccache: error: PCH not supported by nvcc
sccache: caused by: PCH not supported by nvcc
```
and I also got an error (only on clang) when trying to set CMAKE_CUDA_COMPILER_LAUNCHER to /opt/cache/bin/sccache or sccache
```
ccache: error: failed to execute compile
    sccache: caused by: Compiler not supported: "nvcc warning : Support for offline compilation for architectures prior to \'<compute/sm/lto>_75\' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).\nnvcc fatal   : Failed to preprocess host compiler properties.\n"
```

Non jammy cuda jobs' docker images used a different dockerfile, which set CMAKE_CUDA_COMPILER_LAUNCHER e895e9689c/.ci/docker/ubuntu-cuda/Dockerfile (L110)

Alt solution:
Given that I only get the error on clang, I could set CMAKE_CUDA_COMPILER_LAUNCHER=sccache only when not using clang

Setting CUDA_NVCC_EXECUTABLE doesn't fail but also doesn't result in cache hits/misses

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155464
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-06-10 20:23:33 +00:00
35e8f2593c [CUDA] Fix missing bounds check in Softmax.cu (#154778)
Uncovered by @ngimel, same as change in #144009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154778
Approved by: https://github.com/ngimel, https://github.com/cyyever, https://github.com/malfet
2025-06-10 20:03:54 +00:00
0ca2a79f5b [TEST] Modernize test_sort_large (#155546)
Since its introduction ~4 years ago, the test `test_sort_large` has always been deselected because it requires 200GB of CUDA memory. Now, as we do have GPUs this big, it gets selected, but fails with `var_mean` not being a member if `torch.Tensor` and `var_mean` accepting only floating point tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155546
Approved by: https://github.com/eqy
2025-06-10 19:59:12 +00:00
ea23eb4b98 [ez][CI] Reuse old whl: turn off on releases, add docs files to ok list (#155346)
Add docs/**/*.md and docs/**/*.rst to files that are ok to reuse old whls

Prevent using old whls on release branches

Move check for changed files earlier to reduce api usage?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155346
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-06-10 19:57:40 +00:00
8a22551300 Fixes OpInfo gradient checks for ctc_loss (#154590)
Fixes #67462

Re-enables `OpInfo` gradient checks for the restricted scenarios where the current `ctc_loss` implementation is accurate and consistent.

The desired `ctc_loss` gradient behavior appears to be an ongoing discussion, see
https://github.com/pytorch/pytorch/issues/52241. The `OpInfo` gradient checks can be updated if/as the underlying implementation advances.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154590
Approved by: https://github.com/soulitzer
2025-06-10 19:56:39 +00:00
954ce94950 Add __main__ guards to quantization tests (#154728)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

In quantization tests:

- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154728
Approved by: https://github.com/ezyang
2025-06-10 19:46:07 +00:00
07eb374e7e [dynamo] Avoid unncessary caching source codegen (#155376)
We only need to cache a source (e.g., `x.y.z`) into a temporary local if
it's used multiple times in the codegen, otherwise we'd just be creating
redundant `DUP` and `STORE_FAST tmp_...` instructions, which might
degrade perf and definitely makes generated bytecode harder to read.

Example:
```python
import torch

@torch.compile(backend="eager")
def fn(x, y):
    return x + y

fn(torch.ones(2), torch.ones(1))
```

Original bytecode:
```verbatim
[0/0] [__bytecode]   3           0 RESUME                   0
[0/0] [__bytecode]
[0/0] [__bytecode]   5           2 LOAD_FAST                0 (x)
[0/0] [__bytecode]               4 LOAD_FAST                1 (y)
[0/0] [__bytecode]               6 BINARY_OP                0 (+)
[0/0] [__bytecode]              10 RETURN_VALUE
```

Modified bytecode (before this patch):
```verbatim
[__bytecode]   3           0 RESUME                   0
[__bytecode]               2 LOAD_GLOBAL              1 (NULL + __compiled_fn_1_578c8d9a_2a9b_4d15_bac7_267591cdee32)
[__bytecode]              14 LOAD_FAST                0 (x)
[__bytecode]              16 COPY                     1
[__bytecode]              18 STORE_FAST               3 (tmp_1)
[__bytecode]              20 LOAD_FAST                1 (y)
[__bytecode]              22 COPY                     1
[__bytecode]              24 STORE_FAST               4 (tmp_2)
[__bytecode]              26 PRECALL                  2
[__bytecode]              30 CALL                     2
[__bytecode]              40 STORE_FAST               2 (graph_out_0)
[__bytecode]              42 LOAD_FAST                2 (graph_out_0)
[__bytecode]              44 LOAD_CONST               1 (0)
[__bytecode]              46 BINARY_SUBSCR
[__bytecode]              56 DELETE_FAST              2 (graph_out_0)
[__bytecode]              58 RETURN_VALUE
```

Modified bytecode (after this patch):
```verbatim
[__bytecode]   3           0 RESUME                   0
[__bytecode]               2 LOAD_GLOBAL              1 (NULL + __compiled_fn_1_2c498af2_ce5c_49cb_abba_a0c7489b09ce)
[__bytecode]              14 LOAD_FAST                0 (x)
[__bytecode]              16 LOAD_FAST                1 (y)
[__bytecode]              18 PRECALL                  2
[__bytecode]              22 CALL                     2
[__bytecode]              32 STORE_FAST               2 (graph_out_0)
[__bytecode]              34 LOAD_FAST                2 (graph_out_0)
[__bytecode]              36 LOAD_CONST               1 (0)
[__bytecode]              38 BINARY_SUBSCR
[__bytecode]              48 DELETE_FAST              2 (graph_out_0)
[__bytecode]              50 RETURN_VALUE
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155376
Approved by: https://github.com/williamwen42
2025-06-10 19:38:15 +00:00
91ee9ee82d [MPS][BE] Refactor round_decimals shader code to leverage new macro (#155316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155316
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #155304
2025-06-10 19:29:57 +00:00
b1b8e57cda Add __main__ guards to ao tests (#154612)
This is the first PR of a series in an attempt to get the content of #134592 merged as smaller PRs (Given that the original one was closed due to a lack of reviewers).

This specific PR contains:
- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Update ao tests.

There will be follow up PRs to update the other test suites but I don't have permissions to create branches directly on pytorch/pytorch so I can't create a stack and therefore will have to create them one at the time.

Cc @jerryzh168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154612
Approved by: https://github.com/jcaip
2025-06-10 18:33:09 +00:00
0b677560e6 [inductor] use int64 for large index (#154575)
Split reduction may need add an extra mask to avoid invalid index. Previously we always uses torch.int32 dtype. That causes problem when the tensor numel exceeds 2^31.

Fix https://github.com/pytorch/pytorch/issues/154168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154575
Approved by: https://github.com/ngimel, https://github.com/jansel
2025-06-10 18:30:43 +00:00
0f47e76937 [MPS] Implement hardshrink metal kernel (#155304)
Implements the forward and backward hardshrink operators as Metal kernels.
In order to support the lambda parameter, we extend the `exec_unary_kernel`  and `exec_binary_kernel` methods. Now they take an optional Scalar and an optional ScalarType argument. When the optional ScalarType is provided, it overrides the type of the Scalar.
We add a new `REGISTER_UNARY_ALPHA_OP` macro, and modify the existing `REGISTER_BINARY_ALPHA_OP` to support the new feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155304
Approved by: https://github.com/malfet
2025-06-10 18:20:27 +00:00
8347268edc Revert "Make open device registration tests standalone (#153855)"
This reverts commit 8823138e47a3200c313f6bf2d21eb689d8150f39.

Reverted https://github.com/pytorch/pytorch/pull/153855 on behalf of https://github.com/clee2000 due to causing some linux aarch64 tests to fail [GH job link](https://github.com/pytorch/pytorch/actions/runs/15566289293/job/43832373302) [HUD commit link](8823138e47), should be easy fix, rename in places where its mentioned, there might be more than just aarch64 though ([comment](https://github.com/pytorch/pytorch/pull/153855#issuecomment-2960191503))
2025-06-10 18:11:24 +00:00
cb9b479f4f XPU enable XCCL by default (#154963)
Enable USE_XCCL=ON by default when building PyTorch XPU binary, which is on par with NCCL for PyTorch CUDA binary build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154963
Approved by: https://github.com/cyyever, https://github.com/guangyey, https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-10 17:56:13 +00:00
0b6c0898e6 [dynamo] added additional_info functionality (#155526)
There is now functionality for the developer to add a  --additional-info arg to the add and update dev terminal command to include any additional info the dev might want to remark about the graph break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155526
Approved by: https://github.com/williamwen42
2025-06-10 17:40:50 +00:00
8823138e47 Make open device registration tests standalone (#153855)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153855
Approved by: https://github.com/janeyx99
2025-06-10 17:33:26 +00:00
c88e87f355 [Inductor] Set Triton Allocator Function For Use with New TMA API in Inductor (#155373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155373
Approved by: https://github.com/davidberard98
2025-06-10 17:09:04 +00:00
73220d52fd [BE] Update cudnn to 9.10.1.4 (#155122)
Follow up to #152782
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155122
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/eqy
2025-06-10 16:59:00 +00:00
38c4d05535 [precompile] Ensure @disable()-ed function won't trigger recompile from precompile bytecode. (#155363)
In a precompiled bytecode, it looks like the following:
```
pre-graph bytecode
...
compiled graph code
...
post-graph bytecode
```

In pre-graph bytecode we have calls into helper functions like torch._dynamo.utils.call_size which will invoke @disable inside the bytecode.

Normally torch.compile() will handle these frames fine, but for precompile we will load bytecode from a clean state of dynamo and we want a way to assert recompile never happen, so the current way to ensure this is by doing set_stance("fail_on_recompile") (open to any other idea to test this, but IMO this is the closest thing we have today).

This approach doesn't work when util functions like call_size() is involved and this PR fixes a bunch of places to make sure "fail_on_recompile" can skip through the functions meant to be skipped during compilation.

Differential Revision: [D76156867](https://our.internmc.facebook.com/intern/diff/D76156867/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155363
Approved by: https://github.com/jamesjwu, https://github.com/jansel
ghstack dependencies: #155329
2025-06-10 16:13:38 +00:00
ddee927f31 [precompile] Add low level C API to load precompiled dynamo code on functions. (#155329)
While loading deserialized dynamo states back from disk, precompile will need a direct way to access ExtraState and populate guarded bytecode as cache entries.

This diff adds two API at code level to load precompiled guard + bytecode entries.
1. _load_precompile_entry() will append an entry to a precompile entry list per code object. This precompile entry will be looked up before normal compiled entries.
2. _reset_precompile_entries() will clean up all the installed existing entries. This is useful to prevent a case where user call loading multiple times and explode the number of entries on the list.

Differential Revision: [D76083247](https://our.internmc.facebook.com/intern/diff/D76083247/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155329
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-06-10 16:13:38 +00:00
e8d29c45e0 [ROCm][TunableOp] Unit test to verify that there is only one kernel launch per PyTorch API invocation. (#155077)
TunableOp UT covers breakage that was fixed in this PR: https://github.com/pytorch/pytorch/pull/153764

After tuning is complete, verify that there is only one kernel launch. for each PyTorch API invocation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155077
Approved by: https://github.com/jeffdaily
2025-06-10 16:11:43 +00:00
08d15d3ec1 [Docs] Convert to markdown: torch.compiler_troubleshooting.rst (#155514)
Part of changes #155040 (parent PR #155120)

Follow-up of #155351. I split the changes of `torch.compiler_troubleshooting.rst ` into #155351 and this PR due to the 2000-line limit in one PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155514
Approved by: https://github.com/svekars
2025-06-10 15:41:31 +00:00
eb152ab1dd Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit 060838c2312ad207c7afe2c86f8a484afea5f328.

Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/clee2000 due to broke a bunch of tests internally D76299454, probably also broke rocm inductor/test_analysis.py::TestAnalysisCUDA::test_augment_trace_against_flop_counter_maxat0_cuda_float16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15545277599/job/43766911025) [HUD commit link](060838c231) ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-2959747153))
2025-06-10 15:38:40 +00:00
b44306d368 Add dont constant fold flag (#154945)
For support https://github.com/pytorch/ao/issues/2228
> What we want to do now is to enable FP8 quantization in PyTorch. And similar as INT8 quantization, we need to insert quantize and dequantize ops into the graph.
>
> However we met problems with these q/dq ops both in the PyTorch core and Torchao.
>
> PyTorch core:
>
> The quantize_per_tensor op does not support FP8. We want to fix it via https://github.com/pytorch/pytorch/pull/153601. And as you commented, the op is deprecated.
> Torchao:
>
> In the fusion pass in Inductor, we want to match the pattern fp8_weight -> torchao.dequantize_affine_float8 -> fp32_op and fuse it as fp8_weight -> weight_pack -> fp8_op. We have done so for INT8 PT2E quantization. However, the pattern matching pass is applied after a constant folding pass in Inductor:
> 100ec0b34a/torch/_inductor/fx_passes/freezing_patterns.py (L69C1-L74C1)
> After constant_fold(gm), the pattern will be folded as fp32_weight -> fp32_op. Then the original pattern cannot be found any more and the FP8 semantics is lost since the pattern is entirely in fp32 now.
> For INT8, the int8_weight -> quantized_decomposed.dequantize_per_channel -> fp32_op pattern won't be folded because we mark quantized_decomposed.dequantize_per_channel impure so that it won't be folded: 100ec0b34a/torch/_inductor/constant_folding.py (L139C1-L149C1) . But for the torchao.dequantize_affine_float8, we cannot do this because
> It is an op from Torchao, which is unknown to the constant folder
> It is decomposed to smaller ops, so we cannot put it in the list as a single op.
> So, we think an easy and short-term solution is to modify the ops in PyTorch core via https://github.com/pytorch/pytorch/pull/153601.
> However, if we want to resolve the issue with Torchao, we need to
> Add a method in the constant folder in Inductor to allow registration of impure ops

Based on [Jansel‘s reply](https://github.com/pytorch/ao/issues/2228#issuecomment-2914560340), add dont constant fold flag on this patch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154945
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-10 14:52:26 +00:00
68f36683f0 [Testing] Add more models to MPSInductor tests (#155494)
Enable all 46 HuggingFace models, only GPT2ForSequenceClassification fails to compile with a rather strange `array subscript is not an integer` error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155494
Approved by: https://github.com/dcci
ghstack dependencies: #155476, #155493
2025-06-10 14:40:59 +00:00
c8d39a1045 [docs] Add docstring indicating UB for converting inf to int (#154781)
Fixes #154724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154781
Approved by: https://github.com/malfet
2025-06-10 14:04:50 +00:00
805297981a Revert "[Testing] Add more models to MPSInductor tests (#155494)"
This reverts commit f154f9b3040369a7979d5de7acb6fe21433eda83.

Reverted https://github.com/pytorch/pytorch/pull/155494 on behalf of https://github.com/malfet due to I'm blind ([comment](https://github.com/pytorch/pytorch/pull/155494#issuecomment-2959319787))
2025-06-10 13:45:32 +00:00
e53ddaf1f6 Adapt dtensor tests to be device agnostic (#154840)
##MOTIVATION
This PR includes minor changes to skip some unsupported tests on Intel Gaudi devices as well as to make some of the tests more device agnostic.
Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66

##CHANGES
- test_dtensor_compile.py : Make some of the tests device agnostic . ( Replace "cuda" hard codings with self.device_type)
- test_dtensor.py and test_comm_mode_features.py: Skip some tests which are unsupported on Intel Gaudi devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154840
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD
2025-06-10 12:43:16 +00:00
f154f9b304 [Testing] Add more models to MPSInductor tests (#155494)
Enable all hugging face models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155494
Approved by: https://github.com/dcci
ghstack dependencies: #155476, #155493
2025-06-10 12:30:38 +00:00
75f258dd1f Fix spelling mistake (#155495)
Summary: Change "primtivies" to "primitives".

Test Plan:
n/a

Rollback Plan:

Differential Revision: D76229938

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155495
Approved by: https://github.com/angelayi, https://github.com/cyyever
2025-06-10 09:06:58 +00:00
a205e8fd73 Apply all replacements on backward graph args during inductor codegen. (#155469)
Summary: temporary mitigation for https://github.com/pytorch/pytorch/issues/155468

Test Plan:
NA

Rollback Plan:

Differential Revision: D76096355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155469
Approved by: https://github.com/bobrenjc93
2025-06-10 08:56:18 +00:00
5116293f7e [XPU] Split triton version as 2 files to decouple triton version bump (#155313)
Triton XPU shares its version file with the community one. When the community updates Triton version, it will temporarily break the XPU CI/CD because they use different repositories and commits. To decouple Triton version bumps between the community and XPU, we propose splitting the version into two separate files.

Refer the latest community triton version bump PR #153117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155313
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/atalman
2025-06-10 08:49:03 +00:00
2cdcd16e83 [Easy] update pip sources for CUDA in nightly pull tool (#149143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149143
Approved by: https://github.com/ezyang, https://github.com/cyyever
ghstack dependencies: #145685
2025-06-10 08:07:30 +00:00
0319044e92 [Easy] update pip sources for ROCm in nightly pull tool (#145685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145685
Approved by: https://github.com/ezyang
2025-06-10 08:07:30 +00:00
9d2d227003 [CI] Fix XPU runner setup status issue (#155443)
Flow with PR #155194, fix the timeout exit code issue refer https://github.com/pytorch/pytorch/actions/runs/15526078422/job/43706927778?pr=154962#step:3:74
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155443
Approved by: https://github.com/etaf, https://github.com/atalman, https://github.com/EikanWang
2025-06-10 08:06:37 +00:00
5dfe1787b5 [Inductor] Limit fusions to a node distance of 64 (#154688)
fix for https://github.com/pytorch/pytorch/issues/154652 and https://fb.workplace.com/groups/1075192433118967/permalink/1484799079148049/

[window 128 dashboard run here w/ no regressions](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sun%2C%2001%20Jun%202025%2006%3A38%3A41%20GMT&stopTime=Sun%2C%2008%20Jun%202025%2006%3A38%3A41%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=mlazos/fuse-window&lCommit=8576f00ebfa53567d7bddc89d9882df9eb990561&rBranch=main&rCommit=9d59b516e9b3026948918e3ff8c2ef55a33d13ad)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154688
Approved by: https://github.com/eellison, https://github.com/Raymo111
2025-06-10 07:32:23 +00:00
8b8684466a Add a stub AGENTS.md for Codex (#155459)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155459
Approved by: https://github.com/albanD, https://github.com/malfet
2025-06-10 07:20:21 +00:00
b07725a951 [inductor][triton pin] TMA shim refactor & mm, mm_scaled_grouped support (#155182)
Follow-up to #154858.

Triton 3.4 will provide a different API for TMA compared to Triton 3.3; the TMA shim in triton_helpers dispatches to the correct API.

First, this refactors the TMA shim to drop args that aren't supported from Triton 3.2 to Triton 3.4: in particular, strides (Triton 3.2 version doesn't accept non-contiguous inputs, so we just infer contiguous strides in Triton 3.4) and element_ty (Triton 3.4 doesn't support this arg, so in Triton 3.2 we just infer it from base_ptr).

Second, this updates mm.py & mm_scaled_grouped.py to use the TMA shim.

Differential Revision: [D76318784](https://our.internmc.facebook.com/intern/diff/D76318784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155182
Approved by: https://github.com/drisspg
2025-06-10 06:48:42 +00:00
8153340d10 [CI/CD] Remove CUDA 11.8 builds (#155509)
This removes CUDA 11.8 from CI/CD
Please see: https://github.com/pytorch/pytorch/issues/147383

TODO: Will followup of cleaning CUDA 11.8 config from scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155509
Approved by: https://github.com/cyyever, https://github.com/huydhn, https://github.com/malfet
2025-06-10 05:16:41 +00:00
671a9d175b Add warning for module full backward hook when no input requires gradient (#155339)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155339
Approved by: https://github.com/Skylion007
2025-06-10 04:42:06 +00:00
e25ce0f928 [invoke_subgraph] Use eager input vals to constrain input strides (#155291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155291
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-06-10 04:06:09 +00:00
660695f11d Revert "Move non inductor workflows cuda 12.6->cuda 12.8 (#155234)"
This reverts commit ede6ead8cd8e925cb093f2b3016342e645bd728d.

Reverted https://github.com/pytorch/pytorch/pull/155234 on behalf of https://github.com/clee2000 due to causing a bunch of tests to fail?  ex test_nn.py::TestNNDeviceTypeCUDA::test_variable_sequence_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15545607752/job/43773157441) [HUD commit link](ede6ead8cd), some of the failures attributed to broken trunk on friday seem real? ([comment](https://github.com/pytorch/pytorch/pull/155234#issuecomment-2957578769))
2025-06-10 03:34:36 +00:00
76644c9ff5 Make require_contiguous require exact strides instead of stride order (#148424)
Make `require_contiguous` require exact strides instead of stride order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148424
Approved by: https://github.com/eellison

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-06-10 02:36:18 +00:00
b916d8a583 [Testing] Shard MacOS inductor perf tests (#155493)
One dtype per shard, as current job takes 2+ hours to finish
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155493
Approved by: https://github.com/dcci
ghstack dependencies: #155476
2025-06-10 02:26:22 +00:00
52edfb2cbc Updates to README about CUDA install dir and conda not required (#155458)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155458
Approved by: https://github.com/malfet
2025-06-10 01:30:34 +00:00
f34335bf33 Convert compiler rst files to markdown (#155335)
Convert following compiler rst files to md file.
torch.compiler_inductor_profiling.rst
torch.compiler_ir.rst
torch.compiler_nn_module.rst
torch.compiler_performance_dashboard.rst
torch.compiler_profiling_torch_compile.rst

Fixes #155039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155335
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-10 01:12:11 +00:00
1851f50866 [AOTI] Add int return type support for custom op in proxy executor (#155465)
Summary:
When a custom op has int return type in its schema. The returned value will be specialized and such behaviour is different from a symint return type. This diff **only added support for int return type**.

As the returned int will be specialized and fused into downstream kernels (if being used), we can simply skip the int return type in the proxy executor.

Note that in the eager run, the returned int will be specialized to the value defined in the real impl of the custom op. In exported program or in AOTI, the returned int will be specialized to the value defined in the fake impl of the custom op. So the definitions of the return value should be consistent across real and fake impl of the custom op. Otherwise the eager run and AOTI run will have different results.

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_int_output
```

Rollback Plan:

Differential Revision: D76159406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155465
Approved by: https://github.com/angelayi
2025-06-10 01:07:15 +00:00
da50835bde [aoti] Support c10 calls (#155256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155256
Approved by: https://github.com/malfet
2025-06-10 00:45:59 +00:00
07e340e29c Build magma-cuda 129 (#155496)
followup for https://github.com/pytorch/pytorch/pull/155340
https://github.com/pytorch/pytorch/issues/155196
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155496
Approved by: https://github.com/atalman
2025-06-10 00:32:24 +00:00
e7698ff5cf [MPS] Move abs op to Metal (#155474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155474
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-10 00:23:59 +00:00
7a48cc6990 Revert "[cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)"
This reverts commit b8bc2c2660e84034ff15232e2161e3ef9a6656d0.

Reverted https://github.com/pytorch/pytorch/pull/154170 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it starts failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/154170#issuecomment-2957346976))
2025-06-10 00:18:23 +00:00
a9a0501ec4 [user triton] mutation analysis for on-device TMA (#155380)
Previously, the user-defined triton kernel mutation analysis would not detect mutation caused by TMA store, if the TMA descriptor was created via on-device TMA creation. This PR adds partial support for mutation analysis on programs that do stores via on-device TMA.

On-device TMA works like this:

```
@triton.jit
def kernel(A_ptr, workspace_ptr, ...):
    tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, A_ptr, ...)
    tl._experimental_descriptor_store(workspace_ptr, data, ...)
```

The first call (tensormap_create2d) mutates the contents of workspace_ptr to contain a data (including the fact that this TMA descriptor points to A_ptr). The second call (experimental_descriptor_store) writes to the location specified by the data in workspace_ptr: A_ptr, in this case.

The approach here is to do a first pass to identify all the experimental_descriptor_stores (and collect the associated descriptor values); and then during mutation analysis, any tma creation on a mutated descriptor value (e.g. on `workspace_ptr` in the above example) will actually register as a mutation to the associated data pointer (e.g. `data` in the above example).

Consider this example, which I'll used to describe the pros/cons of this approach.

```
@triton.jit
def create_tma(global_ptr, workspace_ptr):
    tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, global_ptr, ...)

@triton.jit
def kernel(A, B, workspace_ptr):
    create_tma(A, workspace_ptr)
    workspace_B = workspace_ptr + 128
    create_tma(B, workspace_B)
    data = tl._experimental_descriptor_load(workspace_ptr, ...)
    tl._experimental_descriptor_store(workspace_B, data, ...)
```

An alternative approach could be to simply modify the `tl.extra.cuda.experimental_device_tensormap_create2d` so that it returns a descriptor, and to use that descriptor in subsequent uses (i.e. to "functionalize" the uses of the tma creation API). However, this would (a) require "functionalization" through any function calls (e.g. to `create_tma`), and (b) would lead to both `A` and `B` being marked as mutated (i.e. mutation to `workspace_B` -> mutation to `workspace_ptr` -> mutation to `A`).

A downside of the current approach is that it doesn't understand offsets into workspaces. e.g. if one were to recompute workspace_B instead of reusing the variable, the analysis pass would not understand that these values point to the same descriptor.

Differential Revision: [D76175117](https://our.internmc.facebook.com/intern/diff/D76175117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155380
Approved by: https://github.com/oulgen
2025-06-10 00:07:18 +00:00
2578796e23 Fix sqlite3 in x86 Docker container. (#155211)
Some core modules for versions of python installed in /opt depend on libraries in /usr/local but those libraries are not copied over from the base container.

For example: /opt/python/cp312-cp312/bin/python3 -c "import sqlite3"
ImportError: libsqlite3.so: cannot open shared object file: No such file or directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155211
Approved by: https://github.com/huydhn
2025-06-09 23:42:02 +00:00
5df3bf13ec [Docs] Convert to markdown: torch.compiler_troubleshooting.rst (#155351)
Part of changes #155040 (parent PR #155120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155351
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-09 23:18:31 +00:00
e12597090c Revert "Update auto-tuning support for _scaled_grouped_mm (#150944)"
This reverts commit 09328eb02f5412d2211b5fd638ce82d0e03b9c1f.

Reverted https://github.com/pytorch/pytorch/pull/150944 on behalf of https://github.com/davidberard98 due to breaks internal usage & complicates triton pin update - more details in https://github.com/pytorch/pytorch/pull/150944#issuecomment-2957246463 ([comment](https://github.com/pytorch/pytorch/pull/150944#issuecomment-2957248841))
2025-06-09 23:12:56 +00:00
40d02eb481 [Cutlass] Allow filtering by fast_accum for scaled_mm (#155195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155195
Approved by: https://github.com/drisspg
ghstack dependencies: #154829, #154835
2025-06-09 22:46:18 +00:00
2c1a93a0ae Revert "[Graph Partition] move cpu scalar tensor to gpu (#154464)"
This reverts commit c1f531f0b0e6faf443d90f8de2936e866c8c27c2.

Reverted https://github.com/pytorch/pytorch/pull/154464 on behalf of https://github.com/clee2000 due to some of the newly added tests are failing internally, along with some other tests, D75913292 ([comment](https://github.com/pytorch/pytorch/pull/154464#issuecomment-2957201054))
2025-06-09 22:43:20 +00:00
82e6475d92 Add doc for missing functions for torch.special module (#155074)
Fixes #132178

Added all the missing functions that had a docstring but were not present in the documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155074
Approved by: https://github.com/albanD
2025-06-09 22:28:26 +00:00
bdbf2792a8 Fix docs build (#155129)
Not sure why the online doc build passes but it fails locally with these broken strings...

~Also pinning numpy version even though it is technically optional to ensure users have the right version as most users have numpy in their environment anyways.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155129
Approved by: https://github.com/janeyx99, https://github.com/svekars
2025-06-09 22:25:20 +00:00
034a7f6437 [BE] Raise better exception in torch.[con]cat[enate] (#155460)
By replacing `TORCH_CHECK` with `TORCH_CHECK_VALUE`

Also make redispatching from aliases an even simpler, by just calling
respective original class

Addresses feedback raised in https://github.com/pytorch/pytorch/pull/155383/files#r2133952368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155460
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-06-09 22:18:00 +00:00
398fca9dcf Add almalinux CUDA 12.9 docker build, required for magma build (#155340)
https://github.com/pytorch/pytorch/issues/155196
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155340
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-06-09 22:10:24 +00:00
ede6ead8cd Move non inductor workflows cuda 12.6->cuda 12.8 (#155234)
Move non inductor workflows cuda 12.6->cuda 12.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155234
Approved by: https://github.com/Skylion007, https://github.com/zxiiro, https://github.com/cyyever, https://github.com/malfet
2025-06-09 22:04:19 +00:00
060838c231 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```python
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-06-09 21:43:21 +00:00
b95dadd717 [MPS] Enable RProp test for non-contiguous (#155439)
I believe this issue has already been fixed, but I don't know the hero PR. I'm relying on ci signals to verify it's fixed across macOS versions.

Fixes #118117

xref #115350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155439
Approved by: https://github.com/Skylion007
2025-06-09 21:29:09 +00:00
3490a4f906 [MPS] Enable optimizer tests affected by addcdiv (#155437)
Tracked in #118115. Fixed in #124442. This PR unskips the tests.

xref #115350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155437
Approved by: https://github.com/Skylion007
2025-06-09 21:27:37 +00:00
b8bc2c2660 [cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)
Fixes https://github.com/pytorch/pytorch/issues/153590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154170
Approved by: https://github.com/malfet
2025-06-09 21:23:32 +00:00
eba5fc91ac [nativert] Move serialization to PyTorch core (#155229)
Summary:
Serialization contains utilities to deserialize a graph saved on disk in json format as defined in `torch/csrc/utils/generated_serialization_types.h` to the in-memory representation as defined in `torch/nativert/graph/Graph.h`

Test Plan:
buck2 run @mode/dev-nosan caffe2/test/cpp/nativert:serialization_test

Rollback Plan:

Differential Revision: D76012641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155229
Approved by: https://github.com/zhxchen17
2025-06-09 21:12:30 +00:00
1e6a653234 [ROCm][Inductor][CK] Split ck and ck-tile inductor backend(s) (#155294)
... and fix ck-tile instances not being generated due to incorrect caching

### Testing

Added test cases for CKTILE instances

```
pytest test/inductor/test_ck_backend.py -k gemm_backends_CKTILE
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155294
Approved by: https://github.com/coconutruben
2025-06-09 20:40:26 +00:00
620415e018 Revert "Add stack_trace on make_fx (#155155)"
This reverts commit d4d0ede6bacb4b3b33c0e4aa4cb0e79d34e697ec.

Reverted https://github.com/pytorch/pytorch/pull/155155 on behalf of https://github.com/malfet due to Not sure why it was merged, it indeed breaks those tests in CI ([comment](https://github.com/pytorch/pytorch/pull/155155#issuecomment-2956973633))
2025-06-09 20:40:13 +00:00
abbdf9f363 [BE][Testing] Unskip ones_like/zeros_like testing on MPS (#155476)
But skip `double` dtype form OpInfo variants for this test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155476
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-06-09 20:37:44 +00:00
ea37f72099 enable test (#155342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155342
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #154768
2025-06-09 19:26:05 +00:00
d4d0ede6ba Add stack_trace on make_fx (#155155)
Summary:
Previosuly, we only add stack trace in `class _ModuleStackTracer(PythonKeyTracer)` for non-strict export. I moved this stack trace logic to the parent class `PythonKeyTracer`, this way the graph traced from Module using make_fx will have stack_trace as well.

Motivation: we've observed some uses cases where users first use `make_fx` on the Module, and then run `export` on the resulting graph. If the result of `make_fx` doesn't have stack trace, the stack trace information is lost.

Test Plan:
```
buck run test:test_export -- -r  test_stack_trace
```

Rollback Plan:

Differential Revision: D75985427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155155
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-06-09 18:31:57 +00:00
2aade5ee9f Fix weight tensor documentation #134896 (#155093)
Fixes #134896

## Description

Remove line about 'weight' tensor needing to be of floating point type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155093
Approved by: https://github.com/AlannaBurke
2025-06-09 18:07:21 +00:00
3863bbb55b [BE]: Update cusparselt to 0.7.1 (#155232)
Needed to support sparse operations on Blackwell, and implements new features for the library. Also optimizes library sizes vs 0.7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155232
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2025-06-09 18:01:23 +00:00
79bdafe5b6 Revert "Custom FX pass for inductor's backend registration (#154841)"
This reverts commit e694280d1215caf70f41575f2611bfa26c69ebdb.

Reverted https://github.com/pytorch/pytorch/pull/154841 on behalf of https://github.com/clee2000 due to failing some tests internally D76135706 ([comment](https://github.com/pytorch/pytorch/pull/154841#issuecomment-2956357711))
2025-06-09 16:56:45 +00:00
0083032e75 [aotd] Support mutations in reordering_to_mimic_autograd_engine (#155353)
Original issue: https://github.com/pytorch/pytorch/issues/154820

Dedicated sub-issue: https://github.com/pytorch/pytorch/issues/155242

Backward graph is reordered by partitioners.py: reordering_to_mimic_autograd_engine

Which only records in the backward graph compute that starts from tangents.

Mutation of primals(inputs) in backward can be disconnected from backward.

Handling this copy_ specifically, as we  add this mutation in framework and this is the only mutation that exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155353
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2025-06-09 16:39:47 +00:00
6c05f2fca0 [test] use JK to force graph break on slow aliasing/mutation/dynamic_shape behavior (#155257)
Summary: test to unblock shampoo, needs cleanup

Test Plan:
CI

Rollback Plan:
steps:
  - jk.update:
      jk: pytorch/compiler:aliased_inputs_with_mutation_and_dyn_shapes_killswitch
      constant_bool: null
      consistent_pass_rate: null
      fractional_host_rollout: null
      sampling_rate: null
  - manual.note:
      content: Set it to false.

Reviewed By: c00w

Differential Revision: D76051868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155257
Approved by: https://github.com/c00w
2025-06-09 16:21:59 +00:00
4a4cac0cef Update torch-xpu-ops commit pin (#154962)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@`a3a196`](a3a196ccdb) includes:

- Enhanced Adaptive Average Pooling 2D Backward Kernel for performance and code simplification
- Group Norm Backward Optimization with vectorization and parallel reduction
- Support CL path for MaxUnpooling2d and MaxUnpooling3d
- Rename USE_ONEMKL as USE_ONEMKL_XPU and set it as default ON
- Refactor USE_XCCL & USE_C10D_XCCL option
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154962
Approved by: https://github.com/EikanWang
2025-06-09 15:54:13 +00:00
b9b84d8011 Generate unique id for tensor storage object by observing the week pointer of tensor storage object (#154859)
Summary:
PyTorch execution trace records tensor storage data in the trace. The tensor storage data includes storage id, offset, number of elements, and number of byte for each element. PARAM et-replay uses this information to allocate/free the tensors.
However, the current implementation of generating tensor storage id does not guarantee it is unique. ExecutionTraceObserver maintains a lookup table to map the memory address of the tensor storage object to an unique id. If a new memory address is found, it will be put into that hash table and associate it to a new id.
This implementation does not guarantee the storage object is unique since the memory that the address points to may be released and then re-allocated to a different tensor storage object.

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Differential Revision: D75749065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154859
Approved by: https://github.com/eellison, https://github.com/ngimel
2025-06-09 15:46:27 +00:00
79aef14169 [ONNX] Set the name of the producing node using the value name (#155413)
When comparing two graphs exported using different opset versions, even though the value names are the same in both graphs, the node names did not match, causing model-explorer to not be able to sync the two graphs. This change updates the names of the nodes that directly produce the output values, for better correspondence across exported graphs.

![image](https://github.com/user-attachments/assets/3c00ca18-221f-4add-8429-4bcf12069036)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155413
Approved by: https://github.com/cyyever, https://github.com/xadupre
2025-06-09 13:03:58 +00:00
e15848669f [1/n]adding torch.distributed.run option to provide destination for event logging (#154644) (#155268)
Summary:

**Problem Statement**
Currently, torch distributed elastic does not support to an option specify destination for event logging from torch.distributed.run.
*recording events to default destination:* https://fburl.com/code/7f9b0993
The default destination is "null".

***Solution***
adding option in torch.destributed.run to specify event_logging_destination. The default value will be "null" which is current default so it won;t affect users unless the specify it via command line.

Test Plan:

https://www.internalfb.com/mlhub/pipelines/runs/mast/f738408681-TrainingApplication_torch_distributed_run_3?job_attempt=0&version=0&tab=execution_details&env=PRODUCTION

Rollback Plan:

Reviewed By: kiukchung

Differential Revision: D75183591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155268
Approved by: https://github.com/d4l3k
2025-06-09 10:43:52 +00:00
9968c854b6 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/tensor.py (#153146)
Part of #147913

Replace `unimplemented` with`unimplemented_v2` in `torch/_dynamo/variables/tensor.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153146
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <william.wen42@gmail.com>
2025-06-09 06:27:50 +00:00
9b4a748e29 [nativert] Move Weights to PyTorch core (#155156)
Summary:
Moves Weights class to PyTorch core
Torch Native Runtime RFC: pytorch/rfcs#72
 README: https://github.com/pytorch/pytorch/blob/main/torch/nativert/OVERVIEW.md

Test Plan: buck2 run mode/dev-nosan caffe2/test/cpp/nativert:weights_test

Differential Revision: D75973156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155156
Approved by: https://github.com/zhxchen17
2025-06-09 05:49:32 +00:00
6fb6293159 Revert "Add Intel GPU info collection to the collect env script (#137846)"
This reverts commit c6b4f98625bb6b22bb9a60112a6d58e684a97e1b.

Reverted https://github.com/pytorch/pytorch/pull/137846 on behalf of https://github.com/etaf due to This is breaking tests on xpu, detail log: https://hud.pytorch.org/pr/pytorch/pytorch/154962#43700962849 ([comment](https://github.com/pytorch/pytorch/pull/137846#issuecomment-2954517883))
2025-06-09 03:13:27 +00:00
be2ad70cfa Fix dynamo tracing into AOTAutogradCache results in cpu tensors (#155251)
On this line, we see that the bw_compiler that dynamo uses for AotAutograd automatically disables the backward runnable:
05dd638ee9/torch/_dynamo/backends/common.py (L76)
This disables dynamo in the bw_compiler but also disables the runnable the compiler returns.

On a AOTAutogradCache hit, however, we never call the bw_compiler! So we don't disable dynamo properly. This only has an effect on certain cases of cpu tensors' backwards, where the backward is being done in python land, and dynamo unnecessarily tries to trace through the inductor generated code. It also only matters if the backward is being accessed outside of dynamo itself (say, in a graph break in eager mode), since dynamo properly disables the forward function already.

```
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] TorchDynamo attempted to trace the following frames: [
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517]   * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] ]

```

This PR fixes the issue and adds a unit test showing that with or without cache hit, the frames dynamo is tracing is identical.

Fixes https://github.com/pytorch/pytorch/issues/154536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155251
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2025-06-09 02:06:16 +00:00
2908c10259 Document the default garbage_collection_threshold value and improve the organization of cuda docs (#155341)
Fixes #150917

As mentioned in the issue, I've updated the documentation of `garbage_collection_threshold`and improved the organization.

Could you please review?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155341
Approved by: https://github.com/AlannaBurke, https://github.com/ngimel
2025-06-08 22:09:35 +00:00
d41f62b7a0 Fix/issue #155027 (#155252)
Fixes #155027
Converted RST files to Markdown

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155252
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-08 21:17:31 +00:00
3d82a1dfb5 Add checks for empty tensor list (#155383)
Vibe-coded with Codex, after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b

Even though, check for empty tensor list exists in `at::cat` crash might happens while resolving named dimension to position, by calling `dimname_to_position(tensors[0], dim)`, see backtrace below
```
(lldb) up
frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10
   556 	  bool has_names() const {
   557 	    // If a user is using unnamed tensors, then we can short-circuit right here.
   558 	    // Otherwise, impl::has_names attempts to retrieve names.
-> 559 	    if (!impl_->has_named_tensor_meta()) {
   560 	      return false;
   561 	    }
   562 	    return impl::has_names(unsafeGetTensorImpl());
(lldb) up
frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3
   20  	int64_t dimname_to_position(const Tensor& tensor, Dimname dim) {
   21  	  TORCH_CHECK(dim.type() != NameType::WILDCARD,
   22  	      "Please look up dimensions by name, got: name = None.");
-> 23  	  TORCH_CHECK(tensor.has_names(),
   24  	      "Name ", dim, " not found in ", toDimnameRepr(tensor), ".");
   25  	  const auto names = tensor.names();
   26
```

TODOs:
 - May be move test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable)
 - Replace  `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests

Fixes https://github.com/pytorch/pytorch/issues/155306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155383
Approved by: https://github.com/cyyever, https://github.com/ezyang
ghstack dependencies: #155382
2025-06-08 18:53:19 +00:00
95448b2ce6 Revert "[Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)"
This reverts commit 65b1aedd09e98fcafcdd893ca4924f4fa598fd18.

Reverted https://github.com/pytorch/pytorch/pull/154371 on behalf of https://github.com/clee2000 due to see henry's comment above.  This was reverted internally because it causes a memory leak and OOMs on AMD? ([comment](https://github.com/pytorch/pytorch/pull/154371#issuecomment-2954192879))
2025-06-08 17:37:29 +00:00
30293b8b5e Preserve Enum types during torch.export serialization and deserialization (#154821)
Fixes #154674

Addresses an issue where `torch.export` does not correctly preserve Python `Enum` types during the save/load round-trip. Previously, Enum inputs were serialized by value only, causing their type to be lost after deserialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154821
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007, https://github.com/yushangdi, https://github.com/angelayi
2025-06-08 17:30:31 +00:00
27df0c56b7 Revert "[inductor] use int64 for large index (#154575)"
This reverts commit 2596e3d0617852469241be8777cf46db5c83928c.

Reverted https://github.com/pytorch/pytorch/pull/154575 on behalf of https://github.com/clee2000 due to broke inductor/test_op_dtype_prop.py::TestCaseCUDA::test_op_dtype_propagation_add_cuda_int32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15510656657/job/43673763835) [HUD commit link](2596e3d061), note for self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/154575#issuecomment-2954175761))
2025-06-08 16:58:59 +00:00
49888e6be0 [BE] Polish Makefile (#155425)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155425
Approved by: https://github.com/ezyang
2025-06-08 16:37:12 +00:00
b981fb6744 Add docblock to torch/_dynamo/variables/builtin.py (#155402)
Add comprehensive module docstring explaining built-in function and type
variable tracking, including handling of Python built-ins, type constructors,
operators, and special constructs during symbolic execution.

Originally generated by claude but reviewed and edited by me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155402
Approved by: https://github.com/Skylion007
ghstack dependencies: #155403
2025-06-08 15:24:29 +00:00
09328eb02f Update auto-tuning support for _scaled_grouped_mm (#150944)
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not multiple of block size along K
6. Updated meta registration
7. Update synthetic offsets creation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944
Approved by: https://github.com/ngimel
2025-06-08 10:18:13 +00:00
1339e88105 Add docblock to torch/_dynamo/side_effects.py (#155403)
Add comprehensive module docstring explaining side effect tracking and
management, including mutation tracking, context changes, aliasing,
and state preservation during symbolic execution.

Originally generated by claude but reviewed and edited by me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155403
Approved by: https://github.com/williamwen42
2025-06-08 07:02:30 +00:00
0756ebcd48 Add docblock to torch/_dynamo/trace_rules.py (#155401)
Add comprehensive module docstring explaining the tracing rules and policies
that govern TorchDynamo's compilation decisions, including skip rules,
inlining policies, and library-specific handling.

Originally generated by claude but reviewed and edited by me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155401
Approved by: https://github.com/williamwen42
2025-06-08 04:30:03 +00:00
abf4da0d24 [Profiler] Induce Inductor Import before Profiling (#155243)
Fixes #151829
Summary:
Currently, inductor has a lazy init which causes certain aten ops to run during a profiling run. This ends up cluttering the function events especially for smaller traces. One of the attempts to fix this was to simply remove that import from the profiler entirely but it looks like the import happens somewhere downstream anyways and the event still flood our profile.

To fix this, we induce the inductor import during prepare trace if the inductor is present. This way regardless of how the workload imports the inductor the actual init process will be done before tracing starts, resulting in more accurate tracing.

Test Plan:
Added test, also ran N7316820 manually and went from getting many events on the first run to the following output (only difference is Runtime Triggered Module Loading which is CUPTI overhead event):

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
                                             aten::mul_         1.40%     340.638us        99.92%      24.390ms      24.390ms       1.535us       100.00%       4.605us       4.605us             1
                                       cudaLaunchKernel         0.60%     146.533us        98.52%      24.049ms      24.049ms       0.000us         0.00%       3.070us       3.070us             1
                       Runtime Triggered Module Loading         6.14%       1.500ms         6.14%       1.500ms       1.500ms       1.535us       100.00%       1.535us       1.535us             1
                       Runtime Triggered Module Loading        91.78%      22.403ms        91.78%      22.403ms      22.403ms       1.535us       100.00%       1.535us       1.535us             1
                       void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.535us       100.00%       1.535us       1.535us             1
                        cudaDeviceSynchronize         0.08%      20.031us         0.08%      20.031us      20.031us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
                                   aten::mul_        82.81%     484.396us        94.26%     551.378us     551.378us       1.440us       100.00%       1.440us       1.440us             1
                                   cudaLaunchKernel        11.45%      66.982us        11.45%      66.982us      66.982us       0.000us         0.00%       0.000us       0.000us             1
                                  void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.440us       100.00%       1.440us       1.440us             1
                                  cudaDeviceSynchronize         5.74%      33.581us         5.74%      33.581us      33.581us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

Rollback Plan:

Differential Revision: D76056511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155243
Approved by: https://github.com/ngimel
2025-06-07 23:58:50 +00:00
f1f49e56b0 [CI] remove xfail sm89 job (#155244)
No need to collect more data
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155244
Approved by: https://github.com/janeyx99, https://github.com/huydhn, https://github.com/Skylion007
2025-06-07 21:04:57 +00:00
11bc29856d Fix some incorrect reST markups in the document (#154831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154831
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-07 19:09:46 +00:00
2596e3d061 [inductor] use int64 for large index (#154575)
Split reduction may need add an extra mask to avoid invalid index. Previously we always uses torch.int32 dtype. That causes problem when the tensor numel exceeds 2^31.

Fix https://github.com/pytorch/pytorch/issues/154168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154575
Approved by: https://github.com/ngimel, https://github.com/jansel
2025-06-07 18:41:46 +00:00
cyy
f6e18bc105 Fix CUDA 12.8 docker tag (#155087)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155087
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2025-06-07 16:39:42 +00:00
783a4c1f50 [ROCm] fix nightly wheel, second attempt (#155388)
Fixes #155207. hipsparselt logic was still broken, but smoke test didn't catch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155388
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-07 15:57:55 +00:00
ab56e5add9 [CUDA][BUILD] Add back the capability to use env TORCH_CUDA_ARCH_LIST (#155314)
Add back the capability to use env TORCH_CUDA_ARCH_LIST to control how downstream projects (which uses find_package (torch)) build.

Follow up to: https://github.com/pytorch/pytorch/pull/152715

Before this PR,
On a CPU only machine, building a downstream project would ignore the TORCH_CUDA_ARCH_LIST setting (if set) and go straight to the auto GPU detection mode, in which case there would be no GPU detected and an excessive list of cuda architectures may be used. This also means that there is no way to build a binary that would be targeting a different SM on the current machine a developer is using.

After this PR,
TORCH_CUDA_ARCH_LIST is effective for developers to control explicitly which SM architectures to build.

p.s. I think this PR might have been the original intent of https://github.com/pytorch/pytorch/pull/152715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155314
Approved by: https://github.com/janeyx99, https://github.com/eqy, https://github.com/atalman
2025-06-07 15:52:39 +00:00
456f40cb09 Add docblock for autotune_cache.py (#155133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155133
Approved by: https://github.com/aorenste
2025-06-07 14:50:09 +00:00
29e6033ff3 [Break XPU] Fix failed test cases which are introduced by community for XPU. (#155317)
Fixes #155186, Fixes #154701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155317
Approved by: https://github.com/jansel
2025-06-07 14:46:30 +00:00
694028f502 update get_default_device to also respect torch.device ctx manager (#148621)
Fixes https://github.com/pytorch/pytorch/issues/131328
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148621
Approved by: https://github.com/ezyang
2025-06-07 14:26:17 +00:00
db491825e0 [invoke_subgraph] Add logging (#155284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155284
Approved by: https://github.com/zou3519
ghstack dependencies: #155270
2025-06-07 11:31:53 +00:00
0f3f59784d [invoke_subgraph] Throw assertion on uncaptured speculate_subgraph (#155270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155270
Approved by: https://github.com/zou3519
2025-06-07 11:31:53 +00:00
c1f531f0b0 [Graph Partition] move cpu scalar tensor to gpu (#154464)
cudagraph does not support cpu tensors. In this PR, we update the graph by explicitly moving cpu tensors to gpu when profitable, relying on graph partition to split off this data copy, and cudagraphifying the remaining gpu ops.

This PR unblocked the graph partition + cudagraph on speech_transformer, leading to 39.5% speedup on inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Close: #119241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154464
Approved by: https://github.com/eellison
2025-06-07 06:59:39 +00:00
386aa72003 [BE] Cleanup old ExecuTorch codegen and runtime code (#154165)
Summary: These files are added to pytorch/pytorch before ExecuTorch is
opensourced. Now is a good time to remove it from pytorch/pytorch, since
the code is moved to pytorch/executorch already.

Test Plan: Rely on CI jobs.

Differential Revision: [D75985423](https://our.internmc.facebook.com/intern/diff/D75985423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154165
Approved by: https://github.com/kimishpatel, https://github.com/Skylion007, https://github.com/cyyever
2025-06-07 06:54:12 +00:00
da1f8980df [nativert] move function schema to torch (#154948)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D75826905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154948
Approved by: https://github.com/zhxchen17
2025-06-07 05:45:30 +00:00
5fbaa041e7 SDPA support gfx950 (#155103)
Summary: Seems to run, just not the optimal performance. e.g. ck_tile doesn't have those gfx942 optimizations it seems https://github.com/ROCm/composable_kernel/blob/develop/include/ck_tile/ops/fmha/block/variants.hpp#L27

Test Plan:
```
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Mem Eff Time (µs) |   Math Time (µs) |   Flex Time (µs) |   xformers Time (µs) |   Flash TFlops |   Mem Eff TFlops |   Math TFlops |   Flex TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (MemEff/Math) |   Speedup (Flex/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+=====================+==================+==================+======================+================+==================+===============+===============+===================+========================+=========================+=======================+===========================+======================+===================+
|            1 |              4096 |      16 |         64 |           179.737 |             182.874 |          3106.6  |          359.662 |              205.506 |        382.334 |          375.776 |       22.1205 |       191.067 |           334.392 |                17.2841 |                 16.9877 |               8.63754 |                   15.1169 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      32 |        128 |           617.271 |             623.38  |          7169.73 |          998.961 |              654.534 |        445.312 |          440.947 |       38.3387 |       275.164 |           419.96  |                11.6152 |                 11.5014 |               7.17719 |                   10.9539 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |         64 |           667.032 |             670.118 |         13031.8  |         1383.42  |              768.452 |        412.091 |          410.193 |       21.0928 |       198.694 |           357.703 |                19.5371 |                 19.4471 |               9.42    |                   16.9586 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |        128 |          2074.64  |            2214.81  |         29186.9  |         3916.35  |             2404.29  |        529.978 |          496.437 |       37.6714 |       280.749 |           457.313 |                14.0684 |                 13.1781 |               7.45257 |                   12.1395 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |         64 |          2456.6   |            2472.38  |         51095.8  |         5647.01  |             3008.09  |        447.574 |          444.718 |       21.5186 |       194.707 |           365.518 |                20.7994 |                 20.6666 |               9.0483  |                   16.9861 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |        128 |          8048.8   |            8070.96  |        113478    |        15580.8   |             9768.71  |        546.423 |          544.922 |       38.7569 |       282.274 |           450.218 |                14.0987 |                 14.06   |               7.2832  |                   11.6165 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Mem Eff Time (µs) |   Math Time (µs) |   Flex Time (µs) |   xformers Time (µs) |   Flash TFlops |   Mem Eff TFlops |   Math TFlops |   Flex TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (MemEff/Math) |   Speedup (Flex/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+=====================+==================+==================+======================+================+==================+===============+===============+===================+========================+=========================+=======================+===========================+======================+===================+
|            1 |              4096 |      16 |         64 |           692.323 |             697.649 |          4241.81 |          1562.26 |              906.441 |        248.148 |          246.254 |       40.5012 |      109.968  |           189.531 |                6.12693 |                 6.08015 |               2.71518 |                   4.67963 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      32 |        128 |          2263.22  |            2267.38  |          9482.64 |          7003.8  |             2765.5   |        303.636 |          303.079 |       72.4687 |       98.1174 |           248.489 |                4.1899  |                 4.18221 |               1.35393 |                   3.42891 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |         64 |          2553.94  |            2572.68  |         15909.8  |          5697.16 |             3284.77  |        269.073 |          267.112 |       43.193  |      120.621  |           209.206 |                6.22953 |                 6.18415 |               2.79259 |                   4.84352 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |        128 |          8187.67  |            8201.71  |         35449.2  |         26424.3  |            10364.5   |        335.722 |          335.147 |       77.5413 |      104.025  |           265.21  |                4.32959 |                 4.32218 |               1.34154 |                   3.42025 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |         64 |          9948.15  |            9815.47  |         62815.1  |         23741.9  |            12710     |        276.31  |          280.046 |       43.7598 |      115.778  |           216.269 |                6.31425 |                 6.39961 |               2.64575 |                   4.94217 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |        128 |         32187.6   |           32035.6   |        137832    |        102075    |            40623.4   |        341.595 |          343.216 |       79.7716 |      107.716  |           270.66  |                4.28216 |                 4.30248 |               1.35031 |                   3.39293 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+

```

Rollback Plan:

Differential Rev,ision: D75934358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155103
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2025-06-07 03:47:29 +00:00
30387ab2e4 [ROCm] Adds initialization support for PyTorch when built from ROCm wheels. (#155285)
AMD is beginning to roll out ROCm distribution via Python wheels. This patch adds the `__init__.py` hook that is necessary to bootstrap ROCm correctly on Linux and Windows when built from these wheels.

See draft, developer documentation describing the mechanism here: https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md

This operates to similar effect as how Torch can depend on CUDA wheels, with some differences:

* ROCm libraries and checks are delegated to helpers in the `rocm_sdk` module, which knows how to find and configure access to the installed libraries. This limits the amount of plumbing and path machinations that must match up between the framework and ROCm.
* When building torch against ROCm, no ROCm system install is needed: instead the proper SDK development wheel is installed and the `CMAKE_PREFIX_PATH` is obtained via `rocm-sdk path --cmake`.
* It is expected that whoever produces such a build will also place a generated `_rocm_init.py` in the `torch` module with initialization logic to preload libraries, check versions, verify GPU compatibility, etc.
* See [build_prod_wheels.py](https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py) for an example build script that is being used to generate nightlies in this configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155285
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-07 02:59:03 +00:00
f140fac8dc [MPS] Implement erfc (#155382)
And migrate `erf` to Metal kernel

Use `erf` approximations from https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/erf.h as previous approximation did not match the CPU implementation

After that, `erfc(x) := 1.0 - erf(x)`

Fixes https://github.com/pytorch/pytorch/issues/155337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155382
Approved by: https://github.com/manuelcandales, https://github.com/dcci
2025-06-07 02:35:12 +00:00
400f439670 [pt][easy] Rename metadata column (#155365)
Summary: Fixing typo: our logging requires autotuning_data instead of autotune_data, making it consistent

Test Plan:
Run benchmark, observe in perfetto trace proper name

Rollback Plan:

Differential Revision: D76159393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155365
Approved by: https://github.com/masnesral, https://github.com/Skylion007
2025-06-07 02:25:55 +00:00
81b0b308ca [dynamo] constant fold torch.cuda.is_initialized (#155300)
Fixes https://github.com/pytorch/pytorch/issues/129659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155300
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-07 02:21:11 +00:00
10cd1de518 [ROCm] Make optional features in LoadHIP better conditioned. (#155305)
* The `rocm-core` CMake package only started appearing in ROCm 6.4, so rework the version probing to work if it is not present. Also collapses the unneeded operating system conditioning in favor of feature probing.
* Make `hipsparselt` optional: it only started appearing in ROCm 6.4 and it is not in all recent distribution channels yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155305
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-07 02:20:55 +00:00
5596cefba6 Fix segfault during NumPy string tensor conversion (#155364)
By checking dtype first, but add elemnt_size check as well

Fixes https://github.com/pytorch/pytorch/issues/155328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155364
Approved by: https://github.com/Skylion007
2025-06-07 01:55:00 +00:00
be2e43264d [CI]Update windows runner to windows-2022 (#154368)
As per info in : actions/runner-images#12045
We need to change window runner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154368
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-06-07 01:39:19 +00:00
83d22256f8 [BE][Ez]: Improve typing in torch._logging (#155345)
Add a few missing returns in torch._logging and use ruff to infer the obvious ones.
LazyStr now properly checks the return type of the Callable and the args and kwargs passed to it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155345
Approved by: https://github.com/ezyang
2025-06-07 00:04:39 +00:00
9b4db093cb Add C shim for at::pad and fix some typos (#155226)
As stated, we would like a pad shim to support custom ops wanting to build in an ABI stable manner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155226
Approved by: https://github.com/desertfire
2025-06-06 23:08:39 +00:00
cd82096973 DOC: Convert to markdown: ddp_comm_hooks.rst, debugging_environment_variables.rst, deploy.rst, deterministic.rst, distributed.algorithms.join.rst (#155298)
Fixes #155017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155298
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-06 22:44:50 +00:00
457dd79927 [BE][Ez]: Remove unnecessary accesses of dim vector (#155334)
It's better because you return less date, encapsulate more, and no longer need special handling of symvec vs nonsym vec dim(). Also removes a few casts and fixes a few potential edge cases relating to unsigned comparisons

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155334
Approved by: https://github.com/ezyang
2025-06-06 21:28:25 +00:00
c95705dac2 [Docs] Convert to markdown: torch.compiler_troubleshooting_old.rst, torch.compiler.rst (#155348)
Part of changes #155040 (parent PR #155120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155348
Approved by: https://github.com/svekars
2025-06-06 21:26:24 +00:00
d2a2bfcb58 Turn on new tiling by default (#154768)
Turning on in fbcode to come. Also updates `max_tiles` to have a default value of None. The existing tiling logic doesn't really handle max_tiles=3 well, but we do in the new tiling logic, so we default to 3 in the new logic and 2 elsewhere unless max_tiles has been explicitly set.

TB runners have been very unstable recently (do we need to bump batch size ?) but e.g. for a [recent torchbench](https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Tue,%2027%20May%202025%2015:38:26%20GMT&stopTime=Tue,%2003%20Jun%202025%2015:38:26%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/eellison/803/head&lCommit=8480c220db4eb3c9e2b58d85a698d0a7113a6e37&rBranch=main&rCommit=0cd18ba1ca35d87916723d445c06664615dcae12) inference run we had 15 models with a lower execution time (i.g. green) and 2 models with higher (i.e.. red)

I am doing another run and will update here.

Dynamic shapes is not yet turned on because there are a lot of fixes to be done in splitting that don't work yet.. See:
```
(Pdb) p expr
((s25*s85)//32)
(Pdb) p FloorDiv(expr, expr)
((s25*s85)//(32*(((s25*s85)//32))))
```

and also - unbacked shape is not multiple of itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154768
Approved by: https://github.com/jansel
2025-06-06 21:19:35 +00:00
bc5a11b581 [easy][invoke_subgraph] Remove skip from already fixed test (#155286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155286
Approved by: https://github.com/zou3519
2025-06-06 21:16:22 +00:00
0d8c029584 [FSDP2] keep root unsharded when not specifying reshard_after_forward (#155319)
for `fully_shard(model)` without explicitly setting `reshard_after_forward=True/False`, we keep root unsharded. When user explicitly set `reshard_after_forward`, we respect it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155319
Approved by: https://github.com/mori360
2025-06-06 20:29:31 +00:00
4f5b34427b DOC: Convert to markdown: torch.overrides.rst, type_info.rst, utils.rst, xpu.rst (#155088)
Fixes #155041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155088
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-06 20:16:13 +00:00
067fd0b3ab [dynamo][cleanup] Simplify disabling of the helper functions on tensor properties (#155259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155259
Approved by: https://github.com/zhxchen17
2025-06-06 19:44:40 +00:00
749757ac1b [a2av] Align length of major dimension in output of 2D a2av (#155172)
Downstream consumer of the 2D all-to-all-v is often a group GEMM.
Today the GEMM often have an alignment requirement on the chunk sizes within grouped sequence, where each chunk carries the tokens headed for an expert. For example, `torch._group_mm` requires an alignment of 8.

This PR adds that alignment capability, when user passes in a `major_align` argument, so that no extra padding step is needed.

The key in supporting that is making the output offsets aligned to such value. (Output offsets are returned to the users in the 3rd row of `in_out_splits`, on device. The 2nd row, output splits, are unaffected by this alignment value -- i.e. reflecting true number of tokens for an expert.)

The algorithm is as follows.

![502413288_678786854922438_530852083153996358_n](https://github.com/user-attachments/assets/557624a3-150e-4ab6-ba8b-1dbaa5ac01ac)

In detailed implementation, we use warp scan to calculate prefix sum on the "block" illustrated above. As a result, the "block" size, i.e. `npes` is currently limited to warp size 32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155172
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677, #155058
2025-06-06 19:39:44 +00:00
1ccc57e428 Log backward no-op to tlparse and pt2 compile events. (#154544)
Summary: Log backward no-op to tlparse and pt2 compile events.

Test Plan:
$ rm -rf /tmp/r && TORCH_TRACE=/tmp/r buck2 run //scripts/jovian:backward_noop_repro_compile

Used print statements to verify we enter the logging code region.

Differential Revision: D75231665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154544
Approved by: https://github.com/c00w
2025-06-06 18:08:19 +00:00
2e2ea7290a [Inductor] Support autotuning in the FX backend. (#155049)
# Feature
If `config.triton.autotune_at_compile_time` is set to `True`, autotune Triton kernels during FX conversion. Else, stick with the existing behavior of using the first precompiled config.

# Test plan
Added CI tests verifying that the tuner is called iff this flag is set, with and without dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155049
Approved by: https://github.com/jansel
2025-06-06 17:44:14 +00:00
453bc9fbdf [a2av] 2D all-to-all-vdev (#155058)
A 2D AllToAllv shuffle is illustrated below:
(`world_size` = 2, `ne` = 2, where `ne` is number of experts per rank)
```
        Source: |       Rank 0      |       Rank 1      |
                | c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 |

        Dest  : |       Rank 0      |       Rank 1      |
                | c0 | d0 | c1 | d1 | c2 | d2 | c3 | d3 |
```
where each `c_i` / `d_i` are slices of the `input` tensor, targeting expert `i`, with length indicated by input splits (in `in_out_splits[0]`).

That is, the 2D AllToAllv shuffle achieves a transpose from rank-major order at input to expert-major order at output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155058
Approved by: https://github.com/ngimel
ghstack dependencies: #153653, #153677
2025-06-06 17:35:39 +00:00
64436c38c9 [logs] Add autotuning data (#154771)
Summary: Add autotuning logging data to scuba/chrome trace.

Test Plan:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 tlp buck run //scripts/sashko:compilation_sample
```

Open https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/viewer?local_cache_key=00000000-0000-0000-92db-f23383ebf5b5, search for template_autotuning, see in metadata strides (see screenshot)

Differential Revision: D75457770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154771
Approved by: https://github.com/masnesral, https://github.com/PaulZhang12
2025-06-06 17:12:55 +00:00
706bc41c4c pass mempool arg through emptyCache (#155315)
Fixing typo in a previous PR #154746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155315
Approved by: https://github.com/Skylion007
2025-06-06 16:14:26 +00:00
7ae7c14143 Reduce scope of s390x CI (#155208)
The purpose of this change is to reduce scope of s390x CI to stop it potentially blocking usual workflows for other users
while still keeping nightly builds and tests for me to look at.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155208
Approved by: https://github.com/malfet
2025-06-06 16:07:34 +00:00
fc77269262 Add randint_like tensor overload for high (#154899)
Fixes #135664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154899
Approved by: https://github.com/StrongerXi
2025-06-06 15:48:00 +00:00
7e4c097b07 Revert "[inductor] Add typing to _inductor/ir.py (#149958)"
This reverts commit 529e0357c6c4e74f8cd32c29198c5f1c9f6e329d.

Reverted https://github.com/pytorch/pytorch/pull/149958 on behalf of https://github.com/malfet due to Looks like it broke inductor_torchbind tests, due to more graphbreaks, see b0fbbef136/1 ([comment](https://github.com/pytorch/pytorch/pull/149958#issuecomment-2949583209))
2025-06-06 15:19:16 +00:00
b0fbbef136 Revert "Turn on new tiling by default (#154768)"
This reverts commit 7dcc77e422dcf97ce35991a138ab635a5cb88731.

Reverted https://github.com/pytorch/pytorch/pull/154768 on behalf of https://github.com/malfet due to Looks like it broke inductor CPU, see 231eb9902b/1 ([comment](https://github.com/pytorch/pytorch/pull/154768#issuecomment-2949468396))
2025-06-06 14:40:03 +00:00
231eb9902b [MPS][BE] Extend ndim_and_dtypes to 4 elements (#155272)
Metal arguments must be 8 bytes aliged (or may be 16 bytes), so running
any strided (or typecasted) binary op with MTL_DEBUG_LAYER leads to
exception
```
% MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
2025-06-05 15:41:34.201 Python[86653:16826825] Metal API Validation Enabled
test_output_match_add_mps_bfloat16 (__main__.TestConsistencyMPS.test_output_match_add_mps_bfloat16) ...
validateComputeFunctionArguments:1083: failed assertion `Compute Function(add_strided_bfloat_bfloat): argument ndim[0] from buffer(7) with offset(0) and length(12) has space for 12 bytes, but argument has a length(16).'
zsh: abort      MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
```

Extend it to 4 elements and pass output dtype, which will be used by
binary_op later on anyway

Test plan: Run abovementioned command with `MTL_DEBUG_LAYER=1` and make
sure everything passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155272
Approved by: https://github.com/angelayi, https://github.com/dcci, https://github.com/cyyever
2025-06-06 14:20:21 +00:00
529e0357c6 [inductor] Add typing to _inductor/ir.py (#149958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958
Approved by: https://github.com/Skylion007
2025-06-06 14:15:01 +00:00
348fd45065 Support detached checkout in tools/nightly.py (#154314)
Prompt for Sonnet 3.7 in Claude Code: Only inspect tools/nightly.py, all
other files are irrelevant to your task. Do not use any shell commands.
Task: Add a --detach argument to this script which instead of making a
new branch just directly checks out the correct commit in detached mode.

With two interventions:
- Branch and detach are mutually exclusive. So you should consolidate
  them into a single argument. Why don't we take over the 'None' option?
- Do you know that nightly_version is guaranteed to be a commit hash? It
  seems it would be safer to explicitly pass --detach

I tested by running `python tools/nightly.py checkout` and observing
that my worktree was detached at this point.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154314
Approved by: https://github.com/XuehaiPan, https://github.com/malfet
2025-06-06 13:28:29 +00:00
907aea032d Add claude local md files (#155299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155299
Approved by: https://github.com/ezyang
2025-06-06 13:28:26 +00:00
6b1211df29 [BE]: Backport runtime_checkable perf improvements/behavior from 3.12 (#155130)
Backports some behavior changes and performance improvements with runtime_checkable in 3.12 to older versions of Python. Should be free performance improvement on typing checking protocols since everything works on Python 3.12.

The difference between the two versions of runtime_checkable is [these lines](40e22ebb2c/src/typing_extensions.py (L800-L823)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155130
Approved by: https://github.com/rec, https://github.com/aorenste
2025-06-06 13:28:05 +00:00
10cef1e25d Remove torch XPU ABI=0 build logic for old compiler (#150095)
# Motivation
Follow https://github.com/pytorch/pytorch/pull/149888, this PR intends to remove ABI=0 build logic for PyTorch XPU build with old compiler (< 2025.0). For newer compilers >= 2025.0, the ABI is neutral by default without requiring additional compilation options (`-fpreview-breaking-changes`).

# Additional Context
This PR depends on XPU CI pass, which will be fixed by  https://github.com/pytorch/pytorch/pull/149843 and https://github.com/intel/torch-xpu-ops/pull/1515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150095
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-06-06 13:13:19 +00:00
58e5d20c57 [BE] Delete IS_SPMM_AVAILABLE() logic (#155296)
As it's been available on all currently supported platforms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155296
Approved by: https://github.com/clee2000
2025-06-06 13:12:35 +00:00
271ca679a8 [reland][dynamo] Record the pre-graph bytecode using fast record function event (#154974)
reland of https://github.com/pytorch/pytorch/pull/154769

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154974
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
2025-06-06 13:11:03 +00:00
9656251bb1 Revert "[BE] Update cudnn to 9.10.1.4 (#155122)"
This reverts commit a14f427db68e54500ef4cd9ed34cb9537263bb74.

Reverted https://github.com/pytorch/pytorch/pull/155122 on behalf of https://github.com/malfet due to Looks like it breaks a bunch of tests, see 36a722e20d/1 ([comment](https://github.com/pytorch/pytorch/pull/155122#issuecomment-2949209801))
2025-06-06 13:03:49 +00:00
36a722e20d [typo] Fix 'intialize' -> 'initialize' in proxy_tensor.py (#155301)
## Description
Fixes a typo in the comment of `torch/fx/experimental/proxy_tensor.py`, changing "intialize" to "initialize".

## Issue
None

## Type of change
- [x] Typo fix

## Checklist
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] My changes generate no new warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155301
Approved by: https://github.com/jingsh, https://github.com/ezyang, https://github.com/cyyever
2025-06-06 10:43:44 +00:00
9d59b516e9 Make device check throw specific error (#155085)
Fixes #122757

The fix is lost after revert and rebase previous PR https://github.com/pytorch/pytorch/pull/150750 (only change of tests are merged).

## Test Result

```python
>>> import torch
>>>
>>> model_output = torch.randn(10, 5).cuda()
>>> labels = torch.randint(0, 5, (10,)).cuda()
>>> weights = torch.randn(5)
>>>
>>> loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
>>> loss = loss_fn(input=model_output, target=labels)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
    return F.cross_entropy(
           ^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3476, in cross_entropy
    return torch._C._nn.cross_entropy_loss(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155085
Approved by: https://github.com/mikaylagawarecki
2025-06-06 07:00:04 +00:00
07da8a469b [CI] fix xpu-smi hang issue on some xpu runners (#155194)
To workaround  xpu-smi hang issue on some XPU runners, refer https://github.com/pytorch/pytorch/actions/runs/15431583674/job/43431289026?pr=154962
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155194
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-06-06 06:51:26 +00:00
e694280d12 Custom FX pass for inductor's backend registration (#154841)
This PR is related to RFC #153532. It is an extension to Inductor's backend registration interface to allow to register custom FX passes by the backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-06 06:49:44 +00:00
c6b4f98625 Add Intel GPU info collection to the collect env script (#137846)
As title, add Intel GPU info collection to the collect env script

Output examples:
1. CPU on Windows
```
C:\Users\user\miniforge3\envs\py310\lib\site-packages\torch\_subclasses\functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise (10.0.22631 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Name: 12th Gen Intel(R) Core(TM) i7-1270P
Manufacturer: GenuineIntel
Family: 198
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 1711
MaxClockSpeed: 2200
L2CacheSize: 9216
L2CacheSpeed: None
Revision: None

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] torch                     2.8.0.dev20250528+cpu          pypi_0    pypi
```

2. XPU on Windows
```
Collecting environment information...
PyTorch version: 2.8.0a0+gitef6306e
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro (10.0.19045 64-bit)
GCC version: (GCC) 13.1.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* 32.0.101.6795 (20250520000000.******+***)
Intel GPU models onboard:
* Intel(R) Arc(TM) A770 Graphics
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.33184', total_memory=15915MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 2401
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU1
CurrentClockSpeed: 2200
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.8.10+gitb3ea3a1
[pip3] numpy==2.1.2
[pip3] optree==0.13.1
[pip3] pytorch-triton-xpu==3.3.1+gitb0e26b73
[pip3] torch==2.8.0a0+gitef6306e
[conda] intel-extension-for-pytorch 2.8.10+gitb3ea3a1          pypi_0    pypi
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] mkl-dpcpp                 2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-datafitting   2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-stats         2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-vm            2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.1+gitb0e26b73          pypi_0    pypi
[conda] torch                     2.8.0a0+gitef6306e          pypi_0    pypi
```

3. CPU on Linux
```
/opt/python/cp312-cp312/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: AlmaLinux 8.10 (Cerulean Leopard) (x86_64)
GCC version: (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.28                                                                                                                                                                                                                                                                                                Python version: 3.12.10 (main, Apr 19 2025, 05:03:56) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] (64-bit runtime)                                             Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              88
On-line CPU(s) list: 0-87
Thread(s) per core:  2
Core(s) per socket:  22
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz
Stepping:            7
CPU MHz:             1000.000
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-21,44-65
NUMA node1 CPU(s):   22-43,66-87
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] Could not collect
```

5. XPU on Linux
```
Collecting environment information...
PyTorch version: 2.8.0.dev20250516+xpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.50-051550-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* intel_opencl: 24.39.31294.21-1032~22.04
* level_zero:   1.17.44.0-1022~22.04
Intel GPU models onboard:
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [1] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [2] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [3] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [4] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [5] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [6] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [7] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          224
On-line CPU(s) list:             0-223
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8480+
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              56
Socket(s):                       2
Stepping:                        6
CPU max MHz:                     3800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       5.3 MiB (112 instances)
L1i cache:                       3.5 MiB (112 instances)
L2 cache:                        224 MiB (112 instances)
L3 cache:                        210 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-55,112-167
NUMA node1 CPU(s):               56-111,168-223
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] pytorch-triton-xpu==3.3.0+git0bcc8265
[pip3] torch==2.8.0.dev20250516+xpu
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] numpy                     2.2.5                    pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.0+git0bcc8265          pypi_0    pypi
[conda] torch                     2.8.0.dev20250516+xpu          pypi_0    pypi
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137846
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-06 05:53:24 +00:00
d3d64c6db0 Revert "Add pinned numpy and fix build (#155129)"
This reverts commit a3098a74d494020dbb906c05ef047013e1921662.

Reverted https://github.com/pytorch/pytorch/pull/155129 on behalf of https://github.com/malfet due to Broke test_spectral_op, looks like missing xfail, see 0db3e0cf29/1 ([comment](https://github.com/pytorch/pytorch/pull/155129#issuecomment-2947951632))
2025-06-06 03:14:47 +00:00
0db3e0cf29 Revert "Add Intel GPU info collection to the collect env script (#137846)"
This reverts commit e1180c7228ba8c8b16cabf78706d4a67ca189a6b.

Reverted https://github.com/pytorch/pytorch/pull/137846 on behalf of https://github.com/malfet due to Breaks doc test, but should be easily fixable ([comment](https://github.com/pytorch/pytorch/pull/137846#issuecomment-2947935940))
2025-06-06 03:08:48 +00:00
28796f71d0 Redo D75092426: [internal] Expose additional metadata to compilation callbacks (#155063)
Originally https://github.com/pytorch/pytorch/pull/153596
---------------

Summary:
via reverting D75708685

gate the ROCm failure

Test Plan:
Unit tests in OSS, sandcastle

Rollback Plan:

Bifferential Revision: D75894349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155063
Approved by: https://github.com/masnesral
2025-06-05 23:40:31 +00:00
72453a6676 [PT2][comms] put visualize_overlap in a try-except block (#155222)
Summary:
For simple FSDP, this `visualize_overlap` function is throwing errors.

Seems to be a mistake here since `visualize_overlap` is called twice here and one is in try-except and one is not, so doing the same for both places.

Test Plan:
:)

Rollback Plan:

Reviewed By: Microve

Bifferential Revision: D75985733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155222
Approved by: https://github.com/yf225
2025-06-05 23:39:48 +00:00
9bae2fcf99 [profiler] Enable all configured activities in CUPTI Range profiler mode (#154749)
Summary: Updates the  pytorch range profiler mode (metrics mode) to support all trace activitity types.

Reviewed By: sraikund16

Bifferential Revision: D75568693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154749
Approved by: https://github.com/sraikund16
2025-06-05 23:38:54 +00:00
26f066bb61 Add AOTI model name config (#154129)
Summary: If a model name is specified in aoti config, the generated files will use that model name as file stem.

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_using_model_name_for_files
```

Bifferential Revision: D75102034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154129
Approved by: https://github.com/desertfire
2025-06-05 23:38:11 +00:00
fa705f7912 [BE] minor refactor + some comments on behavior (#154695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154695
Approved by: https://github.com/masnesral, https://github.com/eellison
2025-06-05 23:00:46 +00:00
9e88d6c857 [ROCm] manywheel missing hipsparselt deps (#155254)
Bundle libhipsparselt.so and auxiliary files into wheel.

Dependency added by hipsparselt integration #150578.

Fixes #155207.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155254
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-05 22:45:36 +00:00
e1180c7228 Add Intel GPU info collection to the collect env script (#137846)
As title, add Intel GPU info collection to the collect env script

Output examples:
1. CPU on Windows
```
C:\Users\user\miniforge3\envs\py310\lib\site-packages\torch\_subclasses\functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise (10.0.22631 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Name: 12th Gen Intel(R) Core(TM) i7-1270P
Manufacturer: GenuineIntel
Family: 198
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 1711
MaxClockSpeed: 2200
L2CacheSize: 9216
L2CacheSpeed: None
Revision: None

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] torch                     2.8.0.dev20250528+cpu          pypi_0    pypi
```

2. XPU on Windows
```
Collecting environment information...
PyTorch version: 2.8.0a0+gitef6306e
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro (10.0.19045 64-bit)
GCC version: (GCC) 13.1.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: N/A

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:06:35) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* 32.0.101.6795 (20250520000000.******+***)
Intel GPU models onboard:
* Intel(R) Arc(TM) A770 Graphics
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.33184', total_memory=15915MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 2401
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767
----------------------
Name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
Manufacturer: GenuineIntel
Family: 179
Architecture: 9
ProcessorType: 3
DeviceID: CPU1
CurrentClockSpeed: 2200
MaxClockSpeed: 2401
L2CacheSize: 24576
L2CacheSpeed: None
Revision: 21767

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.8.10+gitb3ea3a1
[pip3] numpy==2.1.2
[pip3] optree==0.13.1
[pip3] pytorch-triton-xpu==3.3.1+gitb0e26b73
[pip3] torch==2.8.0a0+gitef6306e
[conda] intel-extension-for-pytorch 2.8.10+gitb3ea3a1          pypi_0    pypi
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] mkl-dpcpp                 2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-datafitting   2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-stats         2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-vm            2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.1+gitb0e26b73          pypi_0    pypi
[conda] torch                     2.8.0a0+gitef6306e          pypi_0    pypi
```

3. CPU on Linux
```
/opt/python/cp312-cp312/lib/python3.12/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
Collecting environment information...
PyTorch version: 2.8.0.dev20250528+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: AlmaLinux 8.10 (Cerulean Leopard) (x86_64)
GCC version: (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7)
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.28                                                                                                                                                                                                                                                                                                Python version: 3.12.10 (main, Apr 19 2025, 05:03:56) [GCC 14.2.1 20250110 (Red Hat 14.2.1-7)] (64-bit runtime)                                             Python platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              88
On-line CPU(s) list: 0-87
Thread(s) per core:  2
Core(s) per socket:  22
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz
Stepping:            7
CPU MHz:             1000.000
CPU max MHz:         3700.0000
CPU min MHz:         1000.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-21,44-65
NUMA node1 CPU(s):   22-43,66-87
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] torch==2.8.0.dev20250528+cpu
[conda] Could not collect
```

5. XPU on Linux
```
Collecting environment information...
PyTorch version: 2.8.0.dev20250516+xpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.50-051550-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250101
Intel GPU driver version:
* intel_opencl: 24.39.31294.21-1032~22.04
* level_zero:   1.17.44.0-1022~22.04
Intel GPU models onboard:
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
* Intel(R) Data Center GPU Max 1550
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [1] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [2] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [3] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [4] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [5] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [6] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [7] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.31294+21', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          224
On-line CPU(s) list:             0-223
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8480+
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              56
Socket(s):                       2
Stepping:                        6
CPU max MHz:                     3800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       5.3 MiB (112 instances)
L1i cache:                       3.5 MiB (112 instances)
L2 cache:                        224 MiB (112 instances)
L3 cache:                        210 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-55,112-167
NUMA node1 CPU(s):               56-111,168-223
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.5
[pip3] pytorch-triton-xpu==3.3.0+git0bcc8265
[pip3] torch==2.8.0.dev20250516+xpu
[conda] mkl                       2025.1.0                 pypi_0    pypi
[conda] numpy                     2.2.5                    pypi_0    pypi
[conda] onemkl-sycl-blas          2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-dft           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-lapack        2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-rng           2025.1.0                 pypi_0    pypi
[conda] onemkl-sycl-sparse        2025.1.0                 pypi_0    pypi
[conda] pytorch-triton-xpu        3.3.0+git0bcc8265          pypi_0    pypi
[conda] torch                     2.8.0.dev20250516+xpu          pypi_0    pypi
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137846
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-05 22:35:04 +00:00
0a092c7de6 Enable CPP Extension Open Registration tests on Arm (#144774)
Enables most tests under CPP Extension Open Registration as they pass on Arm now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144774
Approved by: https://github.com/aditew01, https://github.com/fadara01, https://github.com/malfet
2025-06-05 22:32:28 +00:00
0827464002 Replace runtime type parameterization (#155221)
See:

```
>>> import timeit; print(f"OrderedSet[str](): {timeit.timeit('OrderedSet[str]()', setup='from torch.utils._ordered_set import OrderedSet', number=1000000):.6f}s, OrderedSet(): {timeit.timeit('OrderedSet()', setup='from torch.utils._ordered_set import OrderedSet', number=1000000):.6f}s")
```
> `OrderedSet[str]()`: 0.354622s, OrderedSet(): 0.095376s

Type parameterization should be on type hint, not in runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155221
Approved by: https://github.com/Skylion007, https://github.com/jansel
2025-06-05 21:43:54 +00:00
7dcc77e422 Turn on new tiling by default (#154768)
Turning on in fbcode to come. Also updates `max_tiles` to have a default value of None. The existing tiling logic doesn't really handle max_tiles=3 well, but we do in the new tiling logic, so we default to 3 in the new logic and 2 elsewhere unless max_tiles has been explicitly set.

TB runners have been very unstable recently (do we need to bump batch size ?) but e.g. for a [recent torchbench](https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Tue,%2027%20May%202025%2015:38:26%20GMT&stopTime=Tue,%2003%20Jun%202025%2015:38:26%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/eellison/803/head&lCommit=8480c220db4eb3c9e2b58d85a698d0a7113a6e37&rBranch=main&rCommit=0cd18ba1ca35d87916723d445c06664615dcae12) inference run we had 15 models with a lower execution time (i.g. green) and 2 models with higher (i.e.. red)

I am doing another run and will update here.

Dynamic shapes is not yet turned on because there are a lot of fixes to be done in splitting that don't work yet.. See:
```
(Pdb) p expr
((s25*s85)//32)
(Pdb) p FloorDiv(expr, expr)
((s25*s85)//(32*(((s25*s85)//32))))
```

and also - unbacked shape is not multiple of itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154768
Approved by: https://github.com/jansel
2025-06-05 21:34:09 +00:00
a85ad55525 [ROCm][Windows] Fix offload gpu arch list in tests (#155212)
Added fix to get ROCM_PROPERTY_ARCH_LIST value in set_target_properties in c10/cuda and caffe2 tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155212
Approved by: https://github.com/malfet
2025-06-05 20:30:28 +00:00
9a42f01586 [Cutlass] EVT dynamic shapes support (#154835)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154835
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #154829
2025-06-05 20:17:01 +00:00
5911f870c0 [Cutlass] fp8 dynamic shapes test (#154829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154829
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-06-05 20:17:01 +00:00
606d73bde4 Adding from_node for nodes in gm.module() (#155053)
Summary:
Adding "from_node" information that indicates which nodes are unlifted in `.module()` call.
The lifted nodes will have "ExportedProgram.module().unlift()" passname in the last entry of from_node.

Test Plan:
```
buck run fbcode//caffe2/test:test_export -- -r test_from_node_metadata_export
```

Rollback Plan:

Reviewed By: angelayi

Differential Revision: D75837494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155053
Approved by: https://github.com/angelayi
2025-06-05 20:11:56 +00:00
c8c892b4a5 [scan] disable functionalization key in backward tracing (#154343)
Previously, we didn't disable functionalization key when materializing backward graph. This causes the torch.zeros_like call for the case where grad is None to return a functional tensor that's not tracked by the proxy tensor mode.

This PR fixes it by putting the tracing code under disable functionalization ctx manager.

Fixes https://github.com/pytorch/pytorch/issues/153437.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154343
Approved by: https://github.com/zou3519
2025-06-05 20:06:33 +00:00
5e93abe3c0 Address docs for clip_grad functions (#155125)
This PR takes the opinionated stance that `torch.nn.utils.<func>` should be the preferred API over `torch.nn.utils.clip_grad.<func>`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155125
Approved by: https://github.com/albanD, https://github.com/mikaylagawarecki, https://github.com/janeyx99
2025-06-05 19:22:09 +00:00
dd41a3907c [MPS] Fix unary/binary ops for 2**32+ elem tensors (#155183)
By using `TensorIterator::with_32bit_indexing()` primitive

Add `bind_tensors` helper function that correctly sets up MPS tensors originating from TensorIterator

TODO: Add comments to bind_tensors as well asunit test, based on
```
python  -c "import torch;print((torch.rand(1, 1024, 1024, dtype=torch.bfloat16, device='mps') + torch.rand(5000, 1, 1, dtype=torch.bfloat16, device='mps')).sin())"
```

Fixes https://github.com/pytorch/pytorch/issues/154828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155183
Approved by: https://github.com/cyyever, https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #155150, #155178, #155184
2025-06-05 18:57:14 +00:00
05dd638ee9 Revert "Add dont constant fold flag (#154945)"
This reverts commit 196c95d463367f15999c0cddc9eb89031e9988ab.

Reverted https://github.com/pytorch/pytorch/pull/154945 on behalf of https://github.com/malfet due to This broke halide test sanity, see a3098a74d4/1 ([comment](https://github.com/pytorch/pytorch/pull/154945#issuecomment-2945598901))
2025-06-05 18:25:59 +00:00
a3098a74d4 Add pinned numpy and fix build (#155129)
Not sure why the online doc build passes but it fails locally with these broken strings...

Also pinning numpy version even though it is technically optional to ensure users have the right version as most users have numpy in their environment anyways.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155129
Approved by: https://github.com/janeyx99, https://github.com/svekars
2025-06-05 17:44:18 +00:00
2481c4b2ea [cutlass backend] add teraflops and increase rep for benchmark script (#154944)
Differential Revision: [D75840023](https://our.internmc.facebook.com/intern/diff/D75840023/)

I think I will continue to use do_bench for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154944
Approved by: https://github.com/mlazos
2025-06-05 17:20:29 +00:00
be2ab96347 Inductor unit tests: cuda 12.6 -> 12.8 (#155056)
Fixes #154938

When we update the Triton version in CI, we'll require cuda >= 12.8 for certain AOTI tests to pass: these AOTI tests try to run nvcc on the triton-generated PTX, and triton-generated PTX is PTX 8.7, which requires CUDA 12.8

Regarding the revert & reland:
* This PR causes the python 3.13 version to be bumped from 3.13.2 to 3.13.3. test_deopt_from_append_list starts unexpectedly passing on 3.13.3, so I originally modified the test in https://github.com/pytorch/pytorch/pull/155167 to xfail only for <=3.13.2
* However there was a land race with https://github.com/pytorch/pytorch/pull/150796, which introduced another test that passes only for >=3.13.3.

Resolution:
* @guilhermeleobas reverted https://github.com/pytorch/pytorch/pull/150796 so I will reland this (and I've merged the test_deopt_from_append_list change into this PR. And based on Guilherme's feedback, I'm just skipping the test instead of selectively failing/passing the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155056
Approved by: https://github.com/atalman, https://github.com/nWEIdia
2025-06-05 17:17:27 +00:00
cadcb5d368 [inductor] disable compiler on the compiled_module_main (#155169)
Fixes https://github.com/pytorch/pytorch/issues/154536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155169
Approved by: https://github.com/jamesjwu, https://github.com/bdhirsh
2025-06-05 16:37:45 +00:00
13ea0f2c0a [dynamo][dynamic] Recompilation hint for nn module integer attributes (#154867)
For program like this

```
class Mod(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.c = 0

    def forward(self, x):
        self.c += 1
        return x * self.c
```

You can check the recompile reasons at https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpzv9z6Q/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

![image](https://github.com/user-attachments/assets/856a95fd-0533-4abc-a213-1f73ae2cb766)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154867
Approved by: https://github.com/zou3519
2025-06-05 16:37:22 +00:00
a14f427db6 [BE] Update cudnn to 9.10.1.4 (#155122)
Follow up to #152782
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155122
Approved by: https://github.com/malfet, https://github.com/atalman
2025-06-05 16:07:25 +00:00
cd361fc247 [CI] Migrate focal (ubuntu 20.04) images to jammy (ubuntu 22.04) (#154437)
Fixes https://github.com/pytorch/pytorch/issues/154157

Inductor Workflows where moved from focal to jammy here: https://github.com/pytorch/pytorch/pull/154153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154437
Approved by: https://github.com/Skylion007, https://github.com/cyyever, https://github.com/davidberard98, https://github.com/huydhn
2025-06-05 15:24:07 +00:00
e895e9689c Update docs build to specify <3.13 in CONTRIBUTING (#155140)
Python 3.13 removed the deprecated imghdr module, so our docs build does not compile with 3.13+. Mention it in our contributing guide so people know before committing to the wrong version oop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155140
Approved by: https://github.com/drisspg, https://github.com/cyyever
ghstack dependencies: #155126
2025-06-05 15:16:48 +00:00
2f3f8339ec [BE] Document device memory apis in correct module (#155126)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155126
Approved by: https://github.com/msaroufim, https://github.com/Skylion007
2025-06-05 15:16:48 +00:00
7999735d23 [CUDA][MPS] Fix torch.arange bound validation for large float inputs (#154320)
Fixes #153133

Fixes an inconsistency in torch.arange on CUDA and MPS backends when using float32 and large input values. Previously, invalid ranges (e.g., start > end with a positive step) could silently return empty tensors due to precision loss in validation logic.

The fix introduces double precision validation for checking whether the step sign is consistent with the range direction.

This ensures torch.arange behaves consistently with CPU for large float32 inputs, and raises an appropriate error when the range is invalid.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154320
Approved by: https://github.com/malfet
2025-06-05 14:51:25 +00:00
ed661a5f11 [MPS] Fix complex scalar binding to Metal tensors (#155184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155184
Approved by: https://github.com/dcci
ghstack dependencies: #155150, #155178
2025-06-05 14:34:57 +00:00
9bf6593e96 Fix docstring for torch.UntypedStorage.from_file (#155067)
Fixes #130629

Happy to revert the second commit if we think it's making the test too fragile for the future

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155067
Approved by: https://github.com/malfet
2025-06-05 14:30:49 +00:00
a1057cda31 Revert "Add CPython generator/contextlib tests (#150796)"
This reverts commit d5f642211f14593c8c78af98a1fb7cfb63039ce5.

Reverted https://github.com/pytorch/pytorch/pull/150796 on behalf of https://github.com/guilhermeleobas due to This is breaking tests on trunk. https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=3.13&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/150796#issuecomment-2944469866))
2025-06-05 13:51:54 +00:00
196c95d463 Add dont constant fold flag (#154945)
For support https://github.com/pytorch/ao/issues/2228
> What we want to do now is to enable FP8 quantization in PyTorch. And similar as INT8 quantization, we need to insert quantize and dequantize ops into the graph.
>
> However we met problems with these q/dq ops both in the PyTorch core and Torchao.
>
> PyTorch core:
>
> The quantize_per_tensor op does not support FP8. We want to fix it via https://github.com/pytorch/pytorch/pull/153601. And as you commented, the op is deprecated.
> Torchao:
>
> In the fusion pass in Inductor, we want to match the pattern fp8_weight -> torchao.dequantize_affine_float8 -> fp32_op and fuse it as fp8_weight -> weight_pack -> fp8_op. We have done so for INT8 PT2E quantization. However, the pattern matching pass is applied after a constant folding pass in Inductor:
> 100ec0b34a/torch/_inductor/fx_passes/freezing_patterns.py (L69C1-L74C1)
> After constant_fold(gm), the pattern will be folded as fp32_weight -> fp32_op. Then the original pattern cannot be found any more and the FP8 semantics is lost since the pattern is entirely in fp32 now.
> For INT8, the int8_weight -> quantized_decomposed.dequantize_per_channel -> fp32_op pattern won't be folded because we mark quantized_decomposed.dequantize_per_channel impure so that it won't be folded: 100ec0b34a/torch/_inductor/constant_folding.py (L139C1-L149C1) . But for the torchao.dequantize_affine_float8, we cannot do this because
> It is an op from Torchao, which is unknown to the constant folder
> It is decomposed to smaller ops, so we cannot put it in the list as a single op.
> So, we think an easy and short-term solution is to modify the ops in PyTorch core via https://github.com/pytorch/pytorch/pull/153601.
> However, if we want to resolve the issue with Torchao, we need to
> Add a method in the constant folder in Inductor to allow registration of impure ops

Based on [Jansel‘s reply](https://github.com/pytorch/ao/issues/2228#issuecomment-2914560340), add dont constant fold flag on this patch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154945
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-05 13:42:44 +00:00
e01fde8213 Revert "[reland][dynamo] Record the pre-graph bytecode using fast record function event (#154974)"
This reverts commit bee9c70c5d4b681ec1f2adf92eca1205b372634a.

Reverted https://github.com/pytorch/pytorch/pull/154974 on behalf of https://github.com/malfet due to Broke inductor tests, see 3c72b9fd8f/1 ([comment](https://github.com/pytorch/pytorch/pull/154974#issuecomment-2944370617))
2025-06-05 13:36:21 +00:00
3c72b9fd8f Revert "SDPA support gfx950 (#155103)"
This reverts commit b9312c56bf5f277e341c0185da748e3475d0807f.

Reverted https://github.com/pytorch/pytorch/pull/155103 on behalf of https://github.com/malfet due to looks like it broke mi300 tests, see 9a4c08ddfc/1 ([comment](https://github.com/pytorch/pytorch/pull/155103#issuecomment-2944331460))
2025-06-05 13:33:17 +00:00
523b637cbe Revert "[test][dynamo] skip test_deopt_from_append_list on python>=3.13.3 (#155167)"
This reverts commit 1c828786c28b8cd2a6be2397cc2af65e3266c5fa.

Reverted https://github.com/pytorch/pytorch/pull/155167 on behalf of https://github.com/malfet due to This broke a bunch of 3.13 tests, see fa3c38c7ae/1 ([comment](https://github.com/pytorch/pytorch/pull/155167#issuecomment-2944318067))
2025-06-05 13:27:40 +00:00
f60b2712dd Revert "Inductor unit tests: cuda 12.6 -> 12.8 (#155056)"
This reverts commit bb43ced6e2c9e1cdc17923826aaf58466c2ffd4b.

Reverted https://github.com/pytorch/pytorch/pull/155056 on behalf of https://github.com/malfet due to This broke a bunch of 3.13 tests, see fa3c38c7ae/1 ([comment](https://github.com/pytorch/pytorch/pull/155167#issuecomment-2944318067))
2025-06-05 13:27:40 +00:00
9a4c08ddfc [MPS] Parametrize test_scaled_dot_product_attention_autocast (#155005)
Also moving comments inside the function scope for some of my previous regression tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155005
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-05 13:24:53 +00:00
fa3c38c7ae Add tensor overlap check for cross (#154999)
Fixes #132031

## Test Result

```python
In [1]: import torch
   ...: torch.manual_seed(0)
   ...: torch.cuda.manual_seed(0)
   ...: a = torch.randn(3, 4)
   ...: b = torch.randn(3, 4)
   ...: torch.cross(a, b, out=a)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 6
      4 a = torch.randn(3, 4)
      5 b = torch.randn(3, 4)
----> 6 torch.cross(a, b, out=a)

RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154999
Approved by: https://github.com/lezcano
2025-06-05 10:00:01 +00:00
5b65628906 Workflow to tag trunk commits with trunk/{commit-sha} tags (#155170)
This PR adds workflow to automate tagging commits on the `main` branch. The workflow includes validation and retry with exponential backoff.

The rationale for this is to work around the github limitation on using workflow_dispatch (requires branch or tag). We want to use workflow_dispatch to rerun CI workflows with parameters (trunk, pull, etc).

---

### Testing

Tested using almost identical workflow in a personal repo (the difference is in repository_owner check and backoff settings).

* successful tag push:
   https://github.com/izaitsevfb/deleteme/actions/runs/15454729765/job/43504630765

* validation: PR commit (fails)
   https://github.com/izaitsevfb/deleteme/actions/runs/15454743572/job/43504669720

* tagging of the old commit on main:
   https://github.com/izaitsevfb/deleteme/actions/runs/15453805748/job/43501885903

* tag already exists:
   https://github.com/izaitsevfb/deleteme/actions/runs/15454756077/job/43504706980

* invalid sha on workflow dispatch:
   https://github.com/izaitsevfb/deleteme/actions/runs/15454611077/job/43504286858

* retry with exponential backoff on failure (via tag rule blocklist):
   https://github.com/izaitsevfb/deleteme/actions/runs/15454768346/job/43504743486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155170
Approved by: https://github.com/huydhn
2025-06-05 09:50:58 +00:00
bee9c70c5d [reland][dynamo] Record the pre-graph bytecode using fast record function event (#154974)
reland of https://github.com/pytorch/pytorch/pull/154769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154974
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
2025-06-05 07:25:04 +00:00
be16f21ca6 [Graph Partition] add symints to get_graph_inputs (#154679)
During `codegen_inputs`, we check whether there are undefined symbols:
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1668-L1674)

Previously, for graph partition inputs, we do not explicitly add symints.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L3265-L3272)
We relied on sizes/strides of TensorBox for codegen symint inputs.  For example, a tensor with shape `[s0, 2]` will implicitly codegen `s0` as an input here. This works fine in most cases since backed symint has to come from some tensor shapes.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1624-L1632)

In rare cases, this does not work. One example is saved tensors for backward where a tensor may have shape `[2*s0, 2]`. Since `2*s0` is an expression but not a symbol, `codegen_input_symbol_assignment` would not handle `s0` and later there would be an error when `_verify_input_symbol_assignment`.

The fix is add symints to `get_graph_inputs`. An alternative way is to update `codegen_input_symbol_assignment` but I want to minimize the change to graph partition only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154679
Approved by: https://github.com/eellison
2025-06-05 06:46:28 +00:00
d3c8f36ba0 Revert "[Intel GPU] Make SDPA output has the same stride as Query. (#154340)"
This reverts commit 0f10df71a66cb1b0c3659381b7db8e06d95f0d67.

Reverted https://github.com/pytorch/pytorch/pull/154340 on behalf of https://github.com/etaf due to This PR breaks hugging face E2E run on XPU. ([comment](https://github.com/pytorch/pytorch/pull/154340#issuecomment-2942954192))
2025-06-05 06:46:24 +00:00
bb43ced6e2 Inductor unit tests: cuda 12.6 -> 12.8 (#155056)
Fixes #154938

When we update the Triton version in CI, we'll require cuda >= 12.8 for certain AOTI tests to pass: these AOTI tests try to run nvcc on the triton-generated PTX, and triton-generated PTX is PTX 8.7, which requires CUDA 12.8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155056
Approved by: https://github.com/atalman, https://github.com/nWEIdia
ghstack dependencies: #155167
2025-06-05 05:59:06 +00:00
1c828786c2 [test][dynamo] skip test_deopt_from_append_list on python>=3.13.3 (#155167)
Not sure why, apparently this test starts passing on python 3.13.3 (while it fails on python <=3.13.2) and it's causing unexpected passes on xfail-ed tests when newer versions of python are used, e.g. in #155056.

Verified locally in a python 3.13.1 vs. python 3.13.3 conda env.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155167
Approved by: https://github.com/williamwen42
2025-06-05 05:59:06 +00:00
93012d2290 Revert "[forward fix] add support for MemoryFormat after type tightening (#154658)"
This reverts commit 0fdd568b785812da86e69d65632de77d2ee945c7.

Reverted https://github.com/pytorch/pytorch/pull/154658 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154658#issuecomment-2942752048))
2025-06-05 05:01:40 +00:00
5130ac64f4 Revert "Add randint_like tensor overload for high (#154899)"
This reverts commit 72fe1d5f42aa9bffa876932a3b4fcae052b99168.

Reverted https://github.com/pytorch/pytorch/pull/154899 on behalf of https://github.com/seemethere due to Failing internal tests see https://fburl.com/diff/bai044ob ([comment](https://github.com/pytorch/pytorch/pull/154899#issuecomment-2942740661))
2025-06-05 04:54:05 +00:00
80703ca332 [FlexAttention] Allow dispatch to SAC for flex (#150080)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150080
Approved by: https://github.com/zou3519
2025-06-05 04:34:27 +00:00
fa63de0866 Handle empty linemaps in PyCodeCache (#155064)
Some functions have empty linemaps, and if you call `PyCodeCache.stack_frames_for_code` on code in the wrong order, you'll end up triggering a too many values to unpack issue: https://github.com/pytorch/pytorch/issues/154536

Specifically, if you populate PyCodeCache's linemap via caching, and then request the stack frames of a inductor generated output file that has an empty linemap, this function will try to unpack too many arguments.

Test plan:
```
import os

os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
os.environ["TORCHINDUCTOR_AUTOGRAD_CACHE"] = "1"

import torch

@torch.compile
def fn(x: torch.Tensor):
    (x_grad,) = torch.autograd.grad(x.sum(), x)
    return x_grad

x = torch.randn(10, 10, requires_grad=True)
result = fn(x)
```

Run this twice and see that everything works as expected.

It's hard to exactly pinpoint a good unit test for this: it requires a whole lot of moving parts to get the issue to trigger because:

- The callsite in question in dynamo, without caching, will always run before generating the code, so cls.linemaps[path] will be None most of the time
- The inductor generated output needs to call *back* into dynamo via `assert_size_stride`
- In our test case, the CompiledBackward needs to not have linemaps, and also be called in the middle of a graph break while compiling a different cached function. Caching switches the order the PyCodeCache.linemap is populated (i.e. either before or after the graph break is evaluated), which causes the issue.

All these things need to interact together to create the bug, so it's a bit difficult to write a simple unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155064
Approved by: https://github.com/bdhirsh
2025-06-05 03:54:35 +00:00
450180fbcd [c10d][fr] Add the log of thread name and thread id into fr (#155142)
There is an ask from internal head users to have thread id and thread name inside fr. This would be useful to users when it comes to cases when we launches collectives not just on main thread as well.

Differential Revision: [D75973919](https://our.internmc.facebook.com/intern/diff/D75973919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155142
Approved by: https://github.com/kwen2501
2025-06-05 03:33:01 +00:00
b9312c56bf SDPA support gfx950 (#155103)
Summary: Seems to run, just not the optimal performance. e.g. ck_tile doesn't have those gfx942 optimizations it seems https://github.com/ROCm/composable_kernel/blob/develop/include/ck_tile/ops/fmha/block/variants.hpp#L27

Test Plan:
```
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Mem Eff Time (µs) |   Math Time (µs) |   Flex Time (µs) |   xformers Time (µs) |   Flash TFlops |   Mem Eff TFlops |   Math TFlops |   Flex TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (MemEff/Math) |   Speedup (Flex/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+=====================+==================+==================+======================+================+==================+===============+===============+===================+========================+=========================+=======================+===========================+======================+===================+
|            1 |              4096 |      16 |         64 |           179.737 |             182.874 |          3106.6  |          359.662 |              205.506 |        382.334 |          375.776 |       22.1205 |       191.067 |           334.392 |                17.2841 |                 16.9877 |               8.63754 |                   15.1169 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      32 |        128 |           617.271 |             623.38  |          7169.73 |          998.961 |              654.534 |        445.312 |          440.947 |       38.3387 |       275.164 |           419.96  |                11.6152 |                 11.5014 |               7.17719 |                   10.9539 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |         64 |           667.032 |             670.118 |         13031.8  |         1383.42  |              768.452 |        412.091 |          410.193 |       21.0928 |       198.694 |           357.703 |                19.5371 |                 19.4471 |               9.42    |                   16.9586 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |        128 |          2074.64  |            2214.81  |         29186.9  |         3916.35  |             2404.29  |        529.978 |          496.437 |       37.6714 |       280.749 |           457.313 |                14.0684 |                 13.1781 |               7.45257 |                   12.1395 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |         64 |          2456.6   |            2472.38  |         51095.8  |         5647.01  |             3008.09  |        447.574 |          444.718 |       21.5186 |       194.707 |           365.518 |                20.7994 |                 20.6666 |               9.0483  |                   16.9861 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |        128 |          8048.8   |            8070.96  |        113478    |        15580.8   |             9768.71  |        546.423 |          544.922 |       38.7569 |       282.274 |           450.218 |                14.0987 |                 14.06   |               7.2832  |                   11.6165 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Mem Eff Time (µs) |   Math Time (µs) |   Flex Time (µs) |   xformers Time (µs) |   Flash TFlops |   Mem Eff TFlops |   Math TFlops |   Flex TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (MemEff/Math) |   Speedup (Flex/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+=====================+==================+==================+======================+================+==================+===============+===============+===================+========================+=========================+=======================+===========================+======================+===================+
|            1 |              4096 |      16 |         64 |           692.323 |             697.649 |          4241.81 |          1562.26 |              906.441 |        248.148 |          246.254 |       40.5012 |      109.968  |           189.531 |                6.12693 |                 6.08015 |               2.71518 |                   4.67963 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      32 |        128 |          2263.22  |            2267.38  |          9482.64 |          7003.8  |             2765.5   |        303.636 |          303.079 |       72.4687 |       98.1174 |           248.489 |                4.1899  |                 4.18221 |               1.35393 |                   3.42891 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |         64 |          2553.94  |            2572.68  |         15909.8  |          5697.16 |             3284.77  |        269.073 |          267.112 |       43.193  |      120.621  |           209.206 |                6.22953 |                 6.18415 |               2.79259 |                   4.84352 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |        128 |          8187.67  |            8201.71  |         35449.2  |         26424.3  |            10364.5   |        335.722 |          335.147 |       77.5413 |      104.025  |           265.21  |                4.32959 |                 4.32218 |               1.34154 |                   3.42025 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |         64 |          9948.15  |            9815.47  |         62815.1  |         23741.9  |            12710     |        276.31  |          280.046 |       43.7598 |      115.778  |           216.269 |                6.31425 |                 6.39961 |               2.64575 |                   4.94217 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |        128 |         32187.6   |           32035.6   |        137832    |        102075    |            40623.4   |        341.595 |          343.216 |       79.7716 |      107.716  |           270.66  |                4.28216 |                 4.30248 |               1.35031 |                   3.39293 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+---------------------+------------------+------------------+----------------------+----------------+------------------+---------------+---------------+-------------------+------------------------+-------------------------+-----------------------+---------------------------+----------------------+-------------------+

```

Rollback Plan:

Differential Revision: D75934358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155103
Approved by: https://github.com/yoyoyocmu
2025-06-05 03:26:38 +00:00
a01bb9da14 [CI][CUDA] Re-enable the test-nan-assert on CUDA12 (#154448)
We need to reenable this test because there are recent changes that could be relevant to test_nan_assert.

I've already tested that there would be hang if we don't remove the "pg._allgather_base(output, nan_tensor)" in between the "backend._set_enable_nan_check" calls.
Why was it "working" previously? Because previously only cu118 distributed was running and this "backend._set_enable_nan_check" change was not tested in the merge process (skip logic is if "not CUDA 12 and above", skip).

Workaround #153479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154448
Approved by: https://github.com/kwen2501
2025-06-05 02:09:31 +00:00
5e03433443 Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit e5afbe31245287a92fe328c404b3557e5c5eca73.

Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/malfet due to Broke rocm, see 642687af29/1 ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-2942415600))
2025-06-05 01:38:13 +00:00
642687af29 [MPS][BE] Some refactor in preparation for 64-bit iterators (#155178)
set input/output tensors only once

Get rid of `is_storage_dense` predicate, as `iter.is_contiguous` serves the same purpose
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155178
Approved by: https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #155150
2025-06-05 01:24:31 +00:00
3398d1d459 support bmm and mm_plus_mm in generated templates cache (#154904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154904
Approved by: https://github.com/drisspg, https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #154891, #154892
2025-06-05 00:36:01 +00:00
21f45f7afb Add CPython int/float tests (#150795)
Tests:
* test_int.py
* test_int_literal.py
* test_float.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:

```bash
for f in "test_int" "test_int_literal" "test_float"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150795
Approved by: https://github.com/williamwen42
2025-06-05 00:28:53 +00:00
d5f642211f Add CPython generator/contextlib tests (#150796)
Tests:
* test_generator.py
* test_generator_stop.py
* test_contextlib.py

Minor changes were made to each test to run them inside Dynamo. We
intentionally didn't copy the binary files stored in
`python/Lib/test/archivetestdata` for security reasons. There's a single
test that requires a binary file and it is skipped because of that.

The tests were downloaded from CPython 3.13 and the diff was generated
using `git diff` to apply the changes:

```bash
for f in "test_contextlib" "test_generators" "test_generator_stop"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150796
Approved by: https://github.com/williamwen42
2025-06-05 00:18:29 +00:00
fb5a787a8f [HOP] Added clone for outputs of create_bw_fn that are aliasing the inputs (#153932)
This PR fixes an issue with the new way of creating the bw graph introduced for cond. In particular, there is an issue if the bw function simply aliases the inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153932
Approved by: https://github.com/ydwu4
2025-06-04 23:52:52 +00:00
b0a2ca65ef support more prologue functions in generated templates cache (#154892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154892
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #154891
2025-06-04 23:45:36 +00:00
51b4c51973 add missing check for caching triton template caching (#154891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154891
Approved by: https://github.com/eellison
2025-06-04 23:45:36 +00:00
1083bc749d [Memory Snapshot] Add Flag to Toggle Global and Local Callbacks for Annotations (#154932)
Summary:
There are some cases where we want only local annotations for memory snapshot such as executing inside the cudastream callback, which cannot execute CUDA operators. Thus the cuda errors happen: Exception in RecordFunction callback: CUDA error: operation not permitted

However, we need to have an option to turn on the globally so that on-demand snapshot can get annotations. Additionally, there may be some cases in which auto-trace will also want annotations using record functions so we expose the flag to the auto-trace as well.

Test Plan:
Run MVAI executable and see that the errors go away

Rollback Plan:

Differential Revision: D75831687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154932
Approved by: https://github.com/mzzchy, https://github.com/sanrise
2025-06-04 23:15:19 +00:00
7cf5b36ec2 Release GIL in PG destructor (#154976)
Summary: Gloo PG doesn't release GIL, which results in python code hanging until the destructor completes. The destructor waits for all work on the PG to complete which can take a long time.

Test Plan: Ran

```
$ pytest --log-cli-level=INFO -vs torchft/local_sgd_integ_test.py
```

with a large timeout on the async work. Call to `gil_scoped_release` doesn't show up in the gdb stack trace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154976
Approved by: https://github.com/d4l3k, https://github.com/dcci, https://github.com/fduwjj
2025-06-04 23:10:55 +00:00
c881f2ddf3 [reland][dynamo] Mark a vt unspecialized nn module variable source earlier (#155099)
Reland of https://github.com/pytorch/pytorch/pull/154780

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155099
Approved by: https://github.com/williamwen42
2025-06-04 23:05:36 +00:00
992be94dab [MPS][BE] Better error messages (#155150)
"Can't be indexed using 32-bit iterator" is not really helpful error
This PR distinguishes between error from old indexing helper function as well as to binaryTensorIterator
Adds the same warning to unary op, otherwise it just runs and returns incorrect value

Test plan (manual, don't have machine with enough RAM to run it reliable in CI):
```
%  python  -c "import torch;print(torch.rand(1, 1024, 1024, dtype=torch.bfloat16, device='mps') + torch.rand(5000, 1, 1, dtype=torch.bfloat16, device='mps'))"
RuntimeError: add can't be indexed using 32-bit iterator for shape [1048576, 5000]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155150
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-06-04 22:53:51 +00:00
f5e2e4c4f1 [Inductor] Include math and torch in launcher scope (#154673)
Summary:
For grid computation, if we have sympy, it is possible we have math and torch used.
We include the math and torch module in the launcher scope to make sure those grid get computed correctly.

Test Plan: Check phabricator for internal cmd.

Differential Revision: D75642931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154673
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2025-06-04 22:32:19 +00:00
671553bd23 Update documentation wording for transformer-related layers (#155123)
<img width="947" alt="Screenshot 2025-06-04 at 1 33 53 PM" src="https://github.com/user-attachments/assets/4dbb66b3-43f4-4d04-afb5-dc80cec0f2cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155123
Approved by: https://github.com/albanD, https://github.com/jbschlosser
2025-06-04 22:20:32 +00:00
6f23ca53bb [dynamo] sample gb_registry json file for website testing purposes (#155160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155160
Approved by: https://github.com/StrongerXi, https://github.com/williamwen42
2025-06-04 22:14:48 +00:00
c8566a0b98 [export] Use patching in test (#155132)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155132
Approved by: https://github.com/pianpwk
2025-06-04 21:41:26 +00:00
65a5eb8d27 Fix for ambiguity in linalg.norm()'s ord argument of +2 & -2 (#155148)
Fixes #136453

### Description
---
Fixed the ambiguity by referencing a hyperlink to wikipedia's SVD/Singular Values section as per past discussion (by other contributors) on the above thread.

In the ord argument, for values `+2` and `-2`, the `singular value` now points to [this section of singular values on the wiki SVD page](https://en.wikipedia.org/wiki/Singular_value_decomposition#Singular_values,_singular_vectors,_and_their_relation_to_the_SVD).

### Why not mention SVD
---
For conciseness (expanding 'largest singular value' -> 'largest singular value of a SVD' is too much, i think, wrt rest of the table)

---

I hope this is satisfactory. Please let me know if I have missed anything essential; cheers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155148
Approved by: https://github.com/Skylion007, https://github.com/lezcano
2025-06-04 21:15:20 +00:00
b084e1b81c [HOP] Rework Autograd DispatchKey for scan and map (#153336)
This PR introduces the `py_autograd_impl` instead of the `DispatchKey.Autograd` for some HOPs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153336
Approved by: https://github.com/ydwu4
2025-06-04 20:54:02 +00:00
0404785f3b [dynamo] [3/3] added cmd_update_gb_type which supports updating an existing gb_type properties and optional arg to change gb_type name (#154985)
The user can now use the terminal to update the registry whenever they update an existing gb_type's properties. Additionally, if the user changes the gb_type description itself, they can update the registry as well.

Terminal command template for updating existing gb_type: python [path to gb_id_mapping.py] update "existing_gb_type" [path to file where user added callsite]

Terminal command template for updating existing gb_type name (can also be used if the user changed the other properties as well including the gb_type name): python [path to gb_id_mapping.py] update "existing_gb_type" [path to file where user added callsite] --new_gb_type "new_name_for_existing_gb_type"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154985
Approved by: https://github.com/williamwen42
2025-06-04 20:10:02 +00:00
e5afbe3124 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```python
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-06-04 20:03:46 +00:00
4d576442e9 Fix incorrect get_default_qat_qconfig in prepare_qat_fx docs. (#155100)
Fixes #144522

## Description

FX QAT docs for prepare_qat_fx incorrectly used get_default_qat_qconfig when it should use get_default_qat_qconfig_mapping for a qconfig_mapping.

Previous example code incorrectly used `get_default_qat_qconfig`, resulting in a qconfig being incorrectly
passed to `prepare_qat_fx`.    `prepare_qat_fx` requires  a `qconfig_mapping`, not a single `qconfig`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155100
Approved by: https://github.com/jerryzh168
2025-06-04 18:51:40 +00:00
6c8241c089 [dynamo] [2/3] added add_new_gb_type functionality (#154886)
The user can now use the terminal to update the registry whenever they create a new unimplemented_v2() callsite.
Terminal command template: python [path to gb_id_mapping.py] add "new_gb_type" [path to file where user added callsite]
Before the user added a new gb_type:
<img width="619" alt="Screenshot 2025-06-02 at 1 33 54 PM" src="https://github.com/user-attachments/assets/7258cab1-a184-4200-9d56-7b21d243d6d8" />
After the user added a new gb_type:
<img width="366" alt="Screenshot 2025-06-02 at 1 34 47 PM" src="https://github.com/user-attachments/assets/5c383e94-268c-4f6d-9111-7b18c856222e" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154886
Approved by: https://github.com/williamwen42
ghstack dependencies: #154738
2025-06-04 18:44:37 +00:00
681a8189d7 [dynamo] [1/3] updated gbid mapping for initial registry creation (#154738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154738
Approved by: https://github.com/williamwen42
2025-06-04 18:44:37 +00:00
197080337b [AOTI] Extend torchgen to generate C shim with version number (#147745)
Summary: While it is ok to add a new arg with defaul value to a fallback op in Python, it will be BC-breaking for the C shim. This PR adds an automatic approach to update C shim files when specifying a version number with a list of new args for the modified op. See https://github.com/pytorch/pytorch/pull/154848 as an example on how to do that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147745
Approved by: https://github.com/yushangdi
2025-06-04 18:40:34 +00:00
1d67849e43 [AOTInductor] Activate CPU test for package and update weights (#155078)
Summary:
looks like CPU is enabled for update_constant_buffer in D71177509

enable these tests as well.

Test Plan:
```
 buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_without_weight" -v
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_user_managed_weight" -v
buck2 test @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_update_weights" -v
```

Rollback Plan:

Differential Revision: D75908993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155078
Approved by: https://github.com/angelayi
2025-06-04 17:57:20 +00:00
956716880f [c10d][gloo] Enable using c10::Half for gloo (#153862)
Testing with https://github.com/pytorch/gloo/pull/446 and we see that the numerical issues reported in https://github.com/pytorch/pytorch/issues/152300 is indeed resolved and we added a unit test for it. Also update submodule gloo to reflect the change on the gloo side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153862
Approved by: https://github.com/d4l3k, https://github.com/clee2000, https://github.com/malfet
2025-06-04 17:53:08 +00:00
9eb7e67727 [PT2][memory] correct wait tensor output size (#153569)
This PR correctly handles the output buffer size of wait tensor nodes.
![image](https://github.com/user-attachments/assets/fdcc5eb7-58cf-42a2-84b2-ce949cb9db92)

See [this doc](https://docs.google.com/document/d/1lkKulwIb-fYL_p8jn1SD6Lh1PoAKBgpBsU5sAH80leI/edit?tab=t.0#bookmark=id.w3n4k1y4rdz8) with testing details [internal only]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153569
Approved by: https://github.com/eellison
2025-06-04 17:49:25 +00:00
34c6371d24 Add NVSHMEM to PYTORCH_EXTRA_INSTALL_REQUIREMENTS (#154568)
NVSHMEM 3.2.5 (released Mar 2025) have both cu11 and cu12 builds.
See:
https://pypi.nvidia.com/nvidia-nvshmem-cu12/
https://pypi.nvidia.com/nvidia-nvshmem-cu11/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154568
Approved by: https://github.com/atalman
ghstack dependencies: #154538
2025-06-04 17:43:24 +00:00
b3e666ae17 [easy] Bump STATIC_CUDA_LAUNCHER_VERSION=1 (#154861)
This turns on STATIC_CUDA_LAUNCHER internally for a some low risk entitlements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154861
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-06-04 17:38:06 +00:00
e9c31fb86d [torch.compile] handle a custom __delattr__ method correctly (#150899)
Fixes #150765
- handle a custom __delattr__ method correctly

Test:
```
import torch

class MyObject:
    def __init__(self, val):
        self.val = val
        # Flag to track deletion attempts instead of using print
        self.deletion_attempted = False

    def __delattr__(self, attr):
        if attr == "val":
            # Set flag instead of printing
            self.deletion_attempted = True
        else:
            super().__delattr__(attr)

@torch.compile(fullgraph=True, backend="eager")
def test(input_tensor):
    instance_a = MyObject(1)
    instance_b = MyObject(2)

    del instance_a.val
    del instance_b.val
    exists_a = hasattr(instance_a, 'val')
    exists_b = hasattr(instance_b, 'val')
    deletion_attempted_a = instance_a.deletion_attempted
    deletion_attempted_b = instance_b.deletion_attempted

    return input_tensor + 1, exists_a, exists_b, deletion_attempted_a, deletion_attempted_b

# Run the test
result = test(torch.ones(1))
print(f"Result tensor: {result[0]}")
print(f"val attribute still exists on instance_a: {result[1]}")
print(f"val attribute still exists on instance_b: {result[2]}")
print(f"Deletion was attempted on instance_a: {result[3]}")
print(f"Deletion was attempted on instance_b: {result[4]}")

```

output:
```
(base) sany@sandishs-Laptop pytorch % python3 test_delattr_fix.py
Result tensor: tensor([2.])
val attribute still exists on instance_a: True
val attribute still exists on instance_b: True
Deletion was attempted on instance_a: True
Deletion was attempted on instance_b: True
```

```
(pytorch-dev) sany@sandishs-Laptop pytorch % python3 -m pytest test/dynamo/test_repros.py::ReproTests::test_delattr_return -v
========================================================= test session starts =========================================================
platform darwin -- Python 3.12.5, pytest-8.3.5, pluggy-1.5.0 -- /Library/Frameworks/Python.framework/Versions/3.12/bin/python3
cachedir: .pytest_cache
rootdir: /Users/sany/git/pytorch
configfile: pytest.ini
plugins: typeguard-4.3.0
collected 1 item
Running 1 items in this shard

test/dynamo/test_repros.py::ReproTests::test_delattr_return PASSED [0.0659s]                                                    [100%]

========================================================== 1 passed in 1.71s ==========================================================
(pytorch-dev) sany@sandishs-Laptop pytorch %
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150899
Approved by: https://github.com/jansel, https://github.com/StrongerXi
2025-06-04 17:27:20 +00:00
4405dc1487 Revert "Always set CPU affinity for benchmark jobs (#154569)"
This reverts commit 629fca295e1257c2c54d1b6316ed4fa00e6044d6.

Reverted https://github.com/pytorch/pytorch/pull/154569 on behalf of https://github.com/anijain2305 due to potentially causing compile time regressions, unsure ([comment](https://github.com/pytorch/pytorch/pull/154569#issuecomment-2940737778))
2025-06-04 16:52:15 +00:00
8f08f90b61 Bump pillow from 10.0.1 to 10.3.0 in /.github/requirements (#154416)
Bumps [pillow](https://github.com/python-pillow/Pillow) from 10.0.1 to 10.3.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/10.0.1...10.3.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 10.3.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-04 09:37:13 -07:00
aed938f3a8 Enable check_gomp for Ubuntu OSes (#155119)
And ARM platform
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155119
Approved by: https://github.com/atalman
2025-06-04 15:57:08 +00:00
20912673a6 Revert "Add __main__ guards to jit tests (#154725)"
This reverts commit 1a55fb0ee87eaa8b376aaa82d95d213fe0fbe64b.

Reverted https://github.com/pytorch/pytorch/pull/154725 on behalf of https://github.com/malfet due to This added 2nd copy of raise_on_run to common_utils.py which caused lint failures, see https://github.com/pytorch/pytorch/actions/runs/15445374980/job/43473457466 ([comment](https://github.com/pytorch/pytorch/pull/154725#issuecomment-2940503905))
2025-06-04 15:42:52 +00:00
6f93ce3c86 Revert "[Cutlass] fp8 dynamic shapes test (#154829)"
This reverts commit 36596ad2a009a0906848fa264954d4b200efc50e.

Reverted https://github.com/pytorch/pytorch/pull/154829 on behalf of https://github.com/seemethere due to This is failing internal tests see, [fburl.com/diff/3gomp7i3](https://fburl.com/diff/3gomp7i3). Please re-land this as a co-dev diff ([comment](https://github.com/pytorch/pytorch/pull/154829#issuecomment-2940494361))
2025-06-04 15:36:27 +00:00
3fa3dbdb1f Revert "[Cutlass] EVT dynamic shapes support (#154835)"
This reverts commit 4224a7df01a9607830da771fd4884c8eba150630.

Reverted https://github.com/pytorch/pytorch/pull/154835 on behalf of https://github.com/seemethere due to This is part of a stack that is failing internal tests see, [fburl.com/diff/3gomp7i3](https://fburl.com/diff/3gomp7i3). Please re-land this as a co-dev diff ([comment](https://github.com/pytorch/pytorch/pull/154835#issuecomment-2940463211))
2025-06-04 15:33:09 +00:00
3ce5102927 [ROCm] fix CI failures from inductor periodic (#154896)
Similar idea as https://github.com/pytorch/pytorch/pull/154497, but for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154896
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-04 15:28:43 +00:00
a99a01a677 Revert "[dynamo] Mark a vt unspecialized nn module variable source earlier (#154780)"
This reverts commit cc96febb979da16b0a0b758020b330a49c72b7e7.

Reverted https://github.com/pytorch/pytorch/pull/154780 on behalf of https://github.com/seemethere due to This fails internal testing see, https://fburl.com/diff/b0yuxk4w ([comment](https://github.com/pytorch/pytorch/pull/154780#issuecomment-2940381691))
2025-06-04 15:03:34 +00:00
a0f2544502 Revert "[dynamo][dynamic] Recompilation hint for nn module integer attributes (#154867)"
This reverts commit 6c2f941e250ba34a920f476c8a9ee30e6153fc15.

Reverted https://github.com/pytorch/pytorch/pull/154867 on behalf of https://github.com/seemethere due to This fails internal testing see, https://fburl.com/diff/b0yuxk4w ([comment](https://github.com/pytorch/pytorch/pull/154780#issuecomment-2940381691))
2025-06-04 15:03:34 +00:00
1a55fb0ee8 Add __main__ guards to jit tests (#154725)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

In jit tests:

- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154725
Approved by: https://github.com/Skylion007
2025-06-04 14:44:08 +00:00
3f34d26040 Add __main__ guards to distributed tests (#154628)
This is the first PR of a series in an attempt to re-submit #134592 as smaller PRs.

In distributed tests:

- Ensure all files which should call run_tests do call run_tests.
- Raise a RuntimeError on tests which have been disabled (not run)
- Remove any remaining uses of "unittest.main()""

Cc @wconstab @clee2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154628
Approved by: https://github.com/Skylion007
2025-06-04 14:39:57 +00:00
c8d44a2296 Add __main__ guards to fx tests (#154715)
This PR is part of a series attempting to re-submit #134592 as smaller PRs.

In fx tests:

- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)
- Remove any remaining uses of "unittest.main()""

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154715
Approved by: https://github.com/Skylion007
2025-06-04 14:38:50 +00:00
cf9cad31df Add __main__ guards to tests (#154716)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

Add missing `if __name__ == "__main__":` guards to some tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154716
Approved by: https://github.com/Skylion007
2025-06-04 14:38:13 +00:00
ca0c2985d3 [ONNX] Allow exporter to export SDPA to Attention onnx operator (#154596)
Fixes [#149662](https://github.com/pytorch/pytorch/issues/149662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154596
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-06-04 14:29:44 +00:00
31d12b3955 Fix avg_pool2d param kernel_size descripthon (#154353)
Fixes part of #153149

## Test Result

![image](https://github.com/user-attachments/assets/216ffd2b-dd2b-4cf6-9fca-aeed075be5e7)

![image](https://github.com/user-attachments/assets/820cd184-1f8e-4a7a-b64e-15dfb9c7dad2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154353
Approved by: https://github.com/colesbury
2025-06-04 11:55:01 +00:00
2af78d368f Skip another test file that doesn't run gradcheck for slow gradcheck (#154852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154852
Approved by: https://github.com/albanD
2025-06-04 07:47:09 +00:00
0f10df71a6 [Intel GPU] Make SDPA output has the same stride as Query. (#154340)
Fixes [#153903](https://github.com/pytorch/pytorch/issues/153903).

Currently the output tensor of SDPA XPU is always defined as contiguous stride, while CPU/CUDA flash_attention and cudnn_attention allocate output tensor with stride the same as Query.

This PR aligns XPU's behavior with CUDA/CPU to make XPU compatible to CPU/CUDA's modeling code.

The function `alloc_with_matching_layout` is copied from cudnn 8c16d0e404/aten/src/ATen/native/cudnn/MHA.cpp (L874)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154340
Approved by: https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/guangyey
2025-06-04 07:16:56 +00:00
1e20745532 [ez][AOTI] Fix index offset for Optional Tensor Return (#155073)
Summary: As title. See added test for more context.

Test Plan:
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_output_2

Rollback Plan:

Differential Revision: D75900658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155073
Approved by: https://github.com/angelayi
2025-06-04 06:22:46 +00:00
d2bfd97d71 [export] Refactor pt2 save/load (#152495)
Refactor the pt2 archive saving to consolidate the format of torch.export.save and torch._inductor.package.package_aoti.

This PR adds the following functions, which torch.export.save and AOTI packaging calls into:
```python
package_pt2(
    f: FileLike,
    *,
    exported_programs: Optional[Union[ExportedProgram, dict[str, ExportedProgram]]] = None,
    aoti_files: Optional[Union[list[str], dict[str, list[str]]]] = None,
    extra_files: Optional[dict[str, Any]] = None,
) -> FileLike

@dataclass
class PT2ArchiveContents:
    exported_programs: dict[str, ExportedProgram]
    aoti_runners: dict[str, AOTICompiledModel]
    extra_files: dict[str, Any]

load_pt2(f: FileLike) -> PT2ArchiveContents
```

Power users directly call into these APIs if they want to bundle multiple exported programs, aoti files, or extra metadata.

This is how the pt2 archive looks like ([spec](https://docs.google.com/document/d/1RQ4cmywilnFUT1VE-4oTGxwXdc8vowCSZsrRgo3wFA8/edit?tab=t.0)):
```
├── archive_format
├── version
├── .data
├── data
│   ├── aotinductor
│   │   └── model1
│   │       ├── model1.cpp
│   │       ├── model1.so  # currently AOTI automatically moves weights in here, TODO to move it out
│   │       ├── cg7domx3woam3nnliwud7yvtcencqctxkvvcafuriladwxw4nfiv.cubin
│   │       └── cubaaxppb6xmuqdm4bej55h2pftbce3bjyyvljxbtdfuolmv45ex.cubin
│   ├── weights
│   │  ├── model1.pt  # TODO to dedup weights between model1/model2
│   │  └── model2.pt
│   └── constants
│   │  ├── model1.pt  # TODO to dedup weights between model1/model2
│   │  └── model2.pt
│   └── sample_inputs
│      ├── model1.pt  # TODO to dedup weights between model1/model2
│      └── model2.pt
├── extra
│   └── user_metadata.txt
└── models
    ├── model1.json
    └── model2.json
```

Future todos:
- unbundle the weights -- instead of .pt, we can use bin files, which will also allow us to dedup weights if we store multiple models
- update aoti_compile_and_package to also save the exported program
- integrate TNR with this packaging flow

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152495
Approved by: https://github.com/yushangdi
2025-06-04 06:04:29 +00:00
75b24c273b Export torch::utils::tensor_to_numpy (#154178)
Fixes #154105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154178
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/youkaichao
2025-06-04 05:48:27 +00:00
7b074346e0 [Intel GPU] Support f32 intermediate dtype, headdim size <=576 and f32 causal mask for SDPA (#152091)
In OneDNN v3.7, SDPA has below defects:

1. The dtype of intermediate value is the same as QKV, while Pytorch uses FP32 dtype for intermediate value to make sure better accuracy.
2. Only support headdim size <= 256.
3. Don't support implict causal mask when QKV is FP32. We need to build an attention mask explicitly with aten ops.

In OneDNN v3.8, they have update for these defects. Since these are tiny changes, I decided to put them in single PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152091
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/drisspg
2025-06-04 05:18:36 +00:00
4d93985d13 [c10d] Separate monitoring thread into a class in PGNCCL (#153977)
This is the start of a series of efforts to consolidating auxiliary threads in PGNCCL, aka watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instances, i.e., if users create hundred or thousand instances of PG or subPGs, we will end up with that twice many side threads which is not efficient. We have a RFC to consolidate them (https://github.com/pytorch/pytorch/issues/146956). Right now both threads are assigned with so many functionalities so it is hard to do the consolidations in one shot, we will try to split it into at least two steps (PRs) to make it easier to test and review.

We did our first attemp in https://github.com/pytorch/pytorch/pull/153668 but we also want to try to see if we can make monitoring thread a class. This PR is doing the first step to make monitoring thread a class. The next step to also extract watchdog to be a separate class so that we know its dependency.

What we did in this PR:
1. Move all related variables and methods into a class named `HeartbeatMonitor`.
2. Correct some errors in the original logics inside monitoring thread loop.
3. Move the error propagation check to watchdog thread which is more relevant. This is totally fine since we rolled out EventCache out fully so watchdog hang is rare now.

Today there are two major functions inside heartbeat monitoring thread today:
1. Check the heartbeat of watchdog thread every 8 minutes. If no heartbeat detected and we are sure monitoring thread has not been stopped, we will kill the program by SIG_ABORT.
2. We check TCPStore every 30 sec to see if any watchdog timeout happens on other ranks, if so we will initiate a dump signal on the current rank as well. (We do this only in the default PG)

Differential Revision: [D75799278](https://our.internmc.facebook.com/intern/diff/D75799278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153977
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
2025-06-04 04:07:07 +00:00
ec35a36820 [ROCm][Windows] Fix building tests for multiple architectures (#154979)
Fixing building C10_CUDA_ALL_TEST_FILES and Caffe2_HIP_TEST_SRCS for multiple architectures

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154979
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-04 03:53:21 +00:00
72fe1d5f42 Add randint_like tensor overload for high (#154899)
Fixes #135664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154899
Approved by: https://github.com/StrongerXi
ghstack dependencies: #154863
2025-06-04 03:37:09 +00:00
6b0c6f2856 [BE] Delete pre-CUDA-10.1 code from SparseCUDABlas (#155079)
As latest PyTorch is no longer buildable against it CUDA-10, so this is essentially a dead code

Made small change to hipify script to rename `cusparseGetErrorString` to `hipsparseGetErrorString`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155079
Approved by: https://github.com/atalman, https://github.com/cyyever
2025-06-04 03:29:24 +00:00
9f39028629 [MPS][BE] Move sigmoid op to Metal (#155080)
Fixes https://github.com/pytorch/pytorch/issues/154895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155080
Approved by: https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #154936, #155002, #155081
2025-06-04 03:28:11 +00:00
437df54cc8 [Inductor] Fix a few FX conversion bugs. (#154958)
# Feature
This PR fixes two bugs with Inductor's FX backend.
1. When extracting offsets from `ReinterpretView`'s, we accidentally took the offset of the parent layout instead of the view's layout. This case is triggered when multiple kernels write into the same buffer due to `torch.cat`.
2. In certain rare cases, `V.graph.graph_inputs` can contain a constant input value. In case this happens, create a new `sympy.Symbol` for the input, for compatibility with the existing `SymbolBuffer` abstraction mapping to an FX placeholder.  This case is triggered when calling `torch._inductor.compile` on  certain modules coming from `torch.export`.

# Test plan
Added a couple of tests exposing these bugs.
1. Concat with multiple kernels writing to the same buffer.
3. `Export` -> `torch._inductor.compile` with a constant input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154958
Approved by: https://github.com/jansel
2025-06-04 03:09:44 +00:00
3e57de1251 [ONNX] Create support for rotary embeddings (#154745)
This PR registers the RotaryEmbedding op in the `torch.ops.onnx` name spaces and allows the exporter to recognize and export onnx operators.

## Design

ONNX operators of their respective opset version is implemented in torch/onnx/ops/_impl.py, and are registered in the torch.ops.onnx namespace following the following rule:

`OpType-version => torch.ops.onnx.OpType.opset{version}`

For example, `RotaryEmbedding-23` becomes `torch.ops.onnx.RotaryEmbedding.opset23`

This name is parsed by the exporter to create an onnx node in the graph without having to go through translation.

When users use the ops in the model, we provide more convenient, unversioned functions under `torch.onnx.ops` that will dispatch to the implementations based on user input (type and provided attributes). For example, users can directly call `torch.onnx.ops.rotary_embedding()` to use the op natively in their pytorch models. I chose snake case naming to make the functions more pythonic and aligned with other torch apis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154745
Approved by: https://github.com/titaiwangms
2025-06-04 03:07:43 +00:00
37e6bf8adf Switch to _apply_to_tensors for dataclass input (#154897)
Fixes https://github.com/pytorch/pytorch/issues/153077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154897
Approved by: https://github.com/weifengpy
2025-06-04 02:19:52 +00:00
34e3930401 fix numpy compatibility for 2d small list indices (#154806)
Will fix #119548 and linked issues once we switch from warning to the new behavior,
but for now, given how much this syntax was used in our test suite, we suspect a silent change will be disruptive.
We will change the behavior after 2.8 branch is cut.
Numpy behavior was changed at least in numpy 1.24 (more than 2 years ago)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154806
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/albanD
2025-06-04 01:58:52 +00:00
e2760544fa [PT] expose FlightRecord API for building (#154866)
Summary: as titled

Test Plan:
CI

Rollback Plan:

Differential Revision: D75803611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154866
Approved by: https://github.com/fduwjj, https://github.com/d4l3k
2025-06-04 01:25:52 +00:00
d8e4c1c363 [BE] Define REGISTER_UNARY_TI_DISPATCH (#155081)
That creates _kernel_mps function that takes iterator and calls stub for
it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155081
Approved by: https://github.com/dcci
ghstack dependencies: #154936, #155002
2025-06-04 01:15:37 +00:00
50de6ae253 Revert "[BE][Ez]: Fully type nn.utils.clip_grad (#154801)"
This reverts commit 9ce2732b685da527308dc2dc4b2eeb4e252f57d1.

Reverted https://github.com/pytorch/pytorch/pull/154801 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154801#issuecomment-2937886337))
2025-06-04 00:41:27 +00:00
40a8770154 Incorporate coalesce analysis in codegen (#153751)
This pr uses the coalescing information in generating a tiling. The previous tiling heuristic would have each dependency generate a tiling. Then, we sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristics scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes.

In triton heuristics, for generating 3d tiled reductions, we take the same total block size that the 2d reduction would use, then distribute the block according to whichever block coalesces the most memory.

The motivating kernel is in https://github.com/pytorch/pytorch/issues/149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor.

While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153751
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730, #153748
2025-06-04 00:22:57 +00:00
6c2f941e25 [dynamo][dynamic] Recompilation hint for nn module integer attributes (#154867)
For program like this

```
class Mod(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.c = 0

    def forward(self, x):
        self.c += 1
        return x * self.c
```

You can check the recompile reasons at https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpzv9z6Q/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

![image](https://github.com/user-attachments/assets/856a95fd-0533-4abc-a213-1f73ae2cb766)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154867
Approved by: https://github.com/zou3519
ghstack dependencies: #154780
2025-06-04 00:05:53 +00:00
cbdacd32fe [AOTI][Intel GPU] Support multi_arch_kernel_binary option for XPU. (#154514)
Following the design of #154413, this PR add XPU support for generating kernel binary files that support multiple archs.

Fixes #154682, Fixes #154683, Fixes 154689, Fixes #154685 , Fixes #154690, Fixes #154681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154514
Approved by: https://github.com/desertfire, https://github.com/EikanWang
2025-06-03 23:02:00 +00:00
8f0e3f446d [Inductor UT] Reuse test_fused_attention.py for Intel GPU. (#154110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154110
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #154091
2025-06-03 23:01:05 +00:00
6c40e6606f [Inductor] Add attention pattern for model DistilBert in transformers==4.44.2. (#154091)
This PR add a attention fusion pattern that match the attention of
DistilDistilBert in transformers==4.44.2 at
953196a43d/src/transformers/models/distilbert/modeling_distilbert.py (L212)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154091
Approved by: https://github.com/jansel, https://github.com/eellison
2025-06-03 23:01:05 +00:00
4224a7df01 [Cutlass] EVT dynamic shapes support (#154835)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154835
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #154775, #154761, #154829
2025-06-03 22:20:34 +00:00
36596ad2a0 [Cutlass] fp8 dynamic shapes test (#154829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154829
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #154775, #154761
2025-06-03 22:20:33 +00:00
1c2b9cecd2 [Cutlass] Support bias arg for fp8 GEMM (#154761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154761
Approved by: https://github.com/drisspg
ghstack dependencies: #154775
2025-06-03 22:20:27 +00:00
5735729597 [Cutlass] Cleanup gemm_template evt handling (#154775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154775
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-06-03 22:20:18 +00:00
71499fee6b [3/3] Add build rule and test for Graph in nativert (#154532)
We split the large PR for adding Graph.h and Graph.cpp to nativert into 3 smaller PRs:

1. Add header file
2. Add source file
3. **Add test and build rules**

Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.

Differential Revision: [D75495273](https://our.internmc.facebook.com/intern/diff/D75495273/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154532
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #154530, #154531
2025-06-03 21:52:05 +00:00
b4c399d445 [2/3] Add source file for Graph in nativert (#154531)
We split the large PR for adding Graph.h and Graph.cpp to nativert into 3 smaller PRs:

1. Add header file
2. **Add source file**
3. Add test and build rules.

Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.

Differential Revision: [D75492405](https://our.internmc.facebook.com/intern/diff/D75492405/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154531
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #154530
2025-06-03 21:51:52 +00:00
55873dcb0d [1/3] Add header file for Graph in nativert (#154530)
We split the large PR for adding Graph.h and Graph.cpp to `nativert` into 3 smaller PRs:
1. **Add header file**
2. Add source file
3. Add test and build rules.

Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

4 classes have been introduced: `Graph`, `Node`, `Value`, `Type`
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value, it could be any kind that exists in `Type`. Values are inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.

Differential Revision: [D75491860](https://our.internmc.facebook.com/intern/diff/D75491860/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154530
Approved by: https://github.com/SherlockNoMad
2025-06-03 21:51:47 +00:00
69a57d9486 add JSON output support for operator benchmark (#154410)
To better support the integration of operator benchmark performance data into the OSS benchmark database for the dashboard, I’ve added a JSON output format that meets the required specifications: https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database#output-format
Since the current operator benchmark already has a flag `--output-json` to support saving the results into a JSON file, I add a new flag `--output-json-for-dashboard` for this feature.
At the same time, I renamed the `--output-dir` to `--output-csv` for a clearer and more intuitive expression.
An example of the JSON output of the operator benchmark.
```
[
  {
    "benchmark": {
      "name": "PyTorch operator benchmark - add_M1_N1_K1_cpu",
      "mode": "inference",
      "dtype": "float32",
      "extra_info": {
        "input_config": "M: 1, N: 1, K: 1, device: cpu"
      }
    },
    "model": {
      "name": "add_M1_N1_K1_cpu",
      "type": "micro-benchmark",
      "origins": [
        "pytorch"
      ]
    },
    "metric": {
      "name": "latency",
      "unit": "us",
      "benchmark_values": [
        2.074
      ],
      "target_value": null
    }
  },
  {
    "benchmark": {
      "name": "PyTorch operator benchmark - add_M64_N64_K64_cpu",
      "mode": "inference",
      "dtype": "float32",
      "extra_info": {
        "input_config": "M: 64, N: 64, K: 64, device: cpu"
      }
    },
    "model": {
      "name": "add_M64_N64_K64_cpu",
      "type": "micro-benchmark",
      "origins": [
        "pytorch"
      ]
    },
    "metric": {
      "name": "latency",
      "unit": "us",
      "benchmark_values": [
        9.973
      ],
      "target_value": null
    }
  },
]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154410
Approved by: https://github.com/huydhn
2025-06-03 21:29:24 +00:00
8e1474d3c6 [inductor] small cleanups in torch/_inductor/codegen/mps.py (#154921)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154921
Approved by: https://github.com/jansel, https://github.com/Skylion007
2025-06-03 20:57:25 +00:00
cyy
debd095149 Avoid index integer overflow in gemm_notrans_ (#154809)
Use uint64_t index types to avoid
```
 torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long'
    #0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24
    #1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12
    #2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
    #3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154809
Approved by: https://github.com/soulitzer
2025-06-03 19:28:34 +00:00
10c3e6ec43 [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

In [inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-06-03 19:21:15 +00:00
cc96febb97 [dynamo] Mark a vt unspecialized nn module variable source earlier (#154780)
I am working on providing some skip guard helper functions to allow users to reduce guard overhead. This is a refactor to allow that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154780
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-03 19:19:47 +00:00
ea7b233015 [flex attention][triton pin] triton_helpers shim for TMA apis (#154858)
Triton 3.4 will remove the experimental TMA apis: https://github.com/triton-lang/triton/pull/6488

To allow compatibility across different triton versions, we implement a shim layer which calls the new API if available, and otherwise falls back to the experimental API.

Test: `python test/inductor/test_flex_attention.py TestFlexAttentionCUDA.test_GQA_causal_mask_cuda` which previously fails w/ triton-lang/tritoncda4229558c5dca7f7c4734bedd3e596ebcae0b8, but now passes.

Note: we'll need to apply this for other things in inductor, this just does it for flex attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154858
Approved by: https://github.com/NikhilAPatel, https://github.com/drisspg
2025-06-03 19:15:48 +00:00
85fb13d0d1 [BE] Cleanup cuda 12.4 artifacts from scripts and workflows (#154893)
Remove artifacts. CUDA 12.4 was deprecated. hence no need to keep this code around

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154893
Approved by: https://github.com/nWEIdia, https://github.com/malfet, https://github.com/tinglvv
2025-06-03 18:43:40 +00:00
c014e9d7cd [inductor][test] test_padding.py: use inductor TestCase instead of dynamo TestCase (#154935)
test_pad_3d_tensor fails if you run it multiple times in a row, because the cache is populated and inductor skips the logic that increments the counter.

To fix this, switch these tests to use inductor's TestCase / run_tests instead of dynamo's - this way, a fresh inductor cache is used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154935
Approved by: https://github.com/Skylion007
2025-06-03 18:36:44 +00:00
e8183f8d3d add #pragma once to stable/library.h (#154920)
This shoulda been there and it was an oversight that it was not! We do not want the same translation unit to process this header multiple times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154920
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-06-03 18:34:53 +00:00
6f7694f18f [dynamo] Reconstruct defaultdict properly (#154931)
`DefaultDictVariable` inherited `ConstDictVariable.reconstruct`, causing
dynamo to reconstruct a `DefaultDictVariable` into a dict rather than
defaultdict. This patch fixes that.

Fixes #138412.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154931
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #154930
2025-06-03 18:18:40 +00:00
467235027c [AOTDispatch] Use the proper meta function for _amp_foreach_non_finite_check_and_unscale_ (#154930)
As title, this fixes part of #138412.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154930
Approved by: https://github.com/zou3519
2025-06-03 18:18:40 +00:00
462579af11 Update merge_rules.yaml (#155008)
- add new docs reviewers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155008
Approved by: https://github.com/malfet
2025-06-03 18:09:23 +00:00
f714599c57 [MPS][BE] Extend torch.special. to integer dtypes (#155002)
By changing the functor to looks as follows
```metal
struct xlog1py_functor {
  template <typename T, enable_if_t<is_floating_point_v<T>, bool> = true>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(c10:🤘:xlog1py(a, b));
  }
  template <typename T, enable_if_t<is_integral_v<T>, bool> = true>
  inline float operator()(const T a, const T b) {
    return c10:🤘:xlog1py(float(a), float(b));
  }
};
```

Repeat the same for `zeta`, `chebyshev_polynomial_[tuvw]_functor` and `hermite_polynomial_h[e]_functor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155002
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #154936
2025-06-03 17:52:41 +00:00
31405a69fb [typing] Add missing type annotations to torch.nn.init module (#154504)
## Summary

Adds missing type annotations to `torch.nn.init` and removes `# mypy: allow-untyped-defs` since all functions are now properly typed.

## Changes

- Added missing type annotations to initialization functions in the module.
- Added missing typing imports: `Any`, `Callable`, `Union`
- Removed `# mypy: allow-untyped-defs` comment
- Create Literal types for kaiming initialization mode and nonlinearity.
- Created `__all__`

## Why

Better IDE support, catches type errors earlier, and brings the module up to PyTorch's typing standards. No runtime changes - purely additive typing improvements.

Tested with existing test suite and lintrunner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154504
Approved by: https://github.com/Skylion007
2025-06-03 17:33:32 +00:00
40142978d7 Add type annotation to orthogonal_ (#154927)
Trivial charge, but I want pyright to stop yelling at me
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154927
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-03 17:00:02 +00:00
1f131fe56b Update bug-report.yml (#154857)
Update issue template for binary data and numerical notes.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154857
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-03 16:13:07 +00:00
ff92b42fc3 [c10d][gloo] Integrate vendor generic FR into gloo (#152614)
This is a first quick prototyping for FR integration for gloo. Few features gaps:
- Input/Output numels for each collective
- Whether to use c10::Event or where to use it.
- Where to dump the FR traces. (The dump api is provided in this PR)

Differential Revision: [D75803601](https://our.internmc.facebook.com/intern/diff/D75803601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152614
Approved by: https://github.com/d4l3k
ghstack dependencies: #154929
2025-06-03 16:12:54 +00:00
283f876ab6 [PP] Fix disabled flaky tests (#154856)
Fix https://github.com/pytorch/pytorch/issues/154373, https://github.com/pytorch/pytorch/issues/154391, https://github.com/pytorch/pytorch/issues/154408, https://github.com/pytorch/pytorch/issues/154443, https://github.com/pytorch/pytorch/issues/154481

Because MultiProcContinousTest [now executes the tests with 8 GPUs instead of 2](https://github.com/pytorch/pytorch/pull/153653), our PP tests comparing gradients have become flakier due to the longer pipeline. The gradients are still close but we need to relax the tolerance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154856
Approved by: https://github.com/Skylion007
2025-06-03 15:55:29 +00:00
250e9af4da Removing per torch.compile audit. (#154572)
Removing https://pytorch.org/docs/stable/torch.compiler_best_practices_for_backends.html per torch.compile audit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154572
Approved by: https://github.com/williamwen42, https://github.com/svekars
2025-06-03 15:41:52 +00:00
3685b10170 Turn on compile with NVSHMEM (#154538)
Before:
`USE_NVSHMEM=1` need to be explicit set in build environment.

After:
`USE_NVSHMEM=1` is the default for CUDA/Rocm on Linux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154538
Approved by: https://github.com/ngimel
2025-06-03 15:24:24 +00:00
a1a268aff5 [dtensor] fix simplefsdp mixed-precision training bugs (#154975)
This is a follow-up on the previous dtensor redistribute PR: https://github.com/pytorch/pytorch/pull/150740, which enables SimpleFSDP's mixed-precision training.

In the most recent integration in TorchTitan: https://github.com/pytorch/torchtitan/pull/1250, we found some discrepancies between SimpleFSDP's `fully_shard` and `replicate` modes when MPT is enabled. After debugging, I found the problem is in dtensor redistribute --`local_tensor` is taken out again from the original `input`. Thus, the dtensor used for communication has its original precision instead of using `forward_dtype`.

This PR fixes this issue and corrects previously added test cases.

After fixing the bug, the loss curves of `fully_shard` and `replicate` mode match perfectly.

![loss](https://github.com/user-attachments/assets/a8faddae-a476-48c0-a411-3fe04d2233bd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154975
Approved by: https://github.com/tianyu-l
2025-06-03 14:47:36 +00:00
2608927cfb Solve for tilings (#153748)
Find variables that coalesce the reads and writes and score the total size. If uncoalesced memory expressions are found, look for additional tiling of variables which will coalesce memory accesses.

For instance - for the following expression: `(32*p0) // 2048`, tiling p0 by 64 will make this expression coalesced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153748
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730
2025-06-03 14:37:30 +00:00
812deecaab Add option to define OpenBLAS version for manylinux Dockerfile_2_28_aarch64 (#150106)
Adds optional variable OPENBLAS_VERSION to `.ci/docker/common/install_openblas.sh` used to define which version of OpenBLAS to install. Adds argument to `Dockerfile_2_28_aarch64` image.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150106
Approved by: https://github.com/aditew01, https://github.com/fadara01, https://github.com/malfet

Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
2025-06-03 14:35:54 +00:00
0adbde4d35 Analyze coalesced mem (#153730)
Analyze memory expressions to see if they contain a coalescing symbol.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153730
Approved by: https://github.com/jansel
ghstack dependencies: #153723
2025-06-03 14:29:06 +00:00
e9266f807a [BE] Use vendored packaging for testing (#154946)
As the rest of the torch uses it, test should rely on it as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154946
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-03 14:22:53 +00:00
9cdce682a1 [MPS][BE] Reimplement log1p as Metal shader (#154936)
That should make it faster than MPSGraph implementation, but also
improves accuracy for small inputs, by using the algorithm described in [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#1202), i.e. $log(1+x) = \frac{x * log(1+x)}{(1 + x) - 1}$ if $1 +x \neq 1$ else just $x$

Also tried using first 3 elements of Taylor series in Horner's form which also seems to work fine, i.e. $log(1+x) \approx x * (1 -x (\frac{1}{2} -  \frac{x}{3}))$

Replaced less accurate log1p implementation in `c10/metal/special_math.h` with generic one.

Parametrize and modify regression test to check for accuracy of small values

TODOs:
 - Do proper implementation for complex values as well, perhaps using 0408ba0a76/mlx/backend/metal/kernels/utils.h (L339)
 - May be implement it using Remez-like algorithm documented here 207f3b2b25/lib/msun/src/s_log1pf.c (L37)
 - Or use llvm's implementation from f393986b53/libclc/clc/lib/generic/math/clc_log1p.inc (L22)
 - Benchmark which algorithm is faster and delivers better accuracy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154936
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-06-03 14:10:13 +00:00
00dfd3891e [Tiling rewrite pt1] Normalize reads and writes to common iter space (#153723)
In order to take the globally best tiling, we need to normalize all the node read and writes to a common iteration space. This first pr finds a common split among nodes in a fused scheduler node, and then normalizes reads and writes to the common split.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153723
Approved by: https://github.com/jansel
2025-06-03 14:04:34 +00:00
635b73e697 [dynamo][guards] Flush cache to more accurately measure guard overhead (#154764)
We observed that guard overhead at runtime using profiler traces was
higher than reported in this profiling function at the compile time.
After investigation, we found that f_locals are already in cache and
that was causing the guard overhead to be way smaller while profiling
during the compilation. To be more realistic, we flush the cache here.

Profiling the guard overhead during compilation (in addition to at
runtime) allows faster iteration time, and logging in tlparse and
internal databases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154764
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/StrongerXi
2025-06-03 11:50:57 +00:00
71a0af8a14 [TEST][Quantization] Skip test_learnable due to hypothesis (#152819)
As per comment in https://github.com/pytorch/pytorch/issues/111471#issuecomment-1866933243 the tests are failing due to hypothesis. This PR adds a skip to those tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152819
Approved by: https://github.com/eqy
2025-06-03 11:23:15 +00:00
ea5b9eca74 Combine sticky pgo key with job id (#154863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154863
Approved by: https://github.com/Mingming-Ding
2025-06-03 07:58:38 +00:00
a4da1d4a47 [Graph Partition] support standalone_compile (#154698)
For graph partition, `write_get_raw_stream_header_once` is done once so the autotune code may not have the header. This PR additionally calls `write_get_raw_stream_header` in `codegen_device_guard_enter` before `get_raw_stream` is used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154698
Approved by: https://github.com/oulgen
2025-06-03 07:40:42 +00:00
d91c85babb [c10d][fr] Split cuda and non-cuda fr logic into two cpp file (#154929)
During the integration fr with gloo I found that put all logic inside one cpp with both build Macro does not work in the current linkage set up in the bazil file. If we put the cpp in the libtorch_cpu, then cuda side build will fail, if we put both we get complaint about  ld.lld: error: duplicate symbol: typeinfo for c10d::DebugInfoWriter. To fix this, we need to move the common logic into another header file and we use different cpp file for cpu and cuda so that fr can be used in both cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154929
Approved by: https://github.com/kwen2501
2025-06-03 07:00:14 +00:00
13044b2b04 Move c10/macros/Export.h to torch/standalone (#154850)
Summary: The goal of this PR and future follow-up PRs is to group a set of header files required by AOTInductor Standalone in a separate directory, ensuring they are implemented in a header-only manner.

Test Plan: CI

Bifferential Revision: D75756619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154850
Approved by: https://github.com/janeyx99
2025-06-03 06:18:59 +00:00
a7e496a896 Revert "[dynamo] Record the pre-graph bytecode using fast record function event (#154769)"
This reverts commit 409c396a48584de1ab14e1be6957663d548ad89e.

Reverted https://github.com/pytorch/pytorch/pull/154769 on behalf of https://github.com/seemethere due to This fails internal tests see [fburl.com/diff/67gyp7gp](https://fburl.com/diff/67gyp7gp) ([comment](https://github.com/pytorch/pytorch/pull/154769#issuecomment-2933629894))
2025-06-03 06:13:49 +00:00
b86aaaae0b Revert "[dynamo][guards] Flush cache to more accurately measure guard overhead (#154764)"
This reverts commit 7dee89913072f1499c5265d8e92d23c30fc6a7f1.

Reverted https://github.com/pytorch/pytorch/pull/154764 on behalf of https://github.com/seemethere due to This fails internal tests see [fburl.com/diff/67gyp7gp](https://fburl.com/diff/67gyp7gp) ([comment](https://github.com/pytorch/pytorch/pull/154769#issuecomment-2933629894))
2025-06-03 06:13:49 +00:00
d375e64279 [cutlass backend][forward fix] hex the cutlass key instead of decode (#154885)
This is mainly following how it is done for torch_key.

Error was:
```
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154885
Approved by: https://github.com/jingsh, https://github.com/mlazos
2025-06-03 06:00:16 +00:00
8af447224e Improve error message for torch.fft.ihfft2 when input's dtype is complex (#149692)
Fixes #149625

For the case mentioned in the issue, will get:

```
RuntimeError: Only supports floating-point dtypes, but found: ComplexDouble
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149692
Approved by: https://github.com/malfet
2025-06-03 05:54:56 +00:00
295ea202f6 [inductor] Add kernel_hash_key to ChoiceCaller (#154470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154470
Approved by: https://github.com/mlazos
2025-06-03 04:01:49 +00:00
cyy
388912dd94 Remove AttributeError constructor (#154808)
It is a private API and uses C vsnprintf, which is not type safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154808
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-03 03:49:09 +00:00
ef92653022 Revert "Remove AttributeError constructor (#154808)"
This reverts commit 3239da0c732c4ad736df7081ea44c1cd79c01145.

Reverted https://github.com/pytorch/pytorch/pull/154808 on behalf of https://github.com/cyyever due to Need format code ([comment](https://github.com/pytorch/pytorch/pull/154808#issuecomment-2933286113))
2025-06-03 03:40:41 +00:00
b3cb0e83de [FSDP2] respect reshard_after_forward=True for root model (#154704)
resolve https://github.com/pytorch/pytorch/issues/154655

`fully_shard(root, reshard_after_forward=True)` didn't really reshard parameters after forward, because we assumed root model will be used in backward immeidately. The assumption becomes invalid in 2 cases
* we have 3 roots for CLIP, T5, FLUX. we should reshard parameters are CLIP and T5 immeidately after their forward
for recommendation model, we may have mutiple root for dense part

Change default beahvior to always respect `reshard_after_forward=True`

Differential Revision: [D75663200](https://our.internmc.facebook.com/intern/diff/D75663200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154704
Approved by: https://github.com/mori360
2025-06-03 03:12:45 +00:00
ff35c0cdfd [inductor] Change _constexpr_to_value -> _unwrap_if_constexpr (#154905)
To adapt to the changes from: f480e2f697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154905
Approved by: https://github.com/davidberard98
2025-06-03 03:10:56 +00:00
cyy
e3cf73ee49 Move remaining CI jobs to VS 2022 (#154811)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154811
Approved by: https://github.com/huydhn
2025-06-03 02:21:24 +00:00
3239da0c73 Remove AttributeError constructor (#154808)
It is a private API and uses C vsnprintf, which is not type safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154808
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-03 02:18:51 +00:00
28cb3c0fe5 [test][inductor] attempt to fix duplicate registration issue (#154865)
Fixes #154216

In #154216, there's a duplicate registration error thrown from registering `test::foo` twice. I expect that this is caused by having two tests that both register a `test::foo` op in the same test file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154865
Approved by: https://github.com/NikhilAPatel, https://github.com/jingsh
2025-06-03 01:11:47 +00:00
6cb6da6ea2 [triton pin][test] relax codecache test checks for number of triton artifacts (#154879)
Triton has added another artifact that gets generated (triton-lang/triton#6992), so `test_cache_load_function` started failing as there are now 8 (instead of 7) artifacts.

Instead of figuring out a way to check exactly which set of artifacts will get generated, I instead modified the test to just check that there are _at least_ 6 artifacts, to account for different platforms (intel/amd/nvidia) and different triton versions (which may or may not have a `.source` artifact)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154879
Approved by: https://github.com/oulgen, https://github.com/masnesral
2025-06-03 00:52:54 +00:00
7f44b589be [dynamo] fix pruning locals with ShapeEnvSource (#154752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154752
Approved by: https://github.com/zhxchen17
2025-06-03 00:35:11 +00:00
47a142c3c2 [triton pin][tests] update inductor/profiler launch_(enter|exit)_hooks tests (#154894)
Fixes #154223

Triton has updated launch_(enter|exit)_hooks so that they are now in `knobs`. @danzimm already fixed this in #152457 - this just updates the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154894
Approved by: https://github.com/jingsh, https://github.com/NikhilAPatel
2025-06-03 00:14:14 +00:00
731acbfb0b [CI] Reuse old whl on PRs (#154662)
Turn off main branch only gating for reusing old whls
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154662
Approved by: https://github.com/huydhn
2025-06-03 00:10:39 +00:00
af9f18e87e [nativert] Free stale execution frames (#154636)
Summary:
This was implemented in SR due to caching of runtime instances building up and causing some memory usage spikes after some large amount traffic went through the model, and then once traffic went down, SR was still caching all the previous usage.

We need something similar on the Sigmoid side to make sure the static dispatch modules aren't hogging memory. Currently, all ExecutionFrame objects are being cached, and never freed if stale.

Test Plan:
Added extra execution frames in tmp commit D75257998 and ran local replayer test to confirm extra execution frames get cleaned up down to min size, which is set at 8

 {F1978532047}

Also tested by modifying load_net_predictor (modifications also in D75257998) to run benchmarkNumIterations twice - once with benchmarkNumThreads, and once with only one thread. Also set clearing interval at one second. Verified that execution frames get cleared when we drop down to one thread.

 {F1978558984}

```
buck2 test 'mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:model_runner_test -- ModelRunnerTest.Basic_InterpreterCuda_Multithread_Cleanup --run-disabled --print-passing-details
```

Bifferential Revision: D75257992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154636
Approved by: https://github.com/zhxchen17, https://github.com/dolpm
2025-06-02 23:44:12 +00:00
37eb909c94 Revert "[Inductor] Add attention pattern for model DistilBert in transformers==4.44.2. (#154091)"
This reverts commit 7b25ff7cf2e6096c103da0068e417216a41be7a9.

Reverted https://github.com/pytorch/pytorch/pull/154091 on behalf of https://github.com/seemethere due to I root caused this PR to some failures, I tried to resolve with https://github.com/pytorch/pytorch/pull/154923 but it looks like there are more failures with my fix ([comment](https://github.com/pytorch/pytorch/pull/154091#issuecomment-2932848880))
2025-06-02 23:22:43 +00:00
ac65e94f45 Revert "[Inductor UT] Reuse test_fused_attention.py for Intel GPU. (#154110)"
This reverts commit 2dfc0e33273fe50dcbb3d363da02c8cc485b4adc.

Reverted https://github.com/pytorch/pytorch/pull/154110 on behalf of https://github.com/seemethere due to This is part of a stack with failures internally, I tried to resolve with https://github.com/pytorch/pytorch/pull/154923 but it looks like there are more failures ([comment](https://github.com/pytorch/pytorch/pull/154110#issuecomment-2932845168))
2025-06-02 23:20:11 +00:00
e3af628b0d Revert "Add CPython exception tests (#150789)"
This reverts commit 67fb9b7cc3f7d2ebbb104296f2b11776f4adbb22.

Reverted https://github.com/pytorch/pytorch/pull/150789 on behalf of https://github.com/seemethere due to This is failing upstream in trunk, see 67fb9b7cc3 ([comment](https://github.com/pytorch/pytorch/pull/150789#issuecomment-2932823586))
2025-06-02 23:12:15 +00:00
7dee899130 [dynamo][guards] Flush cache to more accurately measure guard overhead (#154764)
We observed that guard overhead at runtime using profiler traces was
higher than reported in this profiling function at the compile time.
After investigation, we found that f_locals are already in cache and
that was causing the guard overhead to be way smaller while profiling
during the compilation. To be more realistic, we flush the cache here.

Profiling the guard overhead during compilation (in addition to at
runtime) allows faster iteration time, and logging in tlparse and
internal databases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154764
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #154769
2025-06-02 23:01:58 +00:00
409c396a48 [dynamo] Record the pre-graph bytecode using fast record function event (#154769)
![image](https://github.com/user-attachments/assets/1d06618b-1c14-4ed5-ab7b-dcfecbb4d632)

Adds another event in the profiler traces. This can help us find models where pre-graph bytecode is very expensive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154769
Approved by: https://github.com/zou3519, https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel
2025-06-02 22:33:27 +00:00
f6b83d4cc6 sort iteration over index vars (#154846)
Fix for https://github.com/pytorch/pytorch/issues/154741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154846
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
2025-06-02 22:06:00 +00:00
d6420d4f85 [CI] Reuse old whl: replace the version (#154773)
Replace the git version, so whl name goes from `torch-something+git<old commit>` to `torch-something+git<new commit>`

Renamed a bunch of variables to hopefully be more clear

Tested on ef210ad54b
* Removed gating that prevents it from running on PRs (which is going to be merged soon)
* Removed gating that checks for which files can be changed (since this PR has stuff outside of the acceptable list)
* The above two allow the whl to be reused, and I added assert 1 == 2 in common_utils and checked that jobs failed (meaning they were using updated code despite not building)

Checked that the whl in the docker image has the right commit sha, didn't check torch.__version__ though
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154773
Approved by: https://github.com/malfet
2025-06-02 22:02:41 +00:00
e1644e40a7 [ez][TD] Fix TD indexer workflow (#154868)
Update docker image, and fix gpu flag env var

Example failure: https://github.com/pytorch/pytorch/actions/runs/15381170311/job/43272174443

Tested on 9cb28f03e5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154868
Approved by: https://github.com/Skylion007
2025-06-02 21:33:19 +00:00
104c31598f [cutlass backend][ez] Make load config from local more resilient (#154740)
Differential Revision: D75693211

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154740
Approved by: https://github.com/ColinPeppler
2025-06-02 21:12:12 +00:00
731e635c95 Add CPython math/cmath tests (#150794)
Tests:
* test_math.py
* test_cmath.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:

```bash
for f in "test_math" "test_cmath"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150794
Approved by: https://github.com/zou3519
2025-06-02 20:49:44 +00:00
67fb9b7cc3 Add CPython exception tests (#150789)
----

* test_baseexception.py
* test_exceptions.py
* test_exception_variations.py
* test_raise.py
* test_sys.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:
```bash
for f in "test_raise" "test_sys" "test_exceptions" "test_baseexception" "test_exception_variations"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150789
Approved by: https://github.com/zou3519
2025-06-02 20:44:41 +00:00
48807d568e [CI][CUDA] Migrate remaining cu118 jobs to cu128 (#154169)
Contributing to the fix of #147383   and #154119

Additional steps required: 3218b1b684/.github/workflows/lint.yml cu118 needs to be updated.
Make install_cuda.sh accept both 12.8 and 12.8.* as CUDA_VERSION argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154169
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman, https://github.com/tinglvv
2025-06-02 20:22:14 +00:00
9d3ad82ca7 [dynamo] Remove all skipIfTorchDynamo in test_tensor_creation_ops.py (#154693)
Looks like they are no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154693
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-06-02 20:14:35 +00:00
984b1a80e3 [ez] add docs for *eager_then_compile stances (#154818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154818
Approved by: https://github.com/williamwen42
ghstack dependencies: #154802, #154826, #154822, #154823, #154805
2025-06-02 19:04:35 +00:00
28f27886eb Vary batch size when running dynamic shapes benchmarks (#154805)
This better measures the actual runtime performance of dynamic shapes
where we aren't guaranteed to have similar shapes as the original hint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154805
Approved by: https://github.com/Skylion007
ghstack dependencies: #154802, #154826, #154822, #154823
2025-06-02 18:56:18 +00:00
33f2d0ff45 add reference to stances from dynamic shapes doc (#154823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154823
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
ghstack dependencies: #154802, #154826, #154822
2025-06-02 18:47:19 +00:00
d99e9568ec Add docs for how to mark as unbacked (#154822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154822
Approved by: https://github.com/Skylion007
ghstack dependencies: #154802, #154826
2025-06-02 18:30:57 +00:00
1258aac1c2 [dynamo] Upcast torch.Size + tuple to be of size torch.Size (#154830)
Fixes https://github.com/pytorch/pytorch/issues/154432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154830
Approved by: https://github.com/StrongerXi, https://github.com/Skylion007, https://github.com/williamwen42
2025-06-02 17:57:23 +00:00
9fe1b40d17 [ez] add dynamic sources docs (#154826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154826
Approved by: https://github.com/Skylion007
ghstack dependencies: #154802
2025-06-02 17:53:30 +00:00
69e22301da Revert "[inductor] Add kernel_hash_key to ChoiceCaller (#154470)"
This reverts commit 7a79de1c0f31200f95a48a9e69fbd2df2a3c735d.

Reverted https://github.com/pytorch/pytorch/pull/154470 on behalf of https://github.com/seemethere due to Failing internal inductor tests, author is aware and suggested revert. D75767762 ([comment](https://github.com/pytorch/pytorch/pull/154470#issuecomment-2931717432))
2025-06-02 17:43:23 +00:00
113224b530 Enable non blocking remote cache write (#154837)
Test Plan:
Ran
```
buck2 run mode/opt //scripts/oulgen:runner
```
twice
and got

https://fburl.com/scuba/pt2_remote_cache/u7u1uqh1

Differential Revision: D75770423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154837
Approved by: https://github.com/jamesjwu
2025-06-02 17:36:43 +00:00
67067512a1 Revert "[BE] Cleanup old ExecuTorch codegen and runtime code (#154165)"
This reverts commit 515c19a3856e953c0fe23a0ed4fa844f8eea34d8.

Reverted https://github.com/pytorch/pytorch/pull/154165 on behalf of https://github.com/seemethere due to This is failing when attempting to test against executorch main internally, author has acknowledged that this should be reverted ([comment](https://github.com/pytorch/pytorch/pull/154165#issuecomment-2931489616))
2025-06-02 16:28:46 +00:00
981bdb39ca Enable ConvTranspose3D for FP32 and Complex64 (#154696)
Fixes #154615

Enables using ConvTranspose3D since it seems support exists both on MacOS 14 and 15.

For the half dtypes the discrepancy of CPU and GPU implementations is too large to conclude whether there is a bug in the implementation or not without a more rigorous study on what bounds are there to the expected error. So they are left unsupported for now and an assert is added to notify the user if the op is called with fp16 or bf16 inputs.

Tests for ConvTranspose3D were enabled for the supported data types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154696
Approved by: https://github.com/malfet
2025-06-02 16:24:03 +00:00
77d85a4629 Symintify baddbmm (#154656)
Previously we would specialize on the shape in this if-statement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154656
Approved by: https://github.com/pianpwk
2025-06-02 15:23:14 +00:00
e22be781b7 Symintify repeat_interleave (#154660)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154660
Approved by: https://github.com/pianpwk
2025-06-02 15:19:39 +00:00
cyy
f6275bf0fe Bump pocketfft submodule to the latest (#154845)
Fixes #154843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154845
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-02 14:54:13 +00:00
dfd6849e77 Update lint_urls.sh (#154838)
Do not match empty urls pieces like "https://"
Add headers for better handling urls like "https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154838
Approved by: https://github.com/Skylion007
2025-06-02 14:50:34 +00:00
c65e9ad77a Update slow tests (#154347)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154347
Approved by: https://github.com/pytorchbot
2025-06-02 11:30:56 +00:00
ff4515fde5 Add optional check_pinning argument to _validate_sparse_compressed_tensor/coo_args (#154759)
As in the title.

A prerequisite to https://github.com/pytorch/pytorch/pull/154638 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154759
Approved by: https://github.com/amjames, https://github.com/ngimel
ghstack dependencies: #154610
2025-06-02 10:17:07 +00:00
3f3c1f419f User-controlled sparse tensor validation when loading data from external storage (#154610)
This PR lets users to control sparse tensor invariants validation (that can be expensive, especially, for sparse tensors with many indices) when loading data from external sources.

By default, the validation of sparse tensor invariants is disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154610
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-06-02 10:17:07 +00:00
9258cfc227 [audio hash update] update the pinned audio hash (#154776)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154776
Approved by: https://github.com/pytorchbot
2025-06-02 05:36:13 +00:00
16d05e130c [CI][CUDA][UCC] Update test_c10d_ucc.py - remove xfailIfLinux because it now succeeds (#150979)
pytest -v test/distributed/test_c10d_ucc.py  -k test_save_load
============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.12.3, pytest-8.1.1, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/opt/pytorch/pytorch/.hypothesis/examples'))
rootdir: /opt/pytorch/pytorch
configfile: pytest.ini
plugins: anyio-4.9.0, hypothesis-6.130.13, flakefinder-1.1.0, rerunfailures-15.0, xdist-3.6.1, xdoctest-1.0.2, typeguard-4.3.0
collected 63 items / 62 deselected / 1 selected
Running 1 items in this shard

test/distributed/test_c10d_ucc.py::DistributedDataParallelTest::test_save_load_checkpoint PASSED [65.2581s]                                                                                               [100%]

================================================================================== 1 passed, 62 deselected in 68.78s (0:01:08)

@ptrblck @eqy @tinglvv @atalman @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150979
Approved by: https://github.com/eqy
2025-06-02 03:24:35 +00:00
cd3d2b75b3 Update README.md - James has the wrong github link. (#151473)
Unless I'm wrong, the James on the pytorch paper is not the account linked to in the README.md.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151473
Approved by: https://github.com/albanD
2025-06-02 01:53:44 +00:00
515c19a385 [BE] Cleanup old ExecuTorch codegen and runtime code (#154165)
Summary: These files are added to pytorch/pytorch before ExecuTorch is
opensourced. Now is a good time to remove it from pytorch/pytorch, since
the code is moved to pytorch/executorch already.

Test Plan: Rely on CI jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154165
Approved by: https://github.com/kimishpatel, https://github.com/Skylion007, https://github.com/cyyever
2025-06-02 01:47:02 +00:00
0d0058d90d Fix flaky test in test_custom_ops (#152484)
Hopefully fixes https://github.com/pytorch/pytorch/issues/151301, https://github.com/pytorch/pytorch/issues/151281 by making the ops have different names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152484
Approved by: https://github.com/zou3519
2025-06-02 01:45:28 +00:00
80af98c6c3 [BE]: Update nlohmann submodule to 3.12.0 (#154817)
This is mostly compiler fixes, C++20 fixes, and clang-tidy fixes. Should be entirely backwards compatible with our current version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154817
Approved by: https://github.com/jansel, https://github.com/malfet
2025-06-02 01:29:58 +00:00
2b2245d5db [BE]: Replace printf with fmtlib call (#154814)
Safer, faster, more concise, and better type checking. Also add a few misc changes in the file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154814
Approved by: https://github.com/jansel
2025-06-01 22:27:08 +00:00
206e9d5160 [BE]: Update cpp-httplib submodule to 0.20.1 (#154825)
Updates cpp-httplib to 0.20.1. This mostly updates OSS with a bunch of CMake, CXX compiler errors, and bugfixes from upstream. It's a header only library so should be pretty straightforward to upgrade
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154825
Approved by: https://github.com/malfet
2025-06-01 21:44:23 +00:00
064bb3cebc [BE]: Replace a couple of call sites with fmtlib printf (#154533)
This is faster, and memory safe implementation of printf functions coming from fmtlib.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154533
Approved by: https://github.com/cyyever, https://github.com/jansel
2025-06-01 21:16:34 +00:00
0350c7e72c [BE] Introduce torch.AcceleratorError (#152023)
Which inherits from `RuntimeError` and contains `error_code`, which in case of CUDA should contain error returned by `cudaGetLastError`

`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's  [`PyErr_SetString`](cb8a72b301/Python/errors.c (L282)), namely
- Convert cstr into Python string with `PyUnicode_FromString`
- Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](cb8a72b301/Python/errors.c (L32))
- Set `error_code` property using `PyObject_SetAttrString`
- decref all temporary references

Test that it works and captures CPP backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'

import torch

x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
    x[y] = 2
    print(x)
except torch.AcceleratorError as e:
    print("Exception was raised", e.args[0])
    print("Captured error code is ", e.error_code)
```

which produces following output
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0
#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0
#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0
#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0
#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0
#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0
#12 at::native::abs(at::Tensor const&) from ??:0
#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0
#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0
#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#17 at::_ops::abs::call(at::Tensor const&) from ??:0
#18 at::native::isfinite(at::Tensor const&) from ??:0
#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0
#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0
#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
#22 PyObject_CallFunctionObjArgs from ??:0
#23 _PyObject_MakeTpCall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyEval_EvalFrameDefault from ??:0
#33 _PyFunction_Vectorcall from ??:0
#34 _PyEval_EvalFrameDefault from ??:0
#35 PyFrame_GetCode from ??:0
#36 PyNumber_Xor from ??:0
#37 PyObject_Str from ??:0
#38 PyFile_WriteObject from ??:0
#39 _PyWideStringList_AsList from ??:0
#40 _PyDict_NewPresized from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyEval_EvalCode from ??:0
#43 PyEval_EvalCode from ??:0
#44 PyUnicode_Tailmatch from ??:0
#45 PyInit__collections from ??:0
#46 PyUnicode_Tailmatch from ??:0
#47 _PyRun_SimpleFileObject from ??:0
#48 _PyRun_AnyFileObject from ??:0
#49 Py_RunMain from ??:0
#50 Py_BytesMain from ??:0
#51 __libc_init_first from ??:0
#52 __libc_start_main from ??:0
#53 _start from ??:0

Captured error code is  710
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: #154436
2025-06-01 21:02:43 +00:00
f7c09f864a [Docs] Reformat sparse example (#154785)
Not sure why, but rst fails to colorize multiline inputs, but works fine for single line commands
Test plan:
| [Before](https://docs.pytorch.org/docs/main/sparse.html#construction)  | [After](https://docs-preview.pytorch.org/pytorch/pytorch/154785/sparse.html#construction) |
| ------------- | ------------- |
| <img width="466" alt="image" src="https://github.com/user-attachments/assets/96a5c52a-1804-4d05-a5cf-c10221aaddf6" />  | <img width="477" alt="image" src="https://github.com/user-attachments/assets/99565288-5c0b-4e8e-bd60-f016ebc207b5" />  |

Fixes https://github.com/pytorch/pytorch/issues/154779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154785
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2025-06-01 20:56:14 +00:00
c2e9115757 Fix typo in dcp module (#154815)
Fixed the  docstring in `validate_checkpoint_id`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154815
Approved by: https://github.com/Skylion007
2025-06-01 18:18:45 +00:00
b90fc2ec27 [ez] delete code that died a long time ago (#154802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154802
Approved by: https://github.com/Skylion007
2025-06-01 14:57:03 +00:00
0cd18ba1ca [BE][Ez] Update deprecated pybind11 functions (#154798)
* getType() is deprecated, replace it with new/proper static method. These are backwards compatible with old pybind11 versions we support. So break this off before we upgrade to pybind11 3.0 where these methods are dropped in #154115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154798
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-06-01 06:17:50 +00:00
bfae151269 [BE][Ez]: Remove unneeded mypy suppressions (#154800)
Improvements in typing have made this suppression unnecessary

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154800
Approved by: https://github.com/cyyever, https://github.com/jansel
2025-06-01 06:10:41 +00:00
9cbbc2593b test for 146431 (#154786)
Adds test for #146431 that was fixed by #154746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154786
Approved by: https://github.com/Skylion007, https://github.com/galv

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-06-01 04:17:54 +00:00
cyy
5616fa4a68 [Submodule] Bump flatbuffers to v24.12.23 (#143964)
This sub-module has not been updated for a long time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143964
Approved by: https://github.com/Skylion007
2025-06-01 02:25:57 +00:00
c33fc9dae3 [BE][Ez]: Update VulkanMemoryAllocator to 3.3.0 (#154796)
Last update to this submodule was 3 years ago, and the API is pretty stable and this is a minor version release update. Part of a bunch of PRs to eradicate low CMake required versions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154796
Approved by: https://github.com/jansel
2025-06-01 00:30:56 +00:00
9ce2732b68 [BE][Ez]: Fully type nn.utils.clip_grad (#154801)
Full types clip_grad and exposed typing annotations that were hidden by a bad decorator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154801
Approved by: https://github.com/jansel
2025-05-31 23:06:45 +00:00
dbad6d71c7 [BE][Ez]: Unskip conv1d MPS test (#154795)
Fixes issue I noticed where conv1d test is skipped for complex types unconditionally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154795
Approved by: https://github.com/jansel
2025-05-31 23:01:19 +00:00
b85c460749 [BE][Ez]: Update NVTX submodule to 3.2.1 (#154797)
Update NVTX3 submodule to 3.2.1.
* Mostly improved compiler support, Python support, and better CMake and C++ support.
* Also has a few new APIs to support fancy new features.
* This is header only library so should be an easy non-invasive change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154797
Approved by: https://github.com/jansel
2025-05-31 23:01:13 +00:00
6a781619bf Temporarily disable sparse tensor validation when loading from external storage. (#154758)
As in the title per https://github.com/pytorch/pytorch/issues/153143#issuecomment-2917793067 .

The plan is to workout a solution that will allow (1) disabling pinned memory check to fix the original issue and (2) switching off the sparse tensor validation for maximal performance in loading sparse tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154758
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-05-31 19:45:44 +00:00
c99e91b1d7 [BE]Enhance _get_clean_triton.py to auto-generate launch_params if missing (#154666)
Previously, @Chillee wrote a script https://github.com/pytorch/pytorch/pull/125811 to remove inductor dependency for inductor compiled triton kernels. We'd like to automate the process of obtaining the launch parameters.

Added functionality to the torch/utils/_get_clean_triton.py to automatically generate the launch_params file if it does not exist and the auto_generate_params flag is set to True. This includes running the input file in a subprocess with the appropriate environment variable. Updated the get_clean_triton function and the main script to support this new feature, allowing users to disable auto-generation via a command-line argument.

# Test Plan
test embedding op in TritonBench
```
# generate inductor compiled triton kernels
TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_FX_GRAPH_CACHE=0 python run.py --op embedding  --mode fwd  --precision fp32 --metrics nsys_rep --only inductor_embedding  --num-inputs 1 --input-id 11
# run the script to get rid of inductor dependency. By default, triton_only_repro.py is the output file name.
python ~/pytorch/torch/utils/_get_clean_triton.py ~/tritonbench/torch_compile_debug/run_2025_05_29_14_47_50_497790-pid_849274/torchinductor/model__0_forward_1.0/output_code.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154666
Approved by: https://github.com/davidberard98
2025-05-31 19:27:56 +00:00
c014e4bcaa Fix typo in vec256 interleave2 (#154784)
Fix a typo where the elements in a vector are mislabeled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154784
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-05-31 14:17:10 +00:00
daff263062 [Functorch] Support Functorch for PrivateUse1 backend (#154700)
This PR enable that functorch to be used in 3rd party backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154700
Approved by: https://github.com/zou3519
2025-05-31 07:28:45 +00:00
15e9119a69 [BE] install_triton_wheel.sh update for internal dev (#154637)
internal devgpu gets mad at `pip install ...` but `python3 -m pip install ...` is fine
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154637
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-31 06:57:56 +00:00
7368eeba5e [dynamo][guards] Prevent LENGTH guard on nn modules (#154763)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154763
Approved by: https://github.com/williamwen42
2025-05-31 05:32:31 +00:00
7a79de1c0f [inductor] Add kernel_hash_key to ChoiceCaller (#154470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154470
Approved by: https://github.com/mlazos
2025-05-31 03:09:37 +00:00
bd10ea4e6c Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit ad26ec6abe51d528124bc5fbbacaa87aef077ab8.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2923997777))
2025-05-31 02:14:24 +00:00
43390d8b13 ROCm Sparsity through HipSparseLT (#150578)
TLDR:

- This pull request introduces support for hipSPARSELt in ROCm, current usage would be semi-structure sparsity.
- Require **ROCm 6.4** && **gfx942/gfx950**.
- The average performance uplift (compare to dense operation) is ~ 20% in ROCm 6.4 but expect further performance lift along the way.

### Dense vs. Sparse Performance Comparison

#### **NT (Row-major)**
**Average Uplift**: `1.20`

| M     | N      | K      | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-------|--------|--------|-------------------------|-------------------------------|--------|
| 14336 | 8      | 4096   | 20.05                   | 25.3                          | 1.26   |
| 4096  | 8      | 14336  | 21.07                   | 25.28                         | 1.20   |
| 3072  | 3072   | 10240  | 299.05                  | 351.82                        | 1.18   |
| 3072  | 1536   | 768    | 18.56                   | 20.05                         | 1.08   |
| 3072  | 17664  | 768    | 163.13                  | 173.91                        | 1.07   |
| 3072  | 196608 | 768    | 1717.30                 | 1949.63                       | 1.14   |
| 3072  | 24576  | 768    | 206.84                  | 242.98                        | 1.17   |
| 3072  | 6144   | 768    | 53.90                   | 56.88                         | 1.06   |
| 3072  | 98304  | 768    | 833.77                  | 962.28                        | 1.15   |
| 768   | 1536   | 768    | 8.53                    | 19.65                         | 2.30   |
| 768   | 17664  | 768    | 46.02                   | 46.84                         | 1.02   |
| 768   | 196608 | 768    | 463.15                  | 540.46                        | 1.17   |
| 768   | 24576  | 768    | 54.32                   | 59.55                         | 1.10   |
| 768   | 6144   | 768    | 19.47                   | 20.15                         | 1.03   |
| 768   | 98304  | 768    | 231.88                  | 258.73                        | 1.12   |

---

#### **NN (Row-major)**
**Average Uplift**: `1.13`

| M   | N      | K     | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-----|--------|-------|-------------------------|-------------------------------|--------|
| 768 | 1536   | 3072  | 27.50                   | 28.78                         | 1.05   |
| 768 | 17664  | 3072  | 125.06                  | 158.94                        | 1.27   |
| 768 | 196608 | 3072  | 1568.38                 | 1767.12                       | 1.13   |
| 768 | 24576  | 3072  | 171.05                  | 203.49                        | 1.19   |
| 768 | 6144   | 3072  | 58.72                   | 60.39                         | 1.03   |
| 768 | 98304  | 3072  | 787.15                  | 887.60                        | 1.13   |

-------------------------

This pull request introduces support for hipSPARSELt in ROCm, alongside various updates and improvements to the codebase and test suite. The changes primarily involve adding configuration flags, updating conditional checks, and ensuring compatibility with hipSPARSELt.

### ROCm and hipSPARSELt Support:

* [`BUILD.bazel`](diffhunk://#diff-7fc57714ef13c3325ce2a1130202edced92fcccc0c6db34a72f7b57f60d552a3R292): Added `@AT_HIPSPARSELT_ENABLED@` substitution to enable hipSPARSELt support.
* [`aten/CMakeLists.txt`](diffhunk://#diff-0604597797bb21d7c39150f9429d6b2ace10b79ab308514ad03f76153ae8249bR104-R110): Introduced a conditional flag to enable hipSPARSELt support based on ROCm version.
* [`aten/src/ATen/CMakeLists.txt`](diffhunk://#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777R37): Added `AT_HIPSPARSELT_ENABLED` configuration.
* [`aten/src/ATen/cuda/CUDAConfig.h.in`](diffhunk://#diff-8bb82da825ca87c28233abacffa1b0566c73a54990b7a77f3f5108d3718fea15R11): Defined `AT_HIPSPARSELT_ENABLED` macro.
* `caffe2/CMakeLists.txt`, `cmake/Dependencies.cmake`, `cmake/public/LoadHIP.cmake`: Included hipSPARSELt in the ROCm dependencies. [[1]](diffhunk://#diff-c5ee05f1e918772792ff6f2a3f579fc2f182e57b1709fd786ef6dc711fd68b27R1380) [[2]](diffhunk://#diff-12e8125164bbfc7556b1781a8ed516e333cc0bf058acb7197f7415be44606c72L1084-R1084) [[3]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5R153)

### Codebase Updates:

* [`aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp`](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6): Added hipSPARSELt support checks and initialization functions. Updated various methods to conditionally handle hipSPARSELt. [[1]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6) [[2]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R22-R67) [[3]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R78-R85) [[4]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R97-R109) [[5]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R183-R188) [[6]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L134-R200) [[7]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R213-R222) [[8]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L217-R285)

### Test Suite Updates:

* [`test/test_sparse_semi_structured.py`](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65): Added checks for hipSPARSELt availability and updated test conditions to skip tests not supported on ROCm. [[1]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65) [[2]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR228) [[3]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR239) [[4]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR250) [[5]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR579) [[6]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR624) [[7]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR661) [[8]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR695) [[9]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR730) [[10]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR755) [[11]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR771) [[12]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR809) [[13]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR844) [[14]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cL840-R854) [[15]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR1005)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150578
Approved by: https://github.com/jeffdaily
2025-05-31 02:03:40 +00:00
cyy
ad26ec6abe Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because of it provides more CUDA targets such as `CUDA::nvperf_host` so that it is possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It's also possible to facilitate future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 01:54:35 +00:00
3e71016459 Revert "Aten vector default constructors set to 0, add fnmadd and fnmsub (#154298)"
This reverts commit 489afa829a248ca64c4b2dffe2e6d601b8816cf9.

Reverted https://github.com/pytorch/pytorch/pull/154298 on behalf of https://github.com/izaitsevfb due to breaks linux-jammy-aarch64-py3.10 / build ([comment](https://github.com/pytorch/pytorch/pull/154298#issuecomment-2923966688))
2025-05-31 01:51:59 +00:00
489afa829a Aten vector default constructors set to 0, add fnmadd and fnmsub (#154298)
Test Plan: The only functional change is zero-initialization instead of undefined-initialization. If tests pass, I think it should be fine.

Differential Revision: D75345074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154298
Approved by: https://github.com/swolchok
2025-05-31 01:32:45 +00:00
472773c7f9 [nativert] move OpKernelKind enum to torch (#154756)
Summary: att

Test Plan: ci

Differential Revision: D75703996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154756
Approved by: https://github.com/zhxchen17, https://github.com/cyyever
2025-05-31 01:31:29 +00:00
f01e628e3b Resubmit Remove MemPoolContext (#154042) (#154746)
Summary: Per title

Test Plan: Added tests + existing tests

Differential Revision: D75695030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154746
Approved by: https://github.com/malfet
2025-05-31 01:21:54 +00:00
932733e0e6 Fix memory leaks in mps_linear_nograph (#154765)
Fixes some memory leaks which were identified as part of the investigation of https://github.com/pytorch/pytorch/issues/154329. This doesn't appear to be the whole solution but wanted to merge this anyway since it's a quick fix

In my tests I see roughly 3MB of unexpected memory growth before this change, and after this change I see 2.2MB of memory growth
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154765
Approved by: https://github.com/malfet
2025-05-31 00:46:12 +00:00
108422ac26 Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 78624679a876a21acb14bf075ba6beccff21b9a0.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2923785799))
2025-05-31 00:28:03 +00:00
da4aacabac Add h100_distributed label (#154562)
Add h100_distributed label, testing distributed 3D composability tests on 8*H100 GPU node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154562
Approved by: https://github.com/seemethere
2025-05-31 00:17:43 +00:00
9b5308cd58 [upstream triton] support build with setup.py in ./python/ or in ./ (#154635)
Upstream triton has moved setup.py from python/ to ./.  This PR allows versions to be buildable by checking the location of setup.py and choosing the cwd of the build commands based on the location.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154635
Approved by: https://github.com/atalman
2025-05-31 00:15:43 +00:00
b019a33f8f [ez][CI] Reuse old whl: remove old zip/whl (#154770)
Forgot that unzip doesn't get rid of the zip so the old one is still there

Unrelated: figure out how to update the git version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154770
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2025-05-31 00:13:24 +00:00
0fab32290a Revert "[draft export] avoid storing intermediate real tensors in proxies (#154630)"
This reverts commit 5acb8d50801e6d110790993464611314dd1bd54b.

Reverted https://github.com/pytorch/pytorch/pull/154630 on behalf of https://github.com/malfet due to This still ooms, at least occasionally see 78624679a8/1 ([comment](https://github.com/pytorch/pytorch/pull/154630#issuecomment-2923759745))
2025-05-31 00:07:56 +00:00
faf973da5e [refactor] move materialize_as_graph to _higher_order_ops/utils.py (#154070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154070
Approved by: https://github.com/zou3519
2025-05-31 00:06:44 +00:00
cyy
78624679a8 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because of it provides more CUDA targets such as `CUDA::nvperf_host` so that it is possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It's also possible to facilitate future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 00:01:52 +00:00
5f1c3c67b2 [pgo] log dynamic whitelist in PT2 Compile Events (#154747)
Summary: logs the whitelist to PT2 Compile Events

Test Plan: loggercli codegen GeneratedPt2CompileEventsLoggerConfig

Reviewed By: bobrenjc93

Differential Revision: D75617963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154747
Approved by: https://github.com/angelayi
2025-05-30 23:54:24 +00:00
bbda22e648 [BE][Ez]: Optimize unnecessary lambda with operator (#154722)
Automated edits performed by FURB118. Operator is implemented in C and way faster when passed to another C method like sorted, max etc as a `key=`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154722
Approved by: https://github.com/jansel
2025-05-30 23:47:10 +00:00
0f3db20132 [ez][CI] Do not reuse old whl if deleting files (#154731)
Thankfully very few commits actually delete files so I don't think has affected anything
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154731
Approved by: https://github.com/Skylion007
2025-05-30 22:35:13 +00:00
eb93c0adb1 [inductor][AMD] support special kwargs in AMD triton configs (#154605)
**Context**:

AMD triton kernels can be launched with special kwargs, like `waves_per_eu`. Triton configs with these kwargs look like this:

```
triton.Config({
    "BLOCK_SIZE": 64,
    "waves_per_eu": 2,
})
```

in comparison, nvidia's special kwargs are explicit parameters on the config, e.g. num_warps:

```
triton.Config(
    {"BLOCK_SIZE": 64},
    num_warps=4,
)
```

**Problem**: this causes custom triton kernels w/ PT2 to error out, because there's a kwarg in the triton.Config that doesn't appear in the kernel signature.

**Solution**: When splicing in the constexpr values into the arg list, ignore any values in the config kwargs list if they don't appear in the function signature.

Differential Revision: [D75599629](https://our.internmc.facebook.com/intern/diff/D75599629/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D75599629/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154605
Approved by: https://github.com/njriasan
2025-05-30 22:24:32 +00:00
1193bf0855 Revert "convert inductor codecache to use getArtifactLogger (#153766)"
This reverts commit 5b6fd277f954b789649501e21e9689a42d565e13.

Reverted https://github.com/pytorch/pytorch/pull/153766 on behalf of https://github.com/malfet due to I want to revert this change as I'm 90+% certain it somehow broke testing ([comment](https://github.com/pytorch/pytorch/pull/153766#issuecomment-2923620806))
2025-05-30 22:20:07 +00:00
26aa8dcf27 [ONNX] Simplify onnx test dependencies (#154732)
Simplify onnx test dependencies and bump onnxscript to 0.3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154732
Approved by: https://github.com/Skylion007
2025-05-30 21:58:04 +00:00
5acb8d5080 [draft export] avoid storing intermediate real tensors in proxies (#154630)
Handles GC for non-strict draft export; GPU memory usage shouldn't be much more than eager mode + input tensors now.

While trying to do draft export CPU offloading, I found out GC is feasible, because in non-strict, there's 2 places holding references to a `.real_tensor` attribute:
1) the FakeTensors in fake tensor prop, but these are held by the actual variables in the model's forward call, and so the real tensor gets gc-ed along with the fake one when the variable goes out of scope.
2) A clone of the fake tensor in 1) stored in `proxy.node.meta["val"]`, which was added in https://github.com/pytorch/pytorch/pull/150948. But we didn't actually need to store them on intermediate values; the placeholders are enough for retracing/lowering.

Avoiding storing the intermediate values in 2), the values in 1) should be naturally GC-ed, and the real-tensor memory usage for non-strict should be pretty similar to eager computation?

Strict still OOMs; dynamo still holds these in variable tracking, and not sure how to GC those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154630
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-05-30 21:06:55 +00:00
abc2264e8f remove another instance of mtia_workloadd from pytorch (#154739)
Summary: ^

Test Plan: CIs

Differential Revision: D75692171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154739
Approved by: https://github.com/sraikund16
2025-05-30 20:50:46 +00:00
22a4cabd19 [Inductor] Add NaN assert to returned values from generated code (#154455)
Summary: It is possible to have `reinterpret_tensor` in the output of inductor codegen, e.g. `reinterpret_tensor(buf366, (1024, ), (1, ), 0)` in the return tuple. This adds assertions to all return values from inductor codegen to prevent nans from slipping through and being hard to trace.

Test Plan:
NaN asserts properly generated in example gemm script:

    vars = (buf1, primals_2, buf2, primals_1, )
    for var in vars:
        if isinstance(var, torch.Tensor):
            assert not var.isnan().any().item()
            assert not var.isinf().any().item()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154455
Approved by: https://github.com/eellison
2025-05-30 20:32:56 +00:00
ed1ff7d0fb [BE][Ez]: Update mimalloc submodule to 2.2.3 (#154720)
Updating minor version of mimalloc. The old version is more than 2 years old, and the newer release has performance fixes and compiler fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154720
Approved by: https://github.com/jansel
2025-05-30 20:17:13 +00:00
2f03673ebf [BE][Ez]: Enable ClangFormat aten/src/core/Formatting.cpp (#154719)
Follow up to #152830 . Noticed the file was excluded from fromatting, opt in to clang-format since it's really close anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154719
Approved by: https://github.com/jansel
2025-05-30 19:52:43 +00:00
f57754e815 [Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#154618)
This is a follow-up PR of the reverted one https://github.com/pytorch/pytorch/pull/148981 re-opening for visibility :

Modified TorchInductor’s autotuning flow so that each best_config JSON file also includes the Triton “base32” (or base64) cache key.

Motivation

Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belongs to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging.

Also, since Triton already stores cubin/hsaco in its cache, developers/researchers can avoid to set store_cubin = True since they can get the cubin/hsaco in the Triton cache and with the code provided in this PR, they can easily match the best_config with the right Triton cache directory for the "best" kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154618
Approved by: https://github.com/jansel
2025-05-30 19:30:25 +00:00
d6edefefbf [CUDA] Fixes for backwards in memefficient attn for large tensors (#154663)
followup to #154029.

@ngimel Backwards had the same problem as well so this PR fixes it and adds support for logsumexp computation in the forward pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154663
Approved by: https://github.com/ngimel
2025-05-30 19:30:07 +00:00
d89d213118 Fix test_tensorboard when started w/o tensorboard package (#154709)
If `TEST_TENSORBOARD == False` then `DataType` is not defined or imported. However it is used unconditionally when defining the test with `parametrize` which leads to an NameError crashing the test execution on start.

Provide a Dummy to make it syntactially correct. Tests will be skipped on start.

```
  File "/dev/shm/build/pytorch-v2.2.1/test/test_tensorboard.py", line 885, in <module>
    class TestTensorProtoSummary(BaseTestCase):
  File "/dev/shm/build/pytorch-v2.2.1/test/test_tensorboard.py", line 889, in TestTensorProtoSummary
    (torch.float16, DataType.DT_HALF),
                    ^^^^^^^^
NameError: name 'DataType' is not defined
Got exit code 1, retrying...
test_tensorboard 1/1 failed! [Errno 2] No such file or directory: '/dev/shm/build/pytorch-v2.2.1/.pytest_cache/v/cache/stepcurrent/test_tensorboard_0_0dba8bc00bbe233f'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154709
Approved by: https://github.com/Skylion007
2025-05-30 19:18:43 +00:00
22641f42b6 [Binary-builds]Use System NCCL by default in CI/CD. (#152835)
Use System NCCl by default. The correct nccl version is already built into the Manylinux docker image.

Will followup with PR on detecting if user has NCCL installed and enabling USE_SYSTEM_NCCL by default in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152835
Approved by: https://github.com/malfet
2025-05-30 18:51:48 +00:00
967937872f [dynamo] Remove dead code path for torch.Tensor.view(*shape) (#154646)
This was introduced in early days of Dynamo, and looks like it's been
fixed since -- the regression test `test_transpose_for_scores` passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154646
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #154645
2025-05-30 18:50:58 +00:00
f9dc20c7a3 [dynamo] Fix syntax error in aot graph from kwarg-less torch.Tensor.[random_|uniform_] calls (#154645)
As title, fixes #151432, see more context in the issue discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154645
Approved by: https://github.com/zou3519
2025-05-30 18:50:58 +00:00
fb67fa9968 Revert "[Inductor] Add NaN assert to returned values from generated code (#154455)"
This reverts commit aec3ef100844631cb7c4ce2725157984eb9cebfe.

Reverted https://github.com/pytorch/pytorch/pull/154455 on behalf of https://github.com/malfet due to Looks like it broke inductor/test_compile_subprocess.py::CpuTests::test_AllenaiLongformerBase, see 35fc5c49b4/1(default%2C%20&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/154455#issuecomment-2923154249))
2025-05-30 18:45:01 +00:00
35fc5c49b4 Revert "[internal] Expose additional metadata to compilation callbacks (#153596)"
This reverts commit f889dea97dad3cc506d43e379a469334417040c8.

Reverted https://github.com/pytorch/pytorch/pull/153596 on behalf of https://github.com/izaitsevfb due to introduces bunch of callback-related failures on rocm ([comment](https://github.com/pytorch/pytorch/pull/153596#issuecomment-2923139061))
2025-05-30 18:39:27 +00:00
b6b9311f4f [BE][Ez]: Fix typo in dynamo utils #154639 (#154748)
Fixes a typo in #154639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154748
Approved by: https://github.com/ngimel
2025-05-30 18:39:01 +00:00
bbdf469f0e Add CPython dict tests (#150791)
Tests:
* test_dict.py
* test_ordered_dict.py
* test_userdict.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:

```bash
for f in "test_dict" "test_ordered_dict" "test_userdict"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150791
Approved by: https://github.com/zou3519
2025-05-30 18:17:09 +00:00
2120eeb8de [BE][Ez]: Improve dynamo utils typing with TypeIs and TypeGuard (#154639)
Adds some additional TypeIs and TypeGuard to some _dynamo utils for additional type narrowing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154639
Approved by: https://github.com/jansel
2025-05-30 18:09:50 +00:00
1b569e5490 Fix load_state_dict description (#154599)
Fixes #141364

Fix missing description in `assign` param

## Test Result

### Before
![image](https://github.com/user-attachments/assets/5928c691-4e31-463b-aa0a-86eb8bb452e5)

### After
![image](https://github.com/user-attachments/assets/036631a2-0f20-4a71-95c3-2c0fd732293e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154599
Approved by: https://github.com/colesbury, https://github.com/mikaylagawarecki
2025-05-30 18:08:59 +00:00
30ac7f4d4e [EZ/Memory Snapshot] Remove Handle even if compile_context not set (#154664)
Summary: When setting the memory snapshot callback we register and unregister callbacks for performance reasons. For ease of use, it makes sense to just remove all callbacks regardless of which flags are enabled. The enable stays behind a feature flag, this just changes the disable to ignore the flag itself.

Test Plan: Ran without any flags and saw all callbacks removed.

Differential Revision: D75636035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154664
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2025-05-30 18:08:37 +00:00
65d8dba735 [nativert] move layout planner settings to torch (#154668)
Summary: att

Test Plan: ci

Differential Revision: D75633031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154668
Approved by: https://github.com/zhxchen17
2025-05-30 17:33:27 +00:00
3bdceab124 [dynamo] fix: added star operator for graph_break_hints (#154713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154713
Approved by: https://github.com/zou3519, https://github.com/williamwen42
2025-05-30 17:31:03 +00:00
802ffd06c8 [Export] Add math module for deserialization (#154643)
Summary: As title

Test Plan: ci

Differential Revision: D75580646

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154643
Approved by: https://github.com/yushangdi
2025-05-30 17:29:25 +00:00
fc0135ca11 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There has been a lot of dynamic shape fixes done this year and tests pass so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Differential Revision: [D75467694](https://our.internmc.facebook.com/intern/diff/D75467694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-30 17:23:36 +00:00
3027051590 [export] avoid float/bool specialization for scalar tensor construction (#154661)
Fixes #153411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154661
Approved by: https://github.com/angelayi
2025-05-30 17:18:21 +00:00
e7bf72c908 [multigraph] fix composabilty with aotautograd cache (#153526)
AOTAutogradCache uses FXGraphCache which uses the tracing context to get the ShapeEnv. Although the TracingContext global_context is cleared by the time we get around to reusing it, we don't actually need it. We just need the ShapeEnv in the TracingContext, which isn't cleared at the end of dynamo and does persist. This PR adds the tracing context manager around the specialized compile to ensure our caching infrastructure can get access to the ShapeEnv. A test was also added to prove correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153526
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
ghstack dependencies: #153433, #153449
2025-05-30 16:56:17 +00:00
7183f52675 [dynamo] Support namedtuple subclass (#153982)
Fixes #133762. This involves
1. support tuple subclass constructed inside compile region.
2. handle the "fake" global scope associated with NamedTuple-generated
   `__new__`.
3. handle `namedtuple._tuplegetter` more faithfully.

Differential Revision: [D75488091](https://our.internmc.facebook.com/intern/diff/D75488091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153982
Approved by: https://github.com/jansel
ghstack dependencies: #154176
2025-05-30 16:14:37 +00:00
8002d22ce3 [dynamo] Trace into descriptor with __set__ (#154176)
As title, this patch basically implements
https://github.com/python/cpython/blob/3.11/Objects/object.c#L1371-L1452,
and make the `__get__` handling more robust.

I ran into this while fixing #133762.

Differential Revision: [D75488090](https://our.internmc.facebook.com/intern/diff/D75488090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154176
Approved by: https://github.com/jansel
2025-05-30 16:14:37 +00:00
31f95b5d2e Revert "inductor codecache: include private inductor configs in cache key (#153672)"
This reverts commit 2c1cb38d9516e10474b4f12a2e839046648a71a8.

Reverted https://github.com/pytorch/pytorch/pull/153672 on behalf of https://github.com/malfet due to Looks like it regressed pr_time_benchmarks, see ba3f91af97/1 ([comment](https://github.com/pytorch/pytorch/pull/153672#issuecomment-2922759739))
2025-05-30 15:54:14 +00:00
4b1f047a33 Add CPython list/tuple tests (#150790)
Tests:
* test_list.py
* test_tuple.py
* test_userlist.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:

```bash
for f in "test_raise" "test_list" "test_tuple" "test_userlist"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150790
Approved by: https://github.com/williamwen42
2025-05-30 15:53:38 +00:00
ba3f91af97 Type hints for distributions/utils (#154712)
Fixes #144196
Part of #144219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154712
Approved by: https://github.com/Skylion007
2025-05-30 15:50:31 +00:00
0f81c7a28d [CI] Pin the torchao version used when testing torchbench (#154723)
Summary: To fix a recent CI breakage. As a follow-up, the torchao pin in .github/ci_commit_pins/torchao.txt is 6-month old. We should bump up that once we verify this fix works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154723
Approved by: https://github.com/eellison
2025-05-30 15:04:26 +00:00
7e8532077f Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 1ece53b157db4425ad12cae31fb570c591dc19e7.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2922369830))
2025-05-30 13:16:33 +00:00
cyy
1ece53b157 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because of it provides more CUDA targets such as `CUDA::nvperf_host` so that it is possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It's also possible to facilitate future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-30 11:25:30 +00:00
9d6f0d5991 avoid sym_max on nested int in is_contiguous. (#154633)
calling is_contiguous will fail due to sym_max not being supported for nested int, this address in a way consistent with
make_contiguous_strides_for
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154633
Approved by: https://github.com/bobrenjc93
2025-05-30 09:59:33 +00:00
3c05167489 [Intel GPU] fix matmul accuracy when offset > 0 (#154495)
This pr will make matmul tensors contiguous if they are not 64 byte alignment. oneDNN requires a minimal alignment of 64 https://uxlfoundation.github.io/oneDNN/dev_guide_c_and_cpp_apis.html#intel-r-processor-graphics-and-xe-architecture-graphics

Fixes https://github.com/intel/torch-xpu-ops/issues/1656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154495
Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang
2025-05-30 09:53:51 +00:00
aec3ef1008 [Inductor] Add NaN assert to returned values from generated code (#154455)
Summary: It is possible to have `reinterpret_tensor` in the output of inductor codegen, e.g. `reinterpret_tensor(buf366, (1024, ), (1, ), 0)` in the return tuple. This adds assertions to all return values from inductor codegen to prevent nans from slipping through and being hard to trace.

Test Plan:
NaN asserts properly generated in example gemm script:

    vars = (buf1, primals_2, buf2, primals_1, )
    for var in vars:
        if isinstance(var, torch.Tensor):
            assert not var.isnan().any().item()
            assert not var.isinf().any().item()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154455
Approved by: https://github.com/eellison
2025-05-30 08:53:24 +00:00
dc82e911e7 remove allow-untyped-defs from torch/utils/data/datapipes/iter/filelister.py (#154624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154624
Approved by: https://github.com/Skylion007
2025-05-30 08:38:05 +00:00
639f459cb6 Revert "[Inductor] Add NaN assert to returned values from generated code (#154455)"
This reverts commit c3de2c7c6bc865b9fabd2db8f2af6383936aa653.

Reverted https://github.com/pytorch/pytorch/pull/154455 on behalf of https://github.com/huydhn due to Sorry for reverting your change, I am trying to see if it help fix the broken trunk below.  It it does not help, I will reland the PR ([comment](https://github.com/pytorch/pytorch/pull/154455#issuecomment-2921562089))
2025-05-30 08:11:22 +00:00
f889dea97d [internal] Expose additional metadata to compilation callbacks (#153596)
These hooks are used by internal stuck job detection to associate compilation events with the compile lease. Previously, we only had events for Dynamo and Inductor compilation. And recently, the callback handler was updated to ignore nested events. So the Inductor event was only really used by lazy backward.

Here, I remove the inductor event, and add an explicit lazy backward one. Additionally, I add other runtime compilation events: autotuning and cudagraphs. I also expose the CompileId as a string to avoid imports, this will let internal UIs track each graph's contribution to the timeout.

```python
class CallbackTrigger(enum.Enum):
    # most common case, dynamo attempts to trace a new frame
    DYNAMO = 1
    # backward compilation can be deferred to runtime
    LAZY_BACKWARD = 2
    # some backends autotune at runtime
    TRITON_AUTOTUNING = 3
    # cudagraphs record at runtime
    CUDAGRAPH_RECORDING = 4
```

Differential Revision: [D75092426](https://our.internmc.facebook.com/intern/diff/D75092426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153596
Approved by: https://github.com/masnesral
2025-05-30 08:07:04 +00:00
208965a9d6 Fix unbackend symint error (#154672)
## Summary

Me and @laithsakka  spoke offline about this one, TLDR is that we wanted this
![image](https://github.com/user-attachments/assets/2e537612-3261-4fbe-a6b9-f8ff92ba3c37)

to also be true for Inductor. In that vein we added two new apis to size-vars which is `guard_or_false`, or `guard_or_true`
with the semantics:

guard_or_false, guard_or_true:

Those APIs may add guards, but will never fail with data-dependent errors; They will try to evaluate the expression with the possibility of adding guards, if that fails due to data dependency, instead of hard failing. False or True are returned.

When to use this?

Performance optimizations that warrant a recompilation.

Take the general path and add a runtime check.
```
# Consider this branching.
if x==0:
    return 1
else
    return 10
# To make data dependent friendly, it can be written as the following:
if guard_or_false(x==0):
    return 1
else
  torch.check(x!=0) # runtime check
  return 10
```

However there is still 1 more api to add to make this example work which is the torch.check which works with expressions, I will leave that to the @laithsakka

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154672
Approved by: https://github.com/laithsakka
2025-05-30 07:45:01 +00:00
5a7442b91f remove allow-untyped-defs from torch/distributed/checkpoint/resharding.py (#154626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154626
Approved by: https://github.com/Skylion007
2025-05-30 07:43:04 +00:00
d66a55def0 remove allow-untyped-defs from torch/distributed/elastic/utils/logging.py (#154625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154625
Approved by: https://github.com/Skylion007
2025-05-30 07:37:56 +00:00
382b38ed1b remove allow-untyped-defs from torch/nn/utils/_expanded_weights/conv_expanded_weights.py (#154623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154623
Approved by: https://github.com/Skylion007
2025-05-30 07:32:57 +00:00
bcbd2a22b2 [Intel GPU] OneDNN primitive cache support for Int4 WOQ gemm on XPU (#147693)
* add onednn primitive cache for int4 gemm for xpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147693
Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/guangyey, https://github.com/ZhiweiYan-96

Co-authored-by: Yan, Zhiwei <zhiwei.yan@intel.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-05-30 07:26:36 +00:00
0df96e3921 remove allow-untyped-defs from torch/ao/quantization/stubs.py (#154622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154622
Approved by: https://github.com/Skylion007
2025-05-30 07:26:09 +00:00
30f7079c93 [FSDP2] allow different dtypes for no grad model params (#154103)
Fixes #154082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154103
Approved by: https://github.com/weifengpy
2025-05-30 07:00:54 +00:00
d173ba5a75 Revert "Remove MemPoolContext (#154042)"
This reverts commit 3b38989b5f8f918cf1ad38bdade059608544af4b.

Reverted https://github.com/pytorch/pytorch/pull/154042 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154042#issuecomment-2921401100))
2025-05-30 06:53:37 +00:00
0fdd568b78 [forward fix] add support for MemoryFormat after type tightening (#154658)
Summary:
fixes error:
```
    raise AssertionError(f"Unexpected type in c_type_for_prim_type: {type_=}")
AssertionError: Unexpected type in c_type_for_prim_type: type_=MemoryFormat
```

after https://github.com/pytorch/pytorch/pull/154371 | D75568111

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/test:test_custom_ops -- --exact 'deeplearning/aot_inductor/test:test_custom_ops - test_export_extern_fallback_nodes (deeplearning.aot_inductor.test.test_custom_ops.TestAOTInductorProxyExecutor)'
```

Differential Revision: D75617432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154658
Approved by: https://github.com/Camyll, https://github.com/atalman, https://github.com/malfet
2025-05-30 06:53:25 +00:00
a4b0023f3b [cutlass backend] Cache config generation locally and remotely (#154686)
Summary:
Trying to cache the json list of configs.

There are probably some more work:
* preset
* filelock (?)
* for cases where we generate from scratch, save it to local as well (?)

Test Plan: tested offline

Reviewed By: coconutruben

Differential Revision: D75334439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154686
Approved by: https://github.com/coconutruben, https://github.com/ColinPeppler
2025-05-30 05:40:46 +00:00
ba51f4876d Revert "Enable C++ dynamic shape guards by default (#140756)"
This reverts commit dc0f09a4785349fc3b4e4d3dc3c02b018e5a0534.

Reverted https://github.com/pytorch/pytorch/pull/140756 on behalf of https://github.com/izaitsevfb due to seem to break dynamo tests ([comment](https://github.com/pytorch/pytorch/pull/140756#issuecomment-2921151663))
2025-05-30 03:52:02 +00:00
852b99eba0 Revert "[c10d] Separate monitoring thread into a class in PGNCCL (#153977)"
This reverts commit 0db9c64d68dcdf25210357c4f7a41618441091d4.

Reverted https://github.com/pytorch/pytorch/pull/153977 on behalf of https://github.com/izaitsevfb due to breaks lots of jobs internally, safer to revert, see D75628917 ([comment](https://github.com/pytorch/pytorch/pull/153977#issuecomment-2921146129))
2025-05-30 03:46:43 +00:00
20ee5f9044 remove allow-untyped-defs from elastic_distributed_sampler.py (#154620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154620
Approved by: https://github.com/Skylion007
2025-05-30 03:29:45 +00:00
9c06dff1ce [multigraph] use specializations in compile_and_call_fx_graph (#153449)
The goal of this multigraph work is to enable a compiled region that has a single dynamo trace but multiple backend specializations. This work was inspired by vLLM which does this in a somewhat hacky way where they use a custom backend to capture a dynamo graph and then manually invoke compile_fx multiple times to get specialized graphs.

There's really two parts of this work:

**The frontend changes:**
1) we introduce an optional kwarg `specialize_on` to mark_{dynamic,unbacked} that takes in a list of specializations. I debated other methods including specifying specializations via decorators, but ultimately decided this approach was more harmonious. The big issue with decorators is the difficulty of composing well with the rest of the torch.compile ecosystem including graph breaks, lazy initialization of variable trackers and symbolic variables, etc.

**The backend changes (this PR):**
1) We capture the backend_specialization specified in the mark_{dynamic,unbacked} API into a SymbolicContext. See changes in `/_dynamo/variables/builder.py`
2) After we are done dynamo tracing, we will lazily (more on this later) invoke `call_user_compiler` up to N + 1 times for N specializations and 1 generic graph. Under the hood this will call compile_fx, which composes nicely with both Async Compile and AOTAutogradCache. We do this by using a context manager to patch in specialization specific axioms into the ShapeEnv before invoking the user compiler.
3) When we have specializations, we install a lazy specialized dispatch function that checks each specialization and dispatches to the first one that matches. Instead of doing all of the specialization compiles up front, we do the compiles lazily. The first time a specialization is invoked, we will do the compilation and save it in a cache so subsequent invocations are fast. If none of the specializations match, we dispatch to the generic graph. I decided to do this over returning N different GuardedCodes since 1) it doesn't pollute the dynamo cache (eg. if you have 8 specializations, you would hit the cache limit) 2) it naturally incorporates the hierarchical lattice structure of the guards since the specializations are always necessarily stricter than the generic region's guards.

I benchmarked this PR stack with #152596 and found around a 50% reduction when dispatching to the specialized regions:

![495269647_576053105510082_9189856138964956774_n](https://github.com/user-attachments/assets/66030fed-d62e-4d87-940f-aa13c99b1a73)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153449
Approved by: https://github.com/zou3519
ghstack dependencies: #153433
2025-05-30 03:19:49 +00:00
c3de2c7c6b [Inductor] Add NaN assert to returned values from generated code (#154455)
Summary: It is possible to have `reinterpret_tensor` in the output of inductor codegen, e.g. `reinterpret_tensor(buf366, (1024, ), (1, ), 0)` in the return tuple. This adds assertions to all return values from inductor codegen to prevent nans from slipping through and being hard to trace.

Test Plan:
NaN asserts properly generated in example gemm script:

    vars = (buf1, primals_2, buf2, primals_1, )
    for var in vars:
        if isinstance(var, torch.Tensor):
            assert not var.isnan().any().item()
            assert not var.isinf().any().item()

Differential Revision: D74691131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154455
Approved by: https://github.com/eellison
2025-05-30 03:09:37 +00:00
4a302b5731 NativeRT readme (#154581)
Summary: att

Test Plan: ci

Differential Revision: D75557667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154581
Approved by: https://github.com/Skylion007, https://github.com/zhxchen17, https://github.com/yiming0416
2025-05-30 02:50:53 +00:00
adfd5b293a Enhance UT on elapsed_time for XPUEvent (#154494)
# Motivation
UT enhancement to avoid the incorrect elapsed time return by xpu's Event.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154494
Approved by: https://github.com/EikanWang
2025-05-30 02:00:02 +00:00
0289313551 [AOTI] Support OptionalTensor return type in AOTI proxy executor (#154286)
Summary:

When a C++ custom op returns an uninitialized tensor, it will be marked as None in Python. For this scenario, the user should mark the possibly uninitialized return as Tensor? in the custom op schema.
This diff adds `as_optional_tensor` type to export schema and the support for optional tensor in AOTI proxy executor.

Test Plan:

```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor_custom_ops -- -r test_fn_with_optional_tensor_output
```

Differential Revision: D75262529

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154286
Approved by: https://github.com/desertfire
2025-05-30 01:53:00 +00:00
58ead04ee9 [dynamic shapes] unbacked safe unsqueeze (#154087)
Also ran into this working on https://github.com/SWivid/F5-TTS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154087
Approved by: https://github.com/laithsakka
2025-05-30 01:41:57 +00:00
172015fc11 [multigraph] add specialize_on kwarg to mark_{dynamic,unbacked} (#153433)
The goal of this multigraph work is to enable a compiled region that has a single dynamo trace but multiple backend specializations. This work was inspired by vLLM which does this in a somewhat hacky way where they use a custom backend to capture a dynamo graph and then manually invoke compile_fx multiple times to get specialized graphs.

There's really two parts of this work:

**The frontend changes (this PR):**
1) we introduce an optional kwarg `specialize_on` to mark_{dynamic,unbacked} that takes in a list of specializations. I debated other methods including specifying specializations via decorators, but ultimately decided this approach was more harmonious. The big issue with decorators is the difficulty of composing well with the rest of the torch.compile ecosystem including graph breaks, lazy initialization of variable trackers and symbolic variables, etc.

**The backend changes:**
1) We capture the backend_specialization specified in the mark_{dynamic,unbacked} API into a SymbolicContext. See changes in `/_dynamo/variables/builder.py`
2) After we are done dynamo tracing, we will lazily (more on this later) invoke `call_user_compiler` up to N + 1 times for N specializations and 1 generic graph. Under the hood this will call compile_fx, which composes nicely with both Async Compile and AOTAutogradCache. We do this by using a context manager to patch in specialization specific axioms into the ShapeEnv before invoking the user compiler.
3) When we have specializations, we install a lazy specialized dispatch function that checks each specialization and dispatches to the first one that matches. Instead of doing all of the specialization compiles up front, we do the compiles lazily. The first time a specialization is invoked, we will do the compilation and save it in a cache so subsequent invocations are fast. If none of the specializations match, we dispatch to the generic graph. I decided to do this over returning N different GuardedCodes since 1) it doesn't pollute the dynamo cache (eg. if you have 8 specializations, you would hit the cache limit) 2) it naturally incorporates the hierarchical lattice structure of the guards since the specializations are always necessarily stricter than the generic region's guards.

I benchmarked this PR stack with #152596 and found around a 50% reduction when dispatching to the specialized regions:

![495269647_576053105510082_9189856138964956774_n](https://github.com/user-attachments/assets/66030fed-d62e-4d87-940f-aa13c99b1a73)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153433
Approved by: https://github.com/zou3519
2025-05-30 01:08:15 +00:00
9371491529 [Reland][pytorch] Patch the _is_conv_node function (#154473)
Summary: Add the conv padding ops in pytorch, the corresponding pr in torch ao is https://github.com/pytorch/ao/pull/2257

Test Plan:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_conv_padding_bn_relu (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```

Differential Revision: D75494468

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154473
Approved by: https://github.com/Skylion007
2025-05-30 00:41:03 +00:00
d6cb0fe576 [MPS] Extend index_copy support to complex dtypes (#154671)
Should have noticed it during the review
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154671
Approved by: https://github.com/dcci
ghstack dependencies: #154670
2025-05-30 00:28:13 +00:00
0134150ebb [MPS][BE] Do not copy sizes/strides unnecesserily (#154670)
Just pass them as args to `mtl_setArgs`, metaprogramming should deal with the rest
Also use `mtl_dispatch1DJob` instead of computing max threadgroup size by nand

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154670
Approved by: https://github.com/dcci
2025-05-30 00:28:13 +00:00
61bfb3df9f [a2av] Improve tuning for 4 GPUs (#154580)
### Problem
Running `nvshmem_all_to_all_vdev` on 4 x H100s (fully connected with NVSwitch).
Before:
```
Bytes: MiB, Time: us, BusBw: GB/s
0  32.29  16.23
1  33.01  31.76
2  33.01  63.54
4  33.83  123.97
8  49.83  168.34
16  80.82  207.59
32  178.66  187.82
64  335.79  199.86
128  646.72  207.54
256  1268.77  211.57
512  2511.14  213.80
1024  4998.31  214.82
2048  9964.49  215.51
4096  19892.34  215.91
```

215 GB/s does not reach the SOL of NV18 (350-400 GB/s).

### Change
If the number of peers decreases (say 8 to 4), we do not reduce the number of CTAs; instead, we shift more CTAs towards the data parallel dimension.

After:
```
Bytes: MiB, Time: us, BusBw: GB/s
0  25.01  20.96
1  25.70  40.80
2  25.76  81.42
4  28.87  145.26
8  40.79  205.64
16  61.46  272.97
32  111.82  300.06
64  202.40  331.57
128  382.56  350.84
256  739.11  363.19
512  1450.79  370.05
1024  2873.13  373.72
2048  5719.50  375.47
4096  11395.65  376.90
```

If we look at MoE related region, say 32 MB, we can see a 187 -> 300 GB/s improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154580
Approved by: https://github.com/ngimel
2025-05-30 00:26:13 +00:00
2c1cb38d95 inductor codecache: include private inductor configs in cache key (#153672)
Fixes https://github.com/pytorch/torchtitan/issues/1185

It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.

I'm not sure how worried we should be on the blast radius of this change. On the one hand:

(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)

(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen
ghstack dependencies: #153766
2025-05-30 00:24:29 +00:00
5b6fd277f9 convert inductor codecache to use getArtifactLogger (#153766)
I'm not entirely sure of the background for why inductor codecache code uses default python logging instead of the new TORCH_LOGS-based artifact logging, but switching it over to artifact logging makes it easier to use nice testing utils in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153766
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2025-05-30 00:24:29 +00:00
eqy
818f76a745 [cuDNN] Allow cudnn attention or flash attention in test_export.py regex (#154458)
Analogous to #153272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154458
Approved by: https://github.com/drisspg
2025-05-29 23:51:09 +00:00
dc0f09a478 Enable C++ dynamic shape guards by default (#140756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140756
Approved by: https://github.com/anijain2305, https://github.com/laithsakka
ghstack dependencies: #151225
2025-05-29 23:44:43 +00:00
0c6c7780d9 [Inductor] Add envvar to disable decomposeK (#154421)
Summary: Add envvar to Inductor config to disable decomposeK autotuning choice

Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -- --exact 'caffe2/test/inductor:max_autotune - test_max_autotune_decompose_k_dynamic_False_sizes2 (caffe2.test.inductor.test_max_autotune.TestMaxAutotune)' --run-disabled`

Reviewed By: eellison

Differential Revision: D75174823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154421
Approved by: https://github.com/eellison
2025-05-29 23:34:41 +00:00
9ba67e99bb [dynamo] keep C++ symbolic shape guards disabled for benchmarks (#151225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151225
Approved by: https://github.com/anijain2305
2025-05-29 23:29:39 +00:00
d5e0704247 [ROCm] Update maxpool launch config (#154619)
* Better perf on MI300 with updated launch configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154619
Approved by: https://github.com/jeffdaily
2025-05-29 23:28:07 +00:00
43b18d098b Forward fix for test_frame_traced_hook in internal testing (#154641)
Summary: Fixes the newly-added dynamo test test_frame_traced_hook so it can run internally

Test Plan: This is a test change

Differential Revision: D75616787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154641
Approved by: https://github.com/Skylion007
2025-05-29 23:02:01 +00:00
b040d63ce4 Prevent SAC cache from being kept alive by reference cycle (#154651)
Fixes https://github.com/pytorch/pytorch/issues/154642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154651
Approved by: https://github.com/xmfan
2025-05-29 22:27:35 +00:00
7d17253af8 [BE]: Improve aten formatter with fmtlib (#152830)
Replaces stateful ostream output with stateless fmtlib, which is signficantly faster and more contained. It is especially faster for the type of complex double formatting found here since it uses the newer [DragonBox algorithm](https://github.com/jk-jeon/dragonbox) for faster floating point formatting (which is the main bottleneck here). This also enables some static time checking of the formatting strings

test plan: all tests pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152830
Approved by: https://github.com/cyyever, https://github.com/malfet, https://github.com/atalman
2025-05-29 22:11:30 +00:00
fdbf314278 [Inductor] Cache subgraph autotuning choices properly (#154067)
Differential Revision: D75170507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154067
Approved by: https://github.com/eellison
2025-05-29 22:01:44 +00:00
c7e8e8ee19 Add torch.profile benchmarking function to feedback_fns (#153579)
Summary: Updates some benchmarking code to have the option to use torch.profile, and passes in a thunk to benchmark_fns to get this information (this will be a different result from `timings`, which are already passed into those functions).

Test Plan: Existing unit tests.

Differential Revision: D74444990

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153579
Approved by: https://github.com/coconutruben, https://github.com/masnesral, https://github.com/nmacchioni
2025-05-29 21:43:45 +00:00
1237f271aa [ROCm] MIOpen: Get current device from Torch rather than HIP in handle creation (#154549)
Get current device from Torch rather than HIP in MIOpen handle creation. The device may have already been set from torch side, otherwise device is set to 0 for handle.  Additional audits of cudnn vs miopen Handle.cpp file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154549
Approved by: https://github.com/jeffdaily, https://github.com/cyyever

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-29 21:12:12 +00:00
08fdc64c86 [ROCm] Exposing Some MIOpen Symbols (#2176) (#154545)
This PR exposes some MIOpen symbols, namely:

1. `miopenDataType_t getMiopenDataType(const at::Tensor& tensor)`
2. `miopenHandle_t getMiopenHandle()`
3. `class TensorDescriptor`
4. `class Descriptor`
5. `class FilterDescriptor`
6. `struct ConvolutionDescriptor`
7. `struct DropoutDescriptor`
8. `struct RNNDescriptor`

to enable adding extensions that make use of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154545
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-29 21:10:45 +00:00
83a0e4e6f9 [Visualizer] Start at index with most events (#154571)
Summary: Oftentimes a single snapshot will contain multiple GPU traces in it based on what the process can see. In this case lets just start with the gpu trace with the highest amount of activity

Test Plan:
Ran od with: https://www.35929.od.internalfb.com/pytorch_memory_visualizer/mvai_gpu_traces/tree/gpu_snapshot/fire-chujiechen-f701302011/1/rank-1_itrn-3.Mar_01_06_10_09.3747.snapshot.pickle
And it started at index 1 instead of 0

Differential Revision: D75555558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154571
Approved by: https://github.com/aaronenyeshi
2025-05-29 20:49:33 +00:00
2bc8fec744 deprecate MTIA_WORKLOADD from pytorch (#154627)
Differential Revision: D75612179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154627
Approved by: https://github.com/sraikund16
2025-05-29 20:30:40 +00:00
cb56df55dc [Inductor]Cleanup autotune_fallback_to_aten post-deprecation (#154331)
Fixes #153298

This PR is the 3rd and final step of #147479
All references to autotune_fallback_to_aten have been removed, and the feature is now deprecated.
All calls to should_fallback_to_aten() were also removed, as they were deemed unnecessary.

[henrylhtsang](https://github.com/henrylhtsang)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154331
Approved by: https://github.com/henrylhtsang
2025-05-29 20:29:58 +00:00
629fca295e Always set CPU affinity for benchmark jobs (#154569)
Because metrics like compilation time requires CPU.  I want to see if this help fix https://github.com/pytorch/pytorch/issues/152566

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154569
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-29 20:11:47 +00:00
3afbab66f7 [BE] Remove unused release scripts. Add clarifications for the branch cut process (#154649)
Scripts in ``scripts/release/promote/`` are not used for a while.
We use the ones in test-infra [here](https://github.com/pytorch/test-infra/blob/main/release/) .
Hence this small cleanup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154649
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2025-05-29 19:49:37 +00:00
e8f5c24d17 [rocm]add device guard when initialize single stream (#154433)
Summary: AMD streams are lazily initialized and sometimes (e.g. when we just want to do event recording on the stream) we might not be setting the device guard while it's initializing which would lead to invalid configuration error.

Differential Revision: D75456460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154433
Approved by: https://github.com/jeffdaily
2025-05-29 19:42:12 +00:00
20ec61a02f [BE] fix lint errors caused by const SROpFunctor fn (#154552)
Summary: Remove const quaiflier from SR suggsted from CLANGTIDY.

Test Plan: arc lint -a -e extra --take CLANGTIDY caffe2/torch/fb/sparsenn/cpu_operators/to_dense_representation_cpu.cpp

Reviewed By: henryoier

Differential Revision: D75534056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154552
Approved by: https://github.com/Skylion007
2025-05-29 19:40:08 +00:00
5a21d6f982 [AOTI][reland] Support multi-arch when using package_cpp_only (#154608)
Summary: Reland https://github.com/pytorch/pytorch/pull/154414

Add support of multi_arch_kernel_binary in the package_cpp_only mode. More specifically, generate specific cmake targets to compile .ptx to .fatbin and embed them in the final shared library or binary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154608
Approved by: https://github.com/yushangdi
2025-05-29 19:32:33 +00:00
0db9c64d68 [c10d] Separate monitoring thread into a class in PGNCCL (#153977)
This is the start of a series of efforts to consolidating auxiliary threads in PGNCCL, aka watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instances, i.e., if users create hundred or thousand instances of PG or subPGs, we will end up with that twice many side threads which is not efficient. We have a RFC to consolidate them (https://github.com/pytorch/pytorch/issues/146956). Right now both threads are assigned with so many functionalities so it is hard to do the consolidations in one shot, we will try to split it into at least two steps (PRs) to make it easier to test and review.

We did our first attemp in https://github.com/pytorch/pytorch/pull/153668 but we also want to try to see if we can make monitoring thread a class. This PR is doing the first step to make monitoring thread a class. The next step to also extract watchdog to be a separate class so that we know its dependency.

What we did in this PR:
1. Move all related variables and methods into a class named `HeartbeatMonitor`.
2. Correct some errors in the original logics inside monitoring thread loop.
3. Move the error propagation check to watchdog thread which is more relevant. This is totally fine since we rolled out EventCache out fully so watchdog hang is rare now.

Today there are two major functions inside heartbeat monitoring thread today:
1. Check the heartbeat of watchdog thread every 8 minutes. If no heartbeat detected and we are sure monitoring thread has not been stopped, we will kill the program by SIG_ABORT.
2. We check TCPStore every 30 sec to see if any watchdog timeout happens on other ranks, if so we will initiate a dump signal on the current rank as well. (We do this only in the default PG)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153977
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
2025-05-29 17:45:04 +00:00
6f992e1b3f [BE][AT] cleanup my old todo (#154542)
Summary: this todo is very old, and probably not needed anymore. let's have CI figure out if removing this breaks anything

Test Plan: CI

Differential Revision: D75491068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154542
Approved by: https://github.com/Skylion007
2025-05-29 17:22:01 +00:00
634ce22601 [MPSInductor] Fix codegen for nested multistage reductions (#154578)
Yet to write a unittest for it, but this fixes codegen for
```
python3 benchmarks/dynamo/torchbench.py --performance --only hf_T5  --backend inductor --inference --devices mps --float16
```

By correctly closing triple nested loop

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154578
Approved by: https://github.com/jansel, https://github.com/dcci
2025-05-29 17:09:25 +00:00
8883e494b3 [cutlass backend][ez] remove indent for cutlass config serialization (#154573)
Differential Revision: [D75566642](https://our.internmc.facebook.com/intern/diff/D75566642)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154573
Approved by: https://github.com/ColinPeppler
2025-05-29 17:00:52 +00:00
41092cb86c [MPS] index copy impl (#154326)
Second most requested op according to #154052

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154326
Approved by: https://github.com/malfet
2025-05-29 16:57:43 +00:00
733e684b11 Skip test file that doesn't run gradcheck for slow gradcheck (#154509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154509
Approved by: https://github.com/malfet
2025-05-29 16:32:26 +00:00
2c6f24c62d [ROCm] Updated default workspace for gfx95 (#153988)
Fixes test_cuda.py::test_cublas_workspace_explicit_allocation on gfx95

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153988
Approved by: https://github.com/jeffdaily
2025-05-29 16:22:17 +00:00
53b0f6f543 Revert "Use 3.27 as the minimum CMake version (#153153)"
This reverts commit 4613081b729273a9273185e9ef7470ce76e22da2.

Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/malfet due to It broke windows debug builds, see ef1d45b12d/1 ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2919897160))
2025-05-29 16:14:28 +00:00
ef1d45b12d Cleanup parent fallback logic (#154006)
The `parent` in fallback_node_due_to_unsupported_type is a duplication of `unsupported_output_tensor` logic. remove it. tested that the tests in test_add_complex give same codegen. this fixes an issue in mx that @drisspg was running into.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154006
Approved by: https://github.com/drisspg
2025-05-29 13:40:36 +00:00
d6e29bf875 Reflect back mutation if we clone misaligned tensors (#154442)
Fix for https://github.com/pytorch/pytorch/issues/152425

inductor specializes whether or not a tensor is 16-bit aligned on the first invocation. then, on subsequent invocations, if we inferred alignment but are passed a non-aligned tensor we clone the tensor.

If we infer alignment, then run with unaligned, and mutate the input, we need to reflect back the mutation to the input. This pr adds back that mutation.

We could have also been less aggressive about inferring alignment for mutated tensors, but that has a pretty perf hit.See the following benchmark:
```
import torch

t = torch.rand(4096 * 4096, device="cuda", dtype=torch.float16)

@torch.compile(dynamic=False)
def foo(x):
    return x.add_(1)

import triton

print(triton.testing.do_bench(lambda: foo(t[:-1])))
torch._dynamo.reset()
print(triton.testing.do_bench(lambda: foo(t[1:])))
```
gives
```
0.04063070610165596
0.07613472988113162
```
So almost twice as slow for non-aligned tensors. Tensors changing alignment is a relatively rare case.

In the future, we could considering a multi-kernel approach, or codegening a triton kernel that does most of the loads with aligned instructions, and a prologue/epilogue of un-alignment. But, it's yet to be seen this is a huge issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154442
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
2025-05-29 13:36:48 +00:00
3c74a72ea0 Keep XPU compatible with toolchain 2025.2 (#154359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154359
Approved by: https://github.com/EikanWang, https://github.com/cyyever
2025-05-29 11:12:07 +00:00
cd9ff41282 check fallback_value first. (#154493)
This is just a refactor, not a fix for any issue.
we do check fallback_value first  and early exit instead of checking it not set over and over.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154493
Approved by: https://github.com/bobrenjc93
2025-05-29 09:06:43 +00:00
447b481c79 [AOTI] Save data sizes to constants_info (#154534)
Differential Revision: D75223179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154534
Approved by: https://github.com/muchulee8
2025-05-29 06:39:13 +00:00
9c7ed3e46e [debug_printer][BE] Fix float8 type printing for min/max value printing (#154466)
Summary:
ATT

GH Issue: https://github.com/pytorch/pytorch/issues/149008

**Previous:**
Failed to use debug printing for float8 types due to the limitation of "min_all_cuda" implementation from aten native:

 4b39832412/aten/src/ATen/native/cuda/ReduceMinValuesKernel.cu (L51)

Error:

Min value: Error: "min_all_cuda" not implemented for 'Float8_e4m3fn'

**Now:**
Example output paste: P1824621233
Unblocked float8 type tensor debug printing. Suggest to print the whole value if numel <= threshold.

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_C
OMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_float8_dtype_cuda
```

```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_fp8_cuda 2>&1 | tee fp8_example_printing.txt
```

Differential Revision: D74847967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154466
Approved by: https://github.com/jingsh, https://github.com/henrylhtsang
2025-05-29 05:48:02 +00:00
07343efc15 [cutlass backend] small refactor to flatten the ops to avoid nested for loops (#154576)
Differential Revision: [D75565429](https://our.internmc.facebook.com/intern/diff/D75565429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154576
Approved by: https://github.com/ColinPeppler
2025-05-29 04:42:58 +00:00
b394c6e89c [Inductor][CPP] Add block sparse for FlexAttention CPU (#147196)
## Overview
This PR is to optimize FlexAttention CPP template with block sparse.
Block sparse is natively supported in FlexAttention block mask structures, thus following logic of the kv blocks from `kv_indice ` and `full_kv_indice ` is the strightforward way to add this optimization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147196
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel
2025-05-29 02:57:02 +00:00
c0864bb389 Add a (t * 0) pattern (#153161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153161
Approved by: https://github.com/danielvegamyhre
2025-05-29 02:19:36 +00:00
316e7a9293 [BE][Ez]: Denote common types as TypeAlias (#154527)
Denotes common_types as TypeAlias. This triggered a Ruff rule since we named our TypeAlias off standards so I added a file wide ruff suppression
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154527
Approved by: https://github.com/benjaminglass1, https://github.com/aorenste
2025-05-29 02:00:13 +00:00
2d932a2e01 [ROCm] Fix 3D tensor perf degradation with NHWC format (#154522)
Co-author: @doru1004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154522
Approved by: https://github.com/jeffdaily
2025-05-29 01:33:49 +00:00
cyy
4613081b72 Use 3.27 as the minimum CMake version (#153153)
Update the minimum CMake version to 3.27 because of it provides more CUDA targets such as `CUDA::nvperf_host` so that it is possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It's also possible to facilitate future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-29 00:52:44 +00:00
946a4c2bdc BE: Type previously untyped decorators (#154515)
Summary: Cloned #153726 from Skylion007 and fixed internal typing issues.

Test Plan: Unit tests pass

Differential Revision: D75477355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154515
Approved by: https://github.com/Skylion007
2025-05-29 00:36:34 +00:00
ba0a91b3ea [4/n][Optimus][Auto-AC] Expose the config to skip the dynamo gaurds to avoid recompile (#154152)
Summary:
context: https://fb.workplace.com/groups/1075192433118967/permalink/1673720956599442/

Thanks Microve for raising the existing dynamo skip API in D75196435

The dynamic shape triggers recompilation, introducing compilation time increase, we expose config that users can skip the dynamo guards to avoid the recompile. Note that it may quantize unnessarily nodes, which can impact NE, QPS and memory saving,  needs verification.

Differential Revision: D75248430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154152
Approved by: https://github.com/bobrenjc93
2025-05-29 00:35:37 +00:00
22a1b3b5d0 use 4 elements per thread in no-cast elementwise kernel (#154558)
Reduce elems per thread to 4 in vectorized function also (only for unaligned inputs where there's no vectorization anyway). This slightly reduces binary size (by 4MB)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154558
Approved by: https://github.com/malfet
2025-05-29 00:32:44 +00:00
40abb2b403 Fix deprecated amp APIs in docs (#154553)
Update usage of deprecated amp APIs.

Fixes https://github.com/pytorch/tutorials/issues/3331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154553
Approved by: https://github.com/Skylion007
2025-05-29 00:05:59 +00:00
2b3ac17aa2 [Cutlass] Remove spammy log for gemm extensions (#154548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154548
Approved by: https://github.com/henrylhtsang
2025-05-28 23:55:36 +00:00
81b7c96697 [dynamo, nested graph breaks] add skip_frame debugging function (#153773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153773
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510, #153772
2025-05-28 23:29:37 +00:00
6cda280483 [dynamo, nested graph breaks] remove block stack graph break in output_graph (#153772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153772
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510
2025-05-28 23:29:37 +00:00
bbd45f1f1f [dynamo, nested graph breaks] refactor codegen to minimize NULL codegen'ing (#153510)
Stop codegening NULLs that we need to pop later. Some output_graph.py changes to prepare for nested graph break support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153510
Approved by: https://github.com/jansel
ghstack dependencies: #151056
2025-05-28 23:29:37 +00:00
0f0d5749a0 [dynamo, nested graph breaks] small fixes to resume function generation (#151056)
Old: ~pack resume function stack + locals into a list: we need to be able to pass frame stack+locals in lists to hand off to nested functions in the future, so we implement this part first.~

We are no longer doing this right now since GraphModule/guard variable naming gets messed up. Going forward, our approach will be to keep the top frame unpacked, but pack the rest of the contents of other frames in a list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151056
Approved by: https://github.com/jansel
2025-05-28 23:29:37 +00:00
65b1aedd09 [Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)
Prepares for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:

1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions

As far as I can tell, this PR should be functionally neutral. One argument was removed from a `cpp_wrapper` public API, but that argument was unused, and only had a single callsite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
2025-05-28 23:25:17 +00:00
3e05a48927 Fix clamp type promotion in inductor decomposition (#154471)
Summary: as title, the clamp type promotion should take min/max arg into consideration as well.

Test Plan:
```
buck run fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_clamp_decomposition_cpu
python test/inductor/test_torchinductor.py -k test_clamp -v
```

Differential Revision: D75490124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154471
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2025-05-28 23:24:25 +00:00
d865b784e4 Support unbacked whitelist (#154295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154295
Approved by: https://github.com/angelayi
2025-05-28 23:01:22 +00:00
4124 changed files with 110513 additions and 48544 deletions

View File

@ -3,10 +3,8 @@ set -eux -o pipefail
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0"
if [[ "$GPU_ARCH_VERSION" == *"12.9"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;12.0"
fi
SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
@ -27,6 +25,7 @@ if [ "$DESIRED_CUDA" = "cpu" ]; then
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
else
echo "BASE_CUDA_VERSION is set to: $DESIRED_CUDA"
export USE_SYSTEM_NCCL=1
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
fi

View File

@ -88,7 +88,7 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnvToolsExt.so.1",
"/usr/local/cuda/lib64/libnccl.so.2",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
@ -108,9 +108,9 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"/usr/local/lib/libnvpl_blas_core.so.0",
]
if "128" in desired_cuda:
if "129" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.9",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]

View File

@ -1,4 +1,4 @@
ARG CUDA_VERSION=12.4
ARG CUDA_VERSION=12.6
ARG BASE_TARGET=cuda${CUDA_VERSION}
ARG ROCM_IMAGE=rocm/dev-almalinux-8:6.3-complete
FROM amd64/almalinux:8.10-20250519 as base
@ -52,10 +52,6 @@ ENV CUDA_VERSION=${CUDA_VERSION}
# Make things in our path by default
ENV PATH=/usr/local/cuda-${CUDA_VERSION}/bin:$PATH
FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
ENV DESIRED_CUDA=11.8
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
ENV DESIRED_CUDA=12.6
@ -64,6 +60,10 @@ FROM cuda as cuda12.8
RUN bash ./install_cuda.sh 12.8
ENV DESIRED_CUDA=12.8
FROM cuda as cuda12.9
RUN bash ./install_cuda.sh 12.9
ENV DESIRED_CUDA=12.9
FROM ${ROCM_IMAGE} as rocm
ENV PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
ADD ./common/install_mkl.sh install_mkl.sh
@ -78,7 +78,8 @@ RUN bash ./install_mnist.sh
FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.6 /usr/local/cuda-12.6 /usr/local/cuda-12.6
COPY --from=cuda12.4 /usr/local/cuda-12.8 /usr/local/cuda-12.8
COPY --from=cuda12.8 /usr/local/cuda-12.8 /usr/local/cuda-12.8
COPY --from=cuda12.9 /usr/local/cuda-12.9 /usr/local/cuda-12.9
# Final step
FROM ${BASE_TARGET} as final

View File

@ -50,30 +50,21 @@ if [[ "$image" == *xla* ]]; then
exit 0
fi
if [[ "$image" == *-focal* ]]; then
UBUNTU_VERSION=20.04
elif [[ "$image" == *-jammy* ]]; then
if [[ "$image" == *-jammy* ]]; then
UBUNTU_VERSION=22.04
elif [[ "$image" == *ubuntu* ]]; then
extract_version_from_image_name ubuntu UBUNTU_VERSION
elif [[ "$image" == *centos* ]]; then
extract_version_from_image_name centos CENTOS_VERSION
fi
if [ -n "${UBUNTU_VERSION}" ]; then
OS="ubuntu"
elif [ -n "${CENTOS_VERSION}" ]; then
OS="centos"
else
echo "Unable to derive operating system base..."
exit 1
fi
DOCKERFILE="${OS}/Dockerfile"
# When using ubuntu - 22.04, start from Ubuntu docker image, instead of nvidia/cuda docker image.
if [[ "$image" == *cuda* && "$UBUNTU_VERSION" != "22.04" ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *rocm* ]]; then
if [[ "$image" == *rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile"
elif [[ "$image" == *xpu* ]]; then
DOCKERFILE="${OS}-xpu/Dockerfile"
@ -98,8 +89,8 @@ tag=$(echo $image | awk -F':' '{print $2}')
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$tag" in
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11)
CUDA_VERSION=12.6.3
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
@ -110,7 +101,7 @@ case "$tag" in
TRITON=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -121,7 +112,31 @@ case "$tag" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9)
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
@ -168,8 +183,8 @@ case "$tag" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -179,25 +194,25 @@ case "$tag" in
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-focal-py3-clang10-onnx)
pytorch-linux-jammy-py3-clang12-onnx)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
CLANG_VERSION=12
VISION=yes
ONNX=yes
;;
pytorch-linux-focal-py3.9-clang10)
pytorch-linux-jammy-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
CLANG_VERSION=12
VISION=yes
TRITON=yes
;;
pytorch-linux-focal-py3.11-clang10)
pytorch-linux-jammy-py3.11-clang12)
ANACONDA_PYTHON_VERSION=3.11
CLANG_VERSION=10
CLANG_VERSION=12
VISION=yes
TRITON=yes
;;
pytorch-linux-focal-py3.9-gcc9)
pytorch-linux-jammy-py3.9-gcc9)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=9
VISION=yes
@ -252,9 +267,9 @@ case "$tag" in
DOCS=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-clang12)
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
CLANG_VERSION=12
VISION=yes
@ -303,15 +318,15 @@ case "$tag" in
GCC_VERSION=11
TRITON_CPU=yes
;;
pytorch-linux-focal-linter)
pytorch-linux-jammy-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
# would be to upgrade mypy to 1.0.0 with Python 3.11
PYTHON_VERSION=3.9
;;
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter)
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter)
PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CUDA_VERSION=12.8.1
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11)
ANACONDA_PYTHON_VERSION=3.10
@ -370,14 +385,6 @@ esac
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
#when using cudnn version 8 install it separately from cuda
if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
if [[ ${CUDNN_VERSION} == 9 ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
fi
fi
no_cache_flag=""
progress_flag=""
# Do not use cache and progress=plain when in CI
@ -394,7 +401,6 @@ docker build \
--build-arg "LLVMDEV=${LLVMDEV:-}" \
--build-arg "VISION=${VISION:-}" \
--build-arg "UBUNTU_VERSION=${UBUNTU_VERSION}" \
--build-arg "CENTOS_VERSION=${CENTOS_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}" \
--build-arg "GLIBC_VERSION=${GLIBC_VERSION}" \
--build-arg "CLANG_VERSION=${CLANG_VERSION}" \

View File

@ -39,6 +39,7 @@ RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt

View File

@ -1 +1 @@
b173722085b3f555d6ba4533d6bbaddfd7c71144
56392aa978594cc155fa8af48cd949f5b5f1823a

View File

@ -1 +1 @@
v2.26.5-1
v2.27.3-1

View File

@ -1 +1 @@
b0e26b7359c147b8aa0af686c20510fb9b15990a
ae324eeac8e102a2b40370e341460f3791353398

View File

@ -1 +1 @@
c8757738a7418249896224430ce84888e8ecdd79
ae848267bebc65c6181e8cc5e64a6357d2679260

View File

@ -30,18 +30,6 @@ install_ubuntu() {
maybe_libomp_dev=""
fi
# HACK: UCC testing relies on libnccl library from NVIDIA repo, and version 2.16 crashes
# See https://github.com/pytorch/pytorch/pull/105260#issuecomment-1673399729
# TODO: Eliminate this hack, we should not relay on apt-get installation
# See https://github.com/pytorch/pytorch/issues/144768
if [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "11.8"* ]]; then
maybe_libnccl_dev="libnccl2=2.15.5-1+cuda11.8 libnccl-dev=2.15.5-1+cuda11.8 --allow-downgrades --allow-change-held-packages"
elif [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "12.4"* ]]; then
maybe_libnccl_dev="libnccl2=2.26.2-1+cuda12.4 libnccl-dev=2.26.2-1+cuda12.4 --allow-downgrades --allow-change-held-packages"
else
maybe_libnccl_dev=""
fi
# Install common dependencies
apt-get update
# TODO: Some of these may not be necessary
@ -70,7 +58,6 @@ install_ubuntu() {
libasound2-dev \
libsndfile-dev \
${maybe_libomp_dev} \
${maybe_libnccl_dev} \
software-properties-common \
wget \
sudo \

View File

@ -6,7 +6,7 @@ set -ex
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.anaconda.com/miniconda"
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]] || [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download" # @lint-ignore
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
fi
@ -64,6 +64,11 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# which is provided in libstdcxx 12 and up.
conda_install libstdcxx-ng=12.3.0 --update-deps -c conda-forge
# Miniforge installer doesn't install sqlite by default
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
conda_install sqlite
fi
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
conda_install "openblas==0.3.29=*openmp*"

View File

@ -40,37 +40,9 @@ function install_cudnn {
rm -rf tmp_cudnn
}
function install_118 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.4.0"
install_cuda 11.8.0 cuda_11.8.0_520.61.05_linux
install_cudnn 11 $CUDNN_VERSION
CUDA_VERSION=11.8 bash install_nccl.sh
CUDA_VERSION=11.8 bash install_cusparselt.sh
ldconfig
}
function install_124 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.2"
install_cuda 12.4.1 cuda_12.4.1_550.54.15_linux
install_cudnn 12 $CUDNN_VERSION
CUDA_VERSION=12.4 bash install_nccl.sh
CUDA_VERSION=12.4 bash install_cusparselt.sh
ldconfig
}
function install_126 {
CUDNN_VERSION=9.5.1.17
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
CUDNN_VERSION=9.10.2.21
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.7.1"
install_cuda 12.6.3 cuda_12.6.3_560.35.05_linux
install_cudnn 12 $CUDNN_VERSION
@ -82,69 +54,20 @@ function install_126 {
ldconfig
}
function prune_118 {
echo "Pruning CUDA 11.8 and cuDNN"
#####################################################################################
# CUDA 11.8 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-11.8/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-11.8/lib64"
function install_129 {
CUDNN_VERSION=9.10.2.21
echo "Installing CUDA 12.9.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.7.1"
# install CUDA 12.9.1 in the same container
install_cuda 12.9.1 cuda_12.9.1_575.57.08_linux
export GENCODE="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 12 $CUDNN_VERSION
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
CUDA_VERSION=12.9 bash install_nccl.sh
# all CUDA libs except CuDNN and CuBLAS (cudnn and cublas need arch 3.7 included)
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
CUDA_VERSION=12.9 bash install_cusparselt.sh
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 11.8 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-11.8/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2022.3.0 $CUDA_BASE/nsight-systems-2022.4.2/
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
ldconfig
}
function prune_126 {
@ -183,7 +106,7 @@ function prune_126 {
function install_128 {
CUDNN_VERSION=9.8.0.87
echo "Installing CUDA 12.8.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
echo "Installing CUDA 12.8.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.7.1"
# install CUDA 12.8.1 in the same container
install_cuda 12.8.1 cuda_12.8.1_570.124.06_linux
@ -201,13 +124,11 @@ function install_128 {
while test $# -gt 0
do
case "$1" in
11.8) install_118; prune_118
12.6|12.6.*) install_126; prune_126
;;
12.4) install_124; prune_124
12.8|12.8.*) install_128;
;;
12.6) install_126; prune_126
;;
12.8) install_128;
12.9|12.9.*) install_129;
;;
*) echo "bad argument $1"; exit 1
;;

View File

@ -4,12 +4,10 @@ if [[ -n "${CUDNN_VERSION}" ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.8.0.87_cuda12-archive"
if [[ ${CUDA_VERSION:0:4} == "12.9" || ${CUDA_VERSION:0:4} == "12.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.10.2.21_cuda12-archive"
elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.5.1.17_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"
CUDNN_NAME="cudnn-linux-x86_64-9.10.2.21_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"
else

View File

@ -5,25 +5,14 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-8]$ ]]; then
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-9]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.3.2-archive"
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.7.1.0-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
else
echo "Not sure which libcusparselt version to install for this ${CUDA_VERSION}"
fi

View File

@ -8,16 +8,6 @@ retry () {
"$@" || (sleep 10 && "$@") || (sleep 20 && "$@") || (sleep 40 && "$@")
}
# A bunch of custom pip dependencies for ONNX
pip_install \
beartype==0.15.0 \
filelock==3.9.0 \
flatbuffers==2.0 \
mock==5.0.1 \
ninja==1.10.2 \
networkx==2.5 \
numpy==1.24.2
# ONNXRuntime should be installed before installing
# onnx-weekly. Otherwise, onnx-weekly could be
# overwritten by onnx.
@ -29,11 +19,8 @@ pip_install \
transformers==4.36.2
pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnxscript==0.2.6 --no-deps
# required by onnxscript
pip_install ml_dtypes
pip_install onnxscript==0.3.1
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

View File

@ -4,8 +4,7 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.29 --depth 1 --shallow-submodules
git clone https://github.com/OpenMathLib/OpenBLAS.git -b "${OPENBLAS_VERSION:-v0.3.29}" --depth 1 --shallow-submodules
OPENBLAS_BUILD_FLAGS="
NUM_THREADS=128

View File

@ -26,6 +26,11 @@ Pin: release o=repo.radeon.com
Pin-Priority: 600
EOF
# we want the patch version of 6.4 instead
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
ROCM_VERSION="${ROCM_VERSION}.1"
fi
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
echo "deb [arch=amd64] https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
@ -67,19 +72,23 @@ EOF
# ROCm 6.3 had a regression where initializing static code objects had significant overhead
# ROCm 6.4 did not yet fix the regression, also HIP branch names are different
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]] || [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
HIP_BRANCH=rocm-6.3.x
VER_STR=6.3
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.3) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
VER_PATCH=.1
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
HIP_BRANCH=rocm-6.3.x
VER_STR=6.3
fi
# clr build needs CppHeaderParser but can only find it using conda's python
/opt/conda/bin/python -m pip install CppHeaderParser
git clone https://github.com/ROCm/HIP -b $HIP_BRANCH
HIP_COMMON_DIR=$(readlink -f HIP)
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}-statco-hotfix
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}${VER_PATCH}-statco-hotfix
mkdir -p clr/build
pushd clr/build
cmake .. -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR

View File

@ -51,7 +51,12 @@ as_jenkins git clone --recursive ${TRITON_REPO} triton
cd triton
as_jenkins git checkout ${TRITON_PINNED_COMMIT}
as_jenkins git submodule update --init --recursive
cd python
# Old versions of python have setup.py in ./python; newer versions have it in ./
if [ ! -f setup.py ]; then
cd python
fi
pip_install pybind11==2.13.6
# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
@ -93,3 +98,6 @@ fi
if [ -n "${NUMPY_VERSION}" ]; then
pip_install "numpy==${NUMPY_VERSION}"
fi
if [[ "$ANACONDA_PYTHON_VERSION" != 3.9* ]]; then
pip_install helion
fi

View File

@ -54,16 +54,6 @@ COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
COPY ./common/install_cusparselt.sh install_cusparselt.sh
ENV CUDA_HOME /usr/local/cuda
FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
RUN bash ./install_magma.sh 11.8
RUN ln -sf /usr/local/cuda-11.8 /usr/local/cuda
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
RUN bash ./install_magma.sh 12.4
RUN ln -sf /usr/local/cuda-12.4 /usr/local/cuda
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
RUN bash ./install_magma.sh 12.6
@ -74,6 +64,11 @@ RUN bash ./install_cuda.sh 12.8
RUN bash ./install_magma.sh 12.8
RUN ln -sf /usr/local/cuda-12.8 /usr/local/cuda
FROM cuda as cuda12.9
RUN bash ./install_cuda.sh 12.9
RUN bash ./install_magma.sh 12.9
RUN ln -sf /usr/local/cuda-12.9 /usr/local/cuda
FROM cpu as rocm
ARG ROCM_VERSION
ARG PYTORCH_ROCM_ARCH

View File

@ -26,7 +26,7 @@ ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unncessary python versions
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
@ -103,6 +103,7 @@ ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /usr/local/lib/ /usr/local/lib/
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel

View File

@ -2,7 +2,7 @@ FROM quay.io/pypa/manylinux_2_28_aarch64 as base
ARG GCCTOOLSET_VERSION=13
# Language variabes
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
@ -58,12 +58,13 @@ RUN git config --global --add safe.directory "*"
FROM base as openblas
# Install openblas
ARG OPENBLAS_VERSION
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM base as final
# remove unncessary python versions
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6

View File

@ -60,7 +60,7 @@ RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM openssl as final
# remove unncessary python versions
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6

View File

@ -120,11 +120,13 @@ RUN python3 -mpip install cmake==3.28.0
# so just build it from upstream repository.
# h5py is dependency of onnxruntime_training.
# h5py==3.11.0 builds with hdf5-devel 1.10.5 from repository.
# h5py 3.11.0 doesn't build with numpy >= 2.3.0.
# install newest flatbuffers version first:
# for some reason old version is getting pulled in otherwise.
# packaging package is required for onnxruntime wheel build.
RUN pip3 install flatbuffers && \
pip3 install h5py==3.11.0 && \
pip3 install cython 'pkgconfig>=1.5.5' 'setuptools>=77' 'numpy<2.3.0' && \
pip3 install --no-build-isolation h5py==3.11.0 && \
pip3 install packaging && \
git clone https://github.com/microsoft/onnxruntime && \
cd onnxruntime && git checkout v1.21.0 && \

View File

@ -27,6 +27,7 @@ fi
MANY_LINUX_VERSION=${MANY_LINUX_VERSION:-}
DOCKERFILE_SUFFIX=${DOCKERFILE_SUFFIX:-}
OPENBLAS_VERSION=${OPENBLAS_VERSION:-}
case ${image} in
manylinux2_28-builder:cpu)
@ -40,6 +41,7 @@ case ${image} in
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13 --build-arg NINJA_VERSION=1.12.1"
MANY_LINUX_VERSION="2_28_aarch64"
OPENBLAS_VERSION="v0.3.29"
;;
manylinuxcxx11-abi-builder:cpu-cxx11-abi)
TARGET=final
@ -109,6 +111,7 @@ tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "OPENBLAS_VERSION=${OPENBLAS_VERSION}" \
--target "${TARGET}" \
-t "${tmp_tag}" \
$@ \

View File

@ -41,14 +41,11 @@ fbscribelogger==0.1.7
#Pinned versions: 0.1.6
#test that import:
flatbuffers==2.0 ; platform_machine != "s390x"
flatbuffers==24.12.23
#Description: cross platform serialization library
#Pinned versions: 2.0
#Pinned versions: 24.12.23
#test that import:
flatbuffers ; platform_machine == "s390x"
#Description: cross platform serialization library; Newer version is required on s390x for new python version
hypothesis==5.35.1
# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
#Description: advanced library for generating parametrized tests
@ -93,10 +90,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.15.0
mypy==1.16.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.14.0
#Pinned versions: 1.16.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -342,7 +339,7 @@ onnx==1.18.0
#Pinned versions:
#test that import:
onnxscript==0.2.6
onnxscript==0.3.1
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -382,3 +379,10 @@ dataclasses_json==0.6.7
cmake==4.0.0
#Description: required for building
tlparse==0.3.30
#Description: required for log parsing
cuda-bindings>=12.0,<13.0
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
#test that import: test_cuda.py

View File

@ -1 +1 @@
3.3.1
3.4.0

View File

@ -0,0 +1 @@
3.4.0

View File

@ -1,170 +0,0 @@
ARG UBUNTU_VERSION
ARG CUDA_VERSION
ARG IMAGE_NAME
FROM ${IMAGE_NAME} as base
ARG UBUNTU_VERSION
ARG CUDA_VERSION
ENV DEBIAN_FRONTEND noninteractive
# Install common dependencies (so that this step can be cached separately)
COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install katex
ARG KATEX
COPY ./common/install_docs_reqs.sh install_docs_reqs.sh
RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install gcc
ARG GCC_VERSION
COPY ./common/install_gcc.sh install_gcc.sh
RUN bash ./install_gcc.sh && rm install_gcc.sh
# Install clang
ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# (optional) Install vision packages like OpenCV
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install UCC
ARG UCX_COMMIT
ARG UCC_COMMIT
ENV UCX_COMMIT $UCX_COMMIT
ENV UCC_COMMIT $UCC_COMMIT
ENV UCX_HOME /usr
ENV UCC_HOME /usr
ADD ./common/install_ucc.sh install_ucc.sh
RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi
RUN rm install_ucc.sh
COPY ./common/install_openssl.sh install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
RUN bash ./install_openssl.sh
ENV OPENSSL_DIR /opt/openssl
ARG INDUCTOR_BENCHMARKS
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
ARG TRITON
FROM base as triton-builder
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN bash ./install_triton.sh
FROM base as final
COPY --from=triton-builder /opt/triton /opt/triton
RUN if [ -n "${TRITON}" ]; then pip install /opt/triton/*.whl; chown -R jenkins:jenkins /opt/conda; fi
RUN rm -rf /opt/triton
ARG HALIDE
# Build and install halide
COPY ./common/install_halide.sh install_halide.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/halide.txt halide.txt
RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi
RUN rm install_halide.sh common_utils.sh halide.txt
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
# See https://github.com/pytorch/pytorch/issues/82174
# TODO(sdym@fb.com):
# check if this is needed after full off Xenial migration
ENV CARGO_NET_GIT_FETCH_WITH_CLI true
RUN bash ./install_cache.sh && rm install_cache.sh
ENV CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache
# Add jni.h for java host build
COPY ./common/install_jni.sh install_jni.sh
COPY ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
# Install Open MPI for CUDA
COPY ./common/install_openmpi.sh install_openmpi.sh
RUN if [ -n "${CUDA_VERSION}" ]; then bash install_openmpi.sh; fi
RUN rm install_openmpi.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# AWS specific CUDA build guidance
ENV TORCH_CUDA_ARCH_LIST Maxwell
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
ENV CUDA_PATH /usr/local/cuda
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
# Install CUDNN
ARG CUDNN_VERSION
ARG CUDA_VERSION
COPY ./common/install_cudnn.sh install_cudnn.sh
RUN if [ -n "${CUDNN_VERSION}" ]; then bash install_cudnn.sh; fi
RUN rm install_cudnn.sh
# Install CUSPARSELT
ARG CUDA_VERSION
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash install_cusparselt.sh
RUN rm install_cusparselt.sh
# Install NCCL
ARG CUDA_VERSION
COPY ./common/install_nccl.sh install_nccl.sh
COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
RUN bash install_nccl.sh
RUN rm install_nccl.sh /ci_commit_pins/nccl-cu*
ENV USE_SYSTEM_NCCL=1
ENV NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
ENV NCCL_LIB_DIR="/usr/local/cuda/lib64/"
# Install CUDSS
ARG CUDA_VERSION
COPY ./common/install_cudss.sh install_cudss.sh
RUN bash install_cudss.sh
RUN rm install_cudss.sh
# Delete /usr/local/cuda-11.X/cuda-11.X symlinks
RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi
RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi
RUN if [ -h /usr/local/cuda-12.1/cuda-12.1 ]; then rm /usr/local/cuda-12.1/cuda-12.1; fi
RUN if [ -h /usr/local/cuda-12.4/cuda-12.4 ]; then rm /usr/local/cuda-12.4/cuda-12.4; fi
USER jenkins
CMD ["bash"]

View File

@ -25,6 +25,7 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt

View File

@ -72,7 +72,7 @@ ARG TRITON
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-xpu.txt triton-xpu.txt
COPY triton_version.txt triton_version.txt
COPY triton_xpu_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-xpu.txt triton_version.txt

View File

@ -1,7 +1,7 @@
SHELL=/usr/bin/env bash
DOCKER_CMD ?= docker
DESIRED_CUDA ?= 11.8
DESIRED_CUDA ?= 12.8
DESIRED_CUDA_SHORT = $(subst .,,$(DESIRED_CUDA))
PACKAGE_NAME = magma-cuda
CUDA_ARCH_LIST ?= -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90
@ -16,15 +16,21 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
magma/build_magma.sh
.PHONY: all
all: magma-cuda129
all: magma-cuda128
all: magma-cuda126
all: magma-cuda118
.PHONY:
clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-cuda129
magma-cuda129: DESIRED_CUDA := 12.9
magma-cuda129: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
magma-cuda129:
$(DOCKER_RUN)
.PHONY: magma-cuda128
magma-cuda128: DESIRED_CUDA := 12.8
magma-cuda128: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
@ -35,9 +41,3 @@ magma-cuda128:
magma-cuda126: DESIRED_CUDA := 12.6
magma-cuda126:
$(DOCKER_RUN)
.PHONY: magma-cuda118
magma-cuda118: DESIRED_CUDA := 11.8
magma-cuda118: CUDA_ARCH_LIST += -gencode arch=compute_37,code=sm_37
magma-cuda118:
$(DOCKER_RUN)

View File

@ -31,7 +31,6 @@ elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
# Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968
# shellcheck disable=SC2046
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
else
@ -98,6 +97,7 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
exit 1
fi
pushd "$PYTORCH_ROOT"
retry pip install -q cmake
python setup.py clean
retry pip install -qr requirements.txt
case ${DESIRED_PYTHON} in
@ -151,7 +151,7 @@ if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR --cmake
CMAKE_FRESH=1 python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
else
time CMAKE_ARGS=${CMAKE_ARGS[@]} \

View File

@ -15,6 +15,9 @@ export INSTALL_TEST=0 # dont install test binaries into site-packages
export USE_CUPTI_SO=0
export USE_CUSPARSELT=${USE_CUSPARSELT:-1} # Enable if not disabled by libtorch build
export USE_CUFILE=${USE_CUFILE:-1}
export USE_SYSTEM_NCCL=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
@ -48,20 +51,23 @@ else
fi
cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
case ${CUDA_VERSION} in
#removing sm_50-sm_60 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
#however we would like to keep sm_70 architecture see: https://github.com/pytorch/pytorch/issues/157517
12.8)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8 and will be removed in future releases
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;9.0;10.0;12.0"
;;
12.9)
TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX"
# WAR to resolve the ld error in libtorch build with CUDA 12.9
if [[ "$PACKAGE_TYPE" == "libtorch" ]]; then
TORCH_CUDA_ARCH_LIST="7.5;8.0;9.0;10.0;12.0+PTX"
fi
;;
12.6)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
11.8)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};3.7;9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6;9.0"
;;
*)
echo "unknown cuda version $CUDA_VERSION"
@ -104,12 +110,11 @@ DEPS_SONAME=(
)
# CUDA_VERSION 12.6, 12.8
# CUDA_VERSION 12.6, 12.8, 12.9
if [[ $CUDA_VERSION == 12* ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
@ -125,7 +130,6 @@ if [[ $CUDA_VERSION == 12* ]]; then
"/usr/local/cuda/lib64/libcublasLt.so.12"
"/usr/local/cuda/lib64/libcusparseLt.so.0"
"/usr/local/cuda/lib64/libcudart.so.12"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/lib64/libnvrtc-builtins.so"
"/usr/local/cuda/lib64/libcufile.so.0"
@ -144,7 +148,6 @@ if [[ $CUDA_VERSION == 12* ]]; then
"libcublasLt.so.12"
"libcusparseLt.so.0"
"libcudart.so.12"
"libnvToolsExt.so.1"
"libnvrtc.so.12"
"libnvrtc-builtins.so"
"libcufile.so.0"
@ -162,6 +165,7 @@ if [[ $CUDA_VERSION == 12* ]]; then
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../nvidia/cusparselt/lib'
'$ORIGIN/../../cusparselt/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
@ -172,94 +176,9 @@ if [[ $CUDA_VERSION == 12* ]]; then
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
export USE_STATIC_NCCL=0
export USE_SYSTEM_NCCL=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUPTI_SO=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
fi
elif [[ $CUDA_VERSION == "11.8" ]]; then
export USE_STATIC_CUDNN=0
# Turn USE_CUFILE off for CUDA 11.8 since nvidia-cufile-cu11 and 1.9.0.20 are
# not available in PYPI
export USE_CUFILE=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
# Bundle ptxas into the wheel, see https://github.com/pytorch/pytorch/pull/119750
export BUILD_BUNDLE_PTXAS=1
# CUDA 11.8 have to ship the libcusparseLt.so.0 with the binary
# since nvidia-cusparselt-cu11 is not available in PYPI
if [[ $USE_CUSPARSELT == "1" ]]; then
DEPS_SONAME+=(
"libcusparseLt.so.0"
)
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcusparseLt.so.0"
)
fi
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcudnn_adv.so.9"
"/usr/local/cuda/lib64/libcudnn_cnn.so.9"
"/usr/local/cuda/lib64/libcudnn_graph.so.9"
"/usr/local/cuda/lib64/libcudnn_ops.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9"
"/usr/local/cuda/lib64/libcudnn.so.9"
"/usr/local/cuda/lib64/libcublas.so.11"
"/usr/local/cuda/lib64/libcublasLt.so.11"
"/usr/local/cuda/lib64/libcudart.so.11.0"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.11.2" # this is not a mistake, it links to more specific cuda version
"/usr/local/cuda/lib64/libnvrtc-builtins.so.11.8"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
"libcudnn_cnn.so.9"
"libcudnn_graph.so.9"
"libcudnn_ops.so.9"
"libcudnn_engines_runtime_compiled.so.9"
"libcudnn_engines_precompiled.so.9"
"libcudnn_heuristic.so.9"
"libcudnn.so.9"
"libcublas.so.11"
"libcublasLt.so.11"
"libcudart.so.11.0"
"libnvToolsExt.so.1"
"libnvrtc.so.11.2"
"libnvrtc-builtins.so.11.8"
)
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cudnn/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
)
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
export USE_STATIC_NCCL=0
export USE_SYSTEM_NCCL=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUPTI_SO=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
fi
else
echo "Unknown cuda version $CUDA_VERSION"

View File

@ -92,6 +92,7 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
exit 1
fi
pushd "$PYTORCH_ROOT"
retry pip install -q cmake
python setup.py clean
retry pip install -qr requirements.txt
retry pip install -q numpy==2.0.1

View File

@ -95,6 +95,7 @@ ROCM_SO_FILES=(
"libroctracer64.so"
"libroctx64.so"
"libhipblaslt.so"
"libhipsparselt.so"
"libhiprtc.so"
)
@ -186,20 +187,28 @@ do
OS_SO_FILES[${#OS_SO_FILES[@]}]=$file_name # Append lib to array
done
ARCH=$(echo $PYTORCH_ROCM_ARCH | sed 's/;/|/g') # Replace ; separated arch list to bar for grep
# rocBLAS library files
ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library
ROCBLAS_LIB_DST=lib/rocblas/library
ARCH=$(echo $PYTORCH_ROCM_ARCH | sed 's/;/|/g') # Replace ; seperated arch list to bar for grep
ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)
OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)
ROCBLAS_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)
ROCBLAS_ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)
ROCBLAS_OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)
ROCBLAS_LIB_FILES=($ROCBLAS_ARCH_SPECIFIC_FILES $OTHER_FILES)
# hipblaslt library files
HIPBLASLT_LIB_SRC=$ROCM_HOME/lib/hipblaslt/library
HIPBLASLT_LIB_DST=lib/hipblaslt/library
ARCH_SPECIFIC_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -E $ARCH)
OTHER_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -v gfx)
HIPBLASLT_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)
HIPBLASLT_ARCH_SPECIFIC_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -E $ARCH)
HIPBLASLT_OTHER_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -v gfx)
HIPBLASLT_LIB_FILES=($HIPBLASLT_ARCH_SPECIFIC_FILES $HIPBLASLT_OTHER_FILES)
# hipsparselt library files
HIPSPARSELT_LIB_SRC=$ROCM_HOME/lib/hipsparselt/library
HIPSPARSELT_LIB_DST=lib/hipsparselt/library
HIPSPARSELT_ARCH_SPECIFIC_FILES=$(ls $HIPSPARSELT_LIB_SRC | grep -E $ARCH)
#HIPSPARSELT_OTHER_FILES=$(ls $HIPSPARSELT_LIB_SRC | grep -v gfx)
HIPSPARSELT_LIB_FILES=($HIPSPARSELT_ARCH_SPECIFIC_FILES $HIPSPARSELT_OTHER_FILES)
# ROCm library files
ROCM_SO_PATHS=()
@ -234,12 +243,14 @@ DEPS_SONAME=(
DEPS_AUX_SRCLIST=(
"${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_SRC/}"
"${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_SRC/}"
"${HIPSPARSELT_LIB_FILES[@]/#/$HIPSPARSELT_LIB_SRC/}"
"/opt/amdgpu/share/libdrm/amdgpu.ids"
)
DEPS_AUX_DSTLIST=(
"${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_DST/}"
"${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_DST/}"
"${HIPSPARSELT_LIB_FILES[@]/#/$HIPSPARSELT_LIB_DST/}"
"share/libdrm/amdgpu.ids"
)

View File

@ -27,6 +27,12 @@ cmake --version
echo "Environment variables:"
env
# The sccache wrapped version of nvcc gets put in /opt/cache/lib in docker since
# there are some issues if it is always wrapped, so we need to add it to PATH
# during CI builds.
# https://github.com/pytorch/pytorch/blob/0b6c0898e6c352c8ea93daec854e704b41485375/.ci/docker/common/install_cache.sh#L97
export PATH="/opt/cache/lib:$PATH"
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
# Use jemalloc during compilation to mitigate https://github.com/pytorch/pytorch/issues/116289
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
@ -52,12 +58,6 @@ fi
export USE_LLVM=/opt/llvm
export LLVM_DIR=/opt/llvm/lib/cmake/llvm
if [[ "$BUILD_ENVIRONMENT" == *executorch* ]]; then
# To build test_edge_op_registration
export BUILD_EXECUTORCH=ON
export USE_CUDA=0
fi
if ! which conda; then
# In ROCm CIs, we are doing cross compilation on build machines with
# intel cpu and later run tests on machines with amd cpu.
@ -257,6 +257,7 @@ if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e -o pipefail
get_bazel
python3 tools/optional_submodules.py checkout_eigen
# Leave 1 CPU free and use only up to 80% of memory to reduce the change of crashing
# the runner

View File

@ -313,7 +313,7 @@ if [[ "$(uname)" == 'Linux' && "$PACKAGE_TYPE" == 'manywheel' ]]; then
# Please see issue for reference: https://github.com/pytorch/pytorch/issues/152426
if [[ "$(uname -m)" == "s390x" ]]; then
cxx_abi="19"
elif [[ "$DESIRED_CUDA" != 'cu118' && "$DESIRED_CUDA" != 'xpu' && "$DESIRED_CUDA" != 'rocm'* ]]; then
elif [[ "$DESIRED_CUDA" != 'xpu' && "$DESIRED_CUDA" != 'rocm'* ]]; then
cxx_abi="18"
else
cxx_abi="16"

View File

@ -15,6 +15,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *rocm* ]]; then
export PYTORCH_TEST_WITH_ROCM=1
fi
# TODO: Renable libtorch testing for MacOS, see https://github.com/pytorch/pytorch/issues/62598
# TODO: Reenable libtorch testing for MacOS, see https://github.com/pytorch/pytorch/issues/62598
# shellcheck disable=SC2034
BUILD_TEST_LIBTORCH=0

View File

@ -159,11 +159,6 @@ function install_torchvision() {
fi
}
function install_tlparse() {
pip_install --user "tlparse==0.3.30"
PATH="$(python -m site --user-base)/bin:$PATH"
}
function install_torchrec_and_fbgemm() {
local torchrec_commit
torchrec_commit=$(get_pinned_commit torchrec)
@ -202,7 +197,7 @@ function install_torchrec_and_fbgemm() {
function clone_pytorch_xla() {
if [[ ! -d ./xla ]]; then
git clone --recursive --quiet https://github.com/pytorch/xla.git
git clone --recursive -b r2.8 https://github.com/pytorch/xla.git
pushd xla
# pin the xla hash so that we don't get broken by changes to xla
git checkout "$(cat ../.github/ci_commit_pins/xla.txt)"

View File

@ -5,11 +5,6 @@ set -x
# shellcheck source=./macos-common.sh
source "$(dirname "${BASH_SOURCE[0]}")/macos-common.sh"
if [[ -n "$CONDA_ENV" ]]; then
# Use binaries under conda environment
export PATH="$CONDA_ENV/bin":$PATH
fi
# Test that OpenMP is enabled
pushd test
if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then
@ -233,53 +228,52 @@ test_torchbench_smoketest() {
mkdir -p "$TEST_REPORTS_DIR"
local device=mps
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor)
local hf_models=(GoogleFnet YituTechConvBert Speech2Text2ForCausalLM)
local dtypes=(undefined float16 bfloat16 notset)
local dtype=${dtypes[$1]}
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor timm_resnet timm_vovnet vgg16)
for backend in eager inductor; do
for dtype in notset float16 bfloat16; do
echo "Launching torchbench inference performance run for backend ${backend} and dtype ${dtype}"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
dtype_arg="--float32"
fi
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
for model in "${models[@]}"; do
echo "Launching torchbench inference performance run for backend ${backend} and dtype ${dtype}"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
dtype_arg="--float32"
fi
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv" || true
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv" || true
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--accuracy --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_accuracy.csv" || true
fi
done
for model in "${hf_models[@]}"; do
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_performance.csv" || true
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--accuracy --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_accuracy.csv" || true
fi
done
--accuracy --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_accuracy.csv" || true
fi
done
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--performance --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_performance.csv" || true
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--accuracy --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_accuracy.csv" || true
fi
for dtype in notset amp; do
echo "Launching torchbench training performance run for backend ${backend} and dtype ${dtype}"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
local dtype_arg="--${dtype}"
if [ "$dtype" == notset ]; then
if [ "$dtype" == notset ]; then
for dtype_ in notset amp; do
echo "Launching torchbench training performance run for backend ${backend} and dtype ${dtype_}"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype_}_training_${device}_performance.csv"
local dtype_arg="--${dtype_}"
if [ "$dtype_" == notset ]; then
dtype_arg="--float32"
fi
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv" || true
fi
for model in "${models[@]}"; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype_}_training_${device}_performance.csv" || true
done
done
done
fi
done
@ -318,8 +312,6 @@ test_timm_perf() {
echo "timm benchmark on mps device completed"
}
install_tlparse
if [[ $TEST_CONFIG == *"perf_all"* ]]; then
test_torchbench_perf
test_hf_perf
@ -331,7 +323,7 @@ elif [[ $TEST_CONFIG == *"perf_hf"* ]]; then
elif [[ $TEST_CONFIG == *"perf_timm"* ]]; then
test_timm_perf
elif [[ $TEST_CONFIG == *"perf_smoketest"* ]]; then
test_torchbench_smoketest
test_torchbench_smoketest "${SHARD_NUMBER}"
elif [[ $TEST_CONFIG == *"mps"* ]]; then
test_python_mps
elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then

View File

@ -93,7 +93,7 @@ def check_lib_symbols_for_abi_correctness(lib: str) -> None:
f"Found pre-cxx11 symbols, but there shouldn't be any, see: {pre_cxx11_symbols[:100]}"
)
if num_cxx11_symbols < 100:
raise RuntimeError("Didn't find enought cxx11 symbols")
raise RuntimeError("Didn't find enough cxx11 symbols")
def main() -> None:

View File

@ -46,6 +46,9 @@ def get_gomp_thread():
# use the default gomp path of AlmaLinux OS
libgomp_path = "/usr/lib64/libgomp.so.1"
# if it does not exist, try Ubuntu path
if not os.path.exists(libgomp_path):
libgomp_path = f"/usr/lib/{os.uname().machine}-linux-gnu/libgomp.so.1"
os.environ["GOMP_CPU_AFFINITY"] = "0-3"

View File

@ -276,7 +276,7 @@ def smoke_test_cuda(
torch_nccl_version = ".".join(str(v) for v in torch.cuda.nccl.version())
print(f"Torch nccl; version: {torch_nccl_version}")
# Pypi dependencies are installed on linux ony and nccl is availbale only on Linux.
# Pypi dependencies are installed on linux only and nccl is available only on Linux.
if pypi_pkg_check == "enabled" and sys.platform in ["linux", "linux2"]:
compare_pypi_to_torch_versions(
"cudnn", find_pypi_package_version("nvidia-cudnn"), torch_cudnn_version

View File

@ -196,7 +196,7 @@ if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/mpi/latest/env/vars.sh
# Check XPU status before testing
xpu-smi discovery
timeout 30 xpu-smi discovery || true
fi
if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
@ -212,8 +212,6 @@ if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
export VALGRIND=OFF
fi
install_tlparse
# DANGER WILL ROBINSON. The LD_PRELOAD here could cause you problems
# if you're not careful. Check this if you made some changes and the
# ASAN test is not working
@ -226,7 +224,7 @@ if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
export PYTORCH_TEST_WITH_ASAN=1
export PYTORCH_TEST_WITH_UBSAN=1
# TODO: Figure out how to avoid hard-coding these paths
export ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-15/bin/llvm-symbolizer
export ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-18/bin/llvm-symbolizer
export TORCH_USE_RTLD_GLOBAL=1
# NB: We load libtorch.so with RTLD_GLOBAL for UBSAN, unlike our
# default behavior.
@ -324,6 +322,17 @@ test_python_smoke() {
assert_git_not_dirty
}
test_h100_distributed() {
# Distributed tests at H100
time python test/run_test.py --include distributed/_composable/test_composability/test_pp_composability.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
# This test requires multicast support
time python test/run_test.py --include distributed/_composable/fsdp/test_fully_shard_comm.py -k TestFullyShardAllocFromPG $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
# symmetric memory test
time python test/run_test.py --include distributed/test_symmetric_memory.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time python test/run_test.py --include distributed/test_nvshmem.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_lazy_tensor_meta_reference_disabled() {
export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1
echo "Testing lazy tensor operations without meta reference"
@ -352,6 +361,17 @@ test_dynamo_wrapped_shard() {
assert_git_not_dirty
}
test_einops() {
pip install einops==0.6.1
time python test/run_test.py --einops --verbose --upload-artifacts-while-running
pip install einops==0.7.0
time python test/run_test.py --einops --verbose --upload-artifacts-while-running
pip install einops==0.8.1
time python test/run_test.py --einops --verbose --upload-artifacts-while-running
assert_git_not_dirty
}
test_inductor_distributed() {
# Smuggle a few multi-gpu tests here so that we don't have to request another large node
echo "Testing multi_gpu tests in test_torchinductor"
@ -590,7 +610,9 @@ test_perf_for_dashboard() {
local device=cuda
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
if [[ "${TEST_CONFIG}" == *zen_cpu_x86* ]]; then
device=zen_cpu_x86
elif [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
device=cpu_x86
elif [[ "${TEST_CONFIG}" == *cpu_aarch64* ]]; then
device=cpu_aarch64
@ -1131,6 +1153,12 @@ test_custom_backend() {
test_custom_script_ops() {
echo "Testing custom script operators"
if [[ "$BUILD_ENVIRONMENT" == *s390x* ]]; then
echo "Skipping custom script operators until it's fixed"
return 0
fi
CUSTOM_OP_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/custom-op-build"
pushd test/custom_operator
cp -a "$CUSTOM_OP_BUILD" build
@ -1520,7 +1548,7 @@ test_executorch() {
test_linux_aarch64() {
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop test_autograd test_binary_ufuncs test_complex test_spectral_ops \
test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops \
test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops test_cpp_extensions_open_device_registration \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Dynamo tests
@ -1554,7 +1582,8 @@ test_operator_benchmark() {
cd "${TEST_DIR}"/benchmarks/operator_benchmark
$TASKSET python -m benchmark_all_test --device "$1" --tag-filter "$2" \
--output-dir "${TEST_REPORTS_DIR}/operator_benchmark_eager_float32_cpu.csv"
--output-csv "${TEST_REPORTS_DIR}/operator_benchmark_eager_float32_cpu.csv" \
--output-json-for-dashboard "${TEST_REPORTS_DIR}/operator_benchmark_eager_float32_cpu.json" \
pip_install pandas
python check_perf_csv.py \
@ -1639,7 +1668,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
install_torchaudio cuda
fi
install_torchvision
TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install git+https://github.com/pytorch/ao.git
TORCH_CUDA_ARCH_LIST="8.0;8.6" install_torchao
id=$((SHARD_NUMBER-1))
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
@ -1677,6 +1706,8 @@ elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
test_inductor_distributed
fi
fi
elif [[ "${TEST_CONFIG}" == *einops* ]]; then
test_einops
elif [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then
install_torchvision
test_dynamo_wrapped_shard "${SHARD_NUMBER}"
@ -1724,6 +1755,8 @@ elif [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
test_xpu_bin
elif [[ "${TEST_CONFIG}" == smoke ]]; then
test_python_smoke
elif [[ "${TEST_CONFIG}" == h100_distributed ]]; then
test_h100_distributed
else
install_torchvision
install_monkeytype

View File

@ -16,7 +16,7 @@ target_link_libraries(simple-torch-test CUDA::cudart CUDA::cufft CUDA::cusparse
find_library(CUDNN_LIBRARY NAMES cudnn)
target_link_libraries(simple-torch-test ${CUDNN_LIBRARY} )
if(MSVC)
file(GLOB TORCH_DLLS "$ENV{CUDA_PATH}/bin/cudnn64_8.dll" "$ENV{NVTOOLSEXT_PATH}/bin/x64/*.dll")
file(GLOB TORCH_DLLS "$ENV{CUDA_PATH}/bin/cudnn64_8.dll")
message("dlls to copy " ${TORCH_DLLS})
add_custom_command(TARGET simple-torch-test
POST_BUILD

View File

@ -31,7 +31,7 @@ PYLONG_API_CHECK=$?
if [[ $PYLONG_API_CHECK == 0 ]]; then
echo "Usage of PyLong_{From,As}{Unsigned}Long API may lead to overflow errors on Windows"
echo "because \`sizeof(long) == 4\` and \`sizeof(unsigned long) == 4\`."
echo "Please include \"torch/csrc/utils/python_numbers.h\" and use the correspoding APIs instead."
echo "Please include \"torch/csrc/utils/python_numbers.h\" and use the corresponding APIs instead."
echo "PyLong_FromLong -> THPUtils_packInt32 / THPUtils_packInt64"
echo "PyLong_AsLong -> THPUtils_unpackInt (32-bit) / THPUtils_unpackLong (64-bit)"
echo "PyLong_FromUnsignedLong -> THPUtils_packUInt32 / THPUtils_packUInt64"

View File

@ -10,7 +10,7 @@ set PATH=C:\Program Files\CMake\bin;C:\Program Files\7-Zip;C:\ProgramData\chocol
:: able to see what our cl.exe commands are (since you can actually
:: just copy-paste them into a local Windows setup to just rebuild a
:: single file.)
:: log sizes are too long, but leaving this here incase someone wants to use it locally
:: log sizes are too long, but leaving this here in case someone wants to use it locally
:: set CMAKE_VERBOSE_MAKEFILE=1

View File

@ -52,7 +52,7 @@ if __name__ == "__main__":
if os.path.exists(debugger):
command_args = [debugger, "-o", "-c", "~*g; q"] + command_args
command_string = " ".join(command_args)
print("Reruning with traceback enabled")
print("Rerunning with traceback enabled")
print("Command:", command_string)
subprocess.run(command_args, check=False)
sys.exit(e.returncode)

View File

@ -1,59 +0,0 @@
@echo off
set MODULE_NAME=pytorch
IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (
call internal\clone.bat
cd %~dp0
) ELSE (
call internal\clean.bat
)
IF ERRORLEVEL 1 goto :eof
call internal\check_deps.bat
IF ERRORLEVEL 1 goto :eof
REM Check for optional components
set USE_CUDA=
set CMAKE_GENERATOR=Visual Studio 15 2017 Win64
IF "%NVTOOLSEXT_PATH%"=="" (
IF EXIST "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64\nvToolsExt64_1.lib" (
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
) ELSE (
echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing
exit /b 1
)
)
IF "%CUDA_PATH_V118%"=="" (
IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvcc.exe" (
set "CUDA_PATH_V118=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8"
) ELSE (
echo CUDA 11.8 not found, failing
exit /b 1
)
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=3.7+PTX;5.0;6.0;6.1;7.0;7.5;8.0;8.6;9.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90
)
set "CUDA_PATH=%CUDA_PATH_V118%"
set "PATH=%CUDA_PATH_V118%\bin;%PATH%"
:optcheck
call internal\check_opts.bat
IF ERRORLEVEL 1 goto :eof
if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..
call %~dp0\internal\copy.bat
IF ERRORLEVEL 1 goto :eof
call %~dp0\internal\setup.bat
IF ERRORLEVEL 1 goto :eof

View File

@ -1,59 +0,0 @@
@echo off
set MODULE_NAME=pytorch
IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (
call internal\clone.bat
cd %~dp0
) ELSE (
call internal\clean.bat
)
IF ERRORLEVEL 1 goto :eof
call internal\check_deps.bat
IF ERRORLEVEL 1 goto :eof
REM Check for optional components
set USE_CUDA=
set CMAKE_GENERATOR=Visual Studio 15 2017 Win64
IF "%NVTOOLSEXT_PATH%"=="" (
IF EXIST "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64\nvToolsExt64_1.lib" (
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
) ELSE (
echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing
exit /b 1
)
)
IF "%CUDA_PATH_V124%"=="" (
IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\nvcc.exe" (
set "CUDA_PATH_V124=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4"
) ELSE (
echo CUDA 12.4 not found, failing
exit /b 1
)
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=6.1;7.0;7.5;8.0;8.6;9.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90
)
set "CUDA_PATH=%CUDA_PATH_V124%"
set "PATH=%CUDA_PATH_V124%\bin;%PATH%"
:optcheck
call internal\check_opts.bat
IF ERRORLEVEL 1 goto :eof
if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..
call %~dp0\internal\copy.bat
IF ERRORLEVEL 1 goto :eof
call %~dp0\internal\setup.bat
IF ERRORLEVEL 1 goto :eof

View File

@ -18,15 +18,6 @@ REM Check for optional components
set USE_CUDA=
set CMAKE_GENERATOR=Visual Studio 15 2017 Win64
IF "%NVTOOLSEXT_PATH%"=="" (
IF EXIST "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64\nvToolsExt64_1.lib" (
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
) ELSE (
echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing
exit /b 1
)
)
IF "%CUDA_PATH_V126%"=="" (
IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc.exe" (
set "CUDA_PATH_V126=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6"

View File

@ -18,15 +18,6 @@ REM Check for optional components
set USE_CUDA=
set CMAKE_GENERATOR=Visual Studio 15 2017 Win64
IF "%NVTOOLSEXT_PATH%"=="" (
IF EXIST "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64\nvToolsExt64_1.lib" (
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
) ELSE (
echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing
exit /b 1
)
)
IF "%CUDA_PATH_V128%"=="" (
IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin\nvcc.exe" (
set "CUDA_PATH_V128=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8"

View File

@ -0,0 +1,50 @@
@echo off
set MODULE_NAME=pytorch
IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (
call internal\clone.bat
cd %~dp0
) ELSE (
call internal\clean.bat
)
IF ERRORLEVEL 1 goto :eof
call internal\check_deps.bat
IF ERRORLEVEL 1 goto :eof
REM Check for optional components
set USE_CUDA=
set CMAKE_GENERATOR=Visual Studio 15 2017 Win64
IF "%CUDA_PATH_V129%"=="" (
IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin\nvcc.exe" (
set "CUDA_PATH_V129=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9"
) ELSE (
echo CUDA 12.9 not found, failing
exit /b 1
)
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120
)
set "CUDA_PATH=%CUDA_PATH_V129%"
set "PATH=%CUDA_PATH_V129%\bin;%PATH%"
:optcheck
call internal\check_opts.bat
IF ERRORLEVEL 1 goto :eof
if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..
call %~dp0\internal\copy.bat
IF ERRORLEVEL 1 goto :eof
call %~dp0\internal\setup.bat
IF ERRORLEVEL 1 goto :eof

View File

@ -65,7 +65,7 @@ for /F "usebackq delims=" %%i in (`python -c "import sys; print('{0[0]}{0[1]}'.f
if %PYVER% LSS 35 (
echo Warning: PyTorch for Python 2 under Windows is experimental.
echo Python x64 3.5 or up is recommended to compile PyTorch on Windows
echo Maybe you can create a virual environment if you have conda installed:
echo Maybe you can create a virtual environment if you have conda installed:
echo ^> conda create -n test python=3.6 pyyaml numpy
echo ^> activate test
)

View File

@ -9,7 +9,6 @@ copy "%CUDA_PATH%\bin\cudnn*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\nvrtc*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\extras\CUPTI\lib64\cupti64_*.dll*" pytorch\torch\lib
copy "C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64\nvToolsExt64_1.dll*" pytorch\torch\lib
copy "%PYTHON_LIB_PATH%\libiomp*5md.dll" pytorch\torch\lib
:: Should be set in build_pytorch.bat

View File

@ -23,66 +23,13 @@ set CUDNN_LIB_FOLDER="lib\x64"
:: Skip all of this if we already have cuda installed
if exist "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\bin\nvcc.exe" goto set_cuda_env_vars
if %CUDA_VER% EQU 118 goto cuda118
if %CUDA_VER% EQU 124 goto cuda124
if %CUDA_VER% EQU 126 goto cuda126
if %CUDA_VER% EQU 128 goto cuda128
if %CUDA_VER% EQU 129 goto cuda129
echo CUDA %CUDA_VERSION_STR% is not supported
exit /b 1
:cuda118
set CUDA_INSTALL_EXE=cuda_11.8.0_522.06_windows.exe
if not exist "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" (
curl -k -L "https://ossci-windows.s3.amazonaws.com/%CUDA_INSTALL_EXE%" --output "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" & REM @lint-ignore
if errorlevel 1 exit /b 1
set "CUDA_SETUP_FILE=%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"
set "ARGS=cuda_profiler_api_11.8 thrust_11.8 nvcc_11.8 cuobjdump_11.8 nvprune_11.8 nvprof_11.8 cupti_11.8 cublas_11.8 cublas_dev_11.8 cudart_11.8 cufft_11.8 cufft_dev_11.8 curand_11.8 curand_dev_11.8 cusolver_11.8 cusolver_dev_11.8 cusparse_11.8 cusparse_dev_11.8 npp_11.8 npp_dev_11.8 nvrtc_11.8 nvrtc_dev_11.8 nvml_dev_11.8 nvtx_11.8"
)
set CUDNN_FOLDER=cudnn-windows-x86_64-9.5.0.50_cuda11-archive
set CUDNN_LIB_FOLDER="lib"
set "CUDNN_INSTALL_ZIP=%CUDNN_FOLDER%.zip"
if not exist "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" (
curl -k -L "http://s3.amazonaws.com/ossci-windows/%CUDNN_INSTALL_ZIP%" --output "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" & REM @lint-ignore
if errorlevel 1 exit /b 1
set "CUDNN_SETUP_FILE=%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"
)
@REM cuDNN 8.3+ required zlib to be installed on the path
echo Installing ZLIB dlls
curl -k -L "http://s3.amazonaws.com/ossci-windows/zlib123dllx64.zip" --output "%SRC_DIR%\temp_build\zlib123dllx64.zip"
7z x "%SRC_DIR%\temp_build\zlib123dllx64.zip" -o"%SRC_DIR%\temp_build\zlib"
xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"
goto cuda_common
:cuda124
set CUDA_INSTALL_EXE=cuda_12.4.0_551.61_windows.exe
if not exist "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" (
curl -k -L "https://ossci-windows.s3.amazonaws.com/%CUDA_INSTALL_EXE%" --output "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" & REM @lint-ignore
if errorlevel 1 exit /b 1
set "CUDA_SETUP_FILE=%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"
set "ARGS=cuda_profiler_api_12.4 thrust_12.4 nvcc_12.4 cuobjdump_12.4 nvprune_12.4 nvprof_12.4 cupti_12.4 cublas_12.4 cublas_dev_12.4 cudart_12.4 cufft_12.4 cufft_dev_12.4 curand_12.4 curand_dev_12.4 cusolver_12.4 cusolver_dev_12.4 cusparse_12.4 cusparse_dev_12.4 npp_12.4 npp_dev_12.4 nvrtc_12.4 nvrtc_dev_12.4 nvml_dev_12.4 nvjitlink_12.4 nvtx_12.4"
)
set CUDNN_FOLDER=cudnn-windows-x86_64-9.5.0.50_cuda12-archive
set CUDNN_LIB_FOLDER="lib"
set "CUDNN_INSTALL_ZIP=%CUDNN_FOLDER%.zip"
if not exist "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" (
curl -k -L "http://s3.amazonaws.com/ossci-windows/%CUDNN_INSTALL_ZIP%" --output "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" & REM @lint-ignore
if errorlevel 1 exit /b 1
set "CUDNN_SETUP_FILE=%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"
)
@REM cuDNN 8.3+ required zlib to be installed on the path
echo Installing ZLIB dlls
curl -k -L "http://s3.amazonaws.com/ossci-windows/zlib123dllx64.zip" --output "%SRC_DIR%\temp_build\zlib123dllx64.zip"
7z x "%SRC_DIR%\temp_build\zlib123dllx64.zip" -o"%SRC_DIR%\temp_build\zlib"
xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"
goto cuda_common
:cuda126
@ -139,17 +86,39 @@ xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"
goto cuda_common
:cuda129
set CUDA_INSTALL_EXE=cuda_12.9.1_576.57_windows.exe
if not exist "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" (
curl -k -L "https://ossci-windows.s3.amazonaws.com/%CUDA_INSTALL_EXE%" --output "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" & REM @lint-ignore
if errorlevel 1 exit /b 1
set "CUDA_SETUP_FILE=%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"
set "ARGS=cuda_profiler_api_12.9 thrust_12.9 nvcc_12.9 cuobjdump_12.9 nvprune_12.9 nvprof_12.9 cupti_12.9 cublas_12.9 cublas_dev_12.9 cudart_12.9 cufft_12.9 cufft_dev_12.9 curand_12.9 curand_dev_12.9 cusolver_12.9 cusolver_dev_12.9 cusparse_12.9 cusparse_dev_12.9 npp_12.9 npp_dev_12.9 nvrtc_12.9 nvrtc_dev_12.9 nvml_dev_12.9 nvjitlink_12.9 nvtx_12.9"
)
set CUDNN_FOLDER=cudnn-windows-x86_64-9.10.2.21_cuda12-archive
set CUDNN_LIB_FOLDER="lib"
set "CUDNN_INSTALL_ZIP=%CUDNN_FOLDER%.zip"
if not exist "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" (
curl -k -L "http://s3.amazonaws.com/ossci-windows/%CUDNN_INSTALL_ZIP%" --output "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" & REM @lint-ignore
if errorlevel 1 exit /b 1
set "CUDNN_SETUP_FILE=%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"
)
@REM cuDNN 8.3+ required zlib to be installed on the path
echo Installing ZLIB dlls
curl -k -L "http://s3.amazonaws.com/ossci-windows/zlib123dllx64.zip" --output "%SRC_DIR%\temp_build\zlib123dllx64.zip"
7z x "%SRC_DIR%\temp_build\zlib123dllx64.zip" -o"%SRC_DIR%\temp_build\zlib"
xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"
goto cuda_common
:cuda_common
:: NOTE: We only install CUDA if we don't have it installed already.
:: With GHA runners these should be pre-installed as part of our AMI process
:: If you cannot find the CUDA version you want to build for here then please
:: add it @ https://github.com/pytorch/test-infra/tree/main/aws/ami/windows
if not exist "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\bin\nvcc.exe" (
if not exist "%SRC_DIR%\temp_build\NvToolsExt.7z" (
curl -k -L https://ossci-windows.s3.us-east-1.amazonaws.com/builder/NvToolsExt.7z --output "%SRC_DIR%\temp_build\NvToolsExt.7z"
if errorlevel 1 exit /b 1
)
if not exist "%SRC_DIR%\temp_build\gpu_driver_dlls.zip" (
curl -k -L "https://ossci-windows.s3.us-east-1.amazonaws.com/builder/additional_dlls.zip" --output "%SRC_DIR%\temp_build\gpu_driver_dlls.zip"
if errorlevel 1 exit /b 1
@ -176,15 +145,6 @@ if not exist "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_
xcopy /Y "%SRC_DIR%\temp_build\cuda\CUDAVisualStudioIntegration\extras\visual_studio_integration\MSBuildExtensions\*.*" "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations"
)
echo Installing NvToolsExt...
7z x %SRC_DIR%\temp_build\NvToolsExt.7z -o"%SRC_DIR%\temp_build\NvToolsExt"
mkdir "%ProgramFiles%\NVIDIA Corporation\NvToolsExt\bin\x64"
mkdir "%ProgramFiles%\NVIDIA Corporation\NvToolsExt\include"
mkdir "%ProgramFiles%\NVIDIA Corporation\NvToolsExt\lib\x64"
xcopy /Y "%SRC_DIR%\temp_build\NvToolsExt\bin\x64\*.*" "%ProgramFiles%\NVIDIA Corporation\NvToolsExt\bin\x64"
xcopy /Y "%SRC_DIR%\temp_build\NvToolsExt\include\*.*" "%ProgramFiles%\NVIDIA Corporation\NvToolsExt\include"
xcopy /Y "%SRC_DIR%\temp_build\NvToolsExt\lib\x64\*.*" "%ProgramFiles%\NVIDIA Corporation\NvToolsExt\lib\x64"
echo Installing cuDNN...
7z x %CUDNN_SETUP_FILE% -o"%SRC_DIR%\temp_build\cudnn"
xcopy /Y "%SRC_DIR%\temp_build\cudnn\%CUDNN_FOLDER%\bin\*.*" "%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\bin"
@ -215,4 +175,3 @@ echo Setting up environment...
set "PATH=%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\bin;%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\libnvvp;%PATH%"
set "CUDA_PATH=%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%"
set "CUDA_PATH_V%CUDA_VER_MAJOR%_%CUDA_VER_MINOR%=%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%"
set "NVTOOLSEXT_PATH=%ProgramFiles%\NVIDIA Corporation\NvToolsExt"

View File

@ -99,7 +99,6 @@ goto end
:libtorch
echo "install and test libtorch"
if "%VC_YEAR%" == "2019" powershell internal\vs2019_install.ps1
if "%VC_YEAR%" == "2022" powershell internal\vs2022_install.ps1
if ERRORLEVEL 1 exit /b 1
@ -111,10 +110,6 @@ pushd tmp\libtorch
set VC_VERSION_LOWER=17
set VC_VERSION_UPPER=18
IF "%VC_YEAR%" == "2019" (
set VC_VERSION_LOWER=16
set VC_VERSION_UPPER=17
)
for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -legacy -products * -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (
if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (

View File

@ -1,14 +1,7 @@
if "%VC_YEAR%" == "2019" powershell windows/internal/vs2019_install.ps1
if "%VC_YEAR%" == "2022" powershell windows/internal/vs2022_install.ps1
set VC_VERSION_LOWER=17
set VC_VERSION_UPPER=18
:: Please don't delete VS2019 as an alternative, in case some Windows compiler issue.
:: Reference: https://github.com/pytorch/pytorch/issues/145702#issuecomment-2858693930
if "%VC_YEAR%" == "2019" (
set VC_VERSION_LOWER=16
set VC_VERSION_UPPER=17
)
for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -products Microsoft.VisualStudio.Product.BuildTools -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (
if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (

View File

@ -1,48 +0,0 @@
# https://developercommunity.visualstudio.com/t/install-specific-version-of-vs-component/1142479
# https://docs.microsoft.com/en-us/visualstudio/releases/2019/history#release-dates-and-build-numbers
# 16.8.6 BuildTools
$VS_DOWNLOAD_LINK = "https://ossci-windows.s3.us-east-1.amazonaws.com/vs16.8.6_BuildTools.exe"
$COLLECT_DOWNLOAD_LINK = "https://aka.ms/vscollect.exe"
$VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStudio.Workload.VCTools",
"--add Microsoft.Component.MSBuild",
"--add Microsoft.VisualStudio.Component.Roslyn.Compiler",
"--add Microsoft.VisualStudio.Component.TextTemplating",
"--add Microsoft.VisualStudio.Component.VC.CoreIde",
"--add Microsoft.VisualStudio.Component.VC.Redist.14.Latest",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Core",
"--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Win81")
curl.exe --retry 3 -kL $VS_DOWNLOAD_LINK --output vs_installer.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS 2019 Version 16.8.5 installer failed"
exit 1
}
if (Test-Path "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe") {
$existingPath = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -products "Microsoft.VisualStudio.Product.BuildTools" -version "[16, 17)" -property installationPath
if ($existingPath -ne $null) {
if (!${env:CIRCLECI}) {
echo "Found correctly versioned existing BuildTools installation in $existingPath"
exit 0
}
echo "Found existing BuildTools installation in $existingPath, keeping it"
}
}
$process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VS_INSTALL_ARGS -NoNewWindow -Wait -PassThru
Remove-Item -Path vs_installer.exe -Force
$exitCode = $process.ExitCode
if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {
echo "VS 2019 installer exited with code $exitCode, which should be one of [0, 3010]."
curl.exe --retry 3 -kL $COLLECT_DOWNLOAD_LINK --output Collect.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS Collect tool failed."
exit 1
}
Start-Process "${PWD}\Collect.exe" -NoNewWindow -Wait -PassThru
New-Item -Path "C:\w\build-results" -ItemType "directory" -Force
Copy-Item -Path "C:\Users\${env:USERNAME}\AppData\Local\Temp\vslogs.zip" -Destination "C:\w\build-results\"
exit 1
}

View File

@ -25,8 +25,8 @@ set XPU_EXTRA_INSTALLED=0
set XPU_EXTRA_UNINSTALL=0
if not [%XPU_VERSION%]==[] if [%XPU_VERSION%]==[2025.1] (
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/1a9fff3d-04c2-4d77-8861-3d86c774b66f/intel-deep-learning-essentials-2025.1.1.26_offline.exe
set XPU_BUNDLE_VERSION=2025.1.1+23
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/75d4eb97-914a-4a95-852c-7b9733d80f74/intel-deep-learning-essentials-2025.1.3.8_offline.exe
set XPU_BUNDLE_VERSION=2025.1.3+5
)
:: Check if XPU bundle is target version or already installed

View File

@ -206,7 +206,7 @@ if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 python setup.py bdist_wheel -d "$whl_tmp_dir"
echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
BUILD_PYTHON_ONLY=1 BUILD_LIBTORCH_WHL=0 python setup.py bdist_wheel -d "$whl_tmp_dir" --cmake
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 CMAKE_FRESH=1 python setup.py bdist_wheel -d "$whl_tmp_dir"
echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
else
python setup.py bdist_wheel -d "$whl_tmp_dir"

View File

@ -75,8 +75,8 @@ TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for the all the wheel builds hence append TRITON_CONSTRAINT
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64'"
# CUDA 12.8 builds have triton for Linux and Linux aarch64 binaries.
if [[ "$DESIRED_CUDA" == cu128 ]]; then
# CUDA 12.9 builds have triton for Linux and Linux aarch64 binaries.
if [[ "$DESIRED_CUDA" == "cu129" ]]; then
TRITON_CONSTRAINT="platform_system == 'Linux'"
fi
@ -105,6 +105,7 @@ fi
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton xpu package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* ]]; then
TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_xpu_version.txt)
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-8 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-xpu.txt)

View File

@ -1,157 +0,0 @@
# Documentation: https://docs.microsoft.com/en-us/rest/api/azure/devops/build/?view=azure-devops-rest-6.0
import json
import os
import re
import sys
import time
import requests
AZURE_PIPELINE_BASE_URL = "https://aiinfra.visualstudio.com/PyTorch/"
AZURE_DEVOPS_PAT_BASE64 = os.environ.get("AZURE_DEVOPS_PAT_BASE64_SECRET", "")
PIPELINE_ID = "911"
PROJECT_ID = "0628bce4-2d33-499e-bac5-530e12db160f"
TARGET_BRANCH = os.environ.get("CIRCLE_BRANCH", "main")
TARGET_COMMIT = os.environ.get("CIRCLE_SHA1", "")
build_base_url = AZURE_PIPELINE_BASE_URL + "_apis/build/builds?api-version=6.0"
s = requests.Session()
s.headers.update({"Authorization": "Basic " + AZURE_DEVOPS_PAT_BASE64})
def submit_build(pipeline_id, project_id, source_branch, source_version):
print("Submitting build for branch: " + source_branch)
print("Commit SHA1: ", source_version)
run_build_raw = s.post(
build_base_url,
json={
"definition": {"id": pipeline_id},
"project": {"id": project_id},
"sourceBranch": source_branch,
"sourceVersion": source_version,
},
)
try:
run_build_json = run_build_raw.json()
except json.decoder.JSONDecodeError as e:
print(e)
print(
"Failed to parse the response. Check if the Azure DevOps PAT is incorrect or expired."
)
sys.exit(-1)
build_id = run_build_json["id"]
print("Submitted bulid: " + str(build_id))
print("Bulid URL: " + run_build_json["url"])
return build_id
def get_build(_id):
get_build_url = (
AZURE_PIPELINE_BASE_URL + f"/_apis/build/builds/{_id}?api-version=6.0"
)
get_build_raw = s.get(get_build_url)
return get_build_raw.json()
def get_build_logs(_id):
get_build_logs_url = (
AZURE_PIPELINE_BASE_URL + f"/_apis/build/builds/{_id}/logs?api-version=6.0"
)
get_build_logs_raw = s.get(get_build_logs_url)
return get_build_logs_raw.json()
def get_log_content(url):
resp = s.get(url)
return resp.text
def wait_for_build(_id):
build_detail = get_build(_id)
build_status = build_detail["status"]
while build_status == "notStarted":
print("Waiting for run to start: " + str(_id))
sys.stdout.flush()
try:
build_detail = get_build(_id)
build_status = build_detail["status"]
except Exception as e:
print("Error getting build")
print(e)
time.sleep(30)
print("Bulid started: ", str(_id))
handled_logs = set()
while build_status == "inProgress":
try:
print("Waiting for log: " + str(_id))
logs = get_build_logs(_id)
except Exception as e:
print("Error fetching logs")
print(e)
time.sleep(30)
continue
for log in logs["value"]:
log_id = log["id"]
if log_id in handled_logs:
continue
handled_logs.add(log_id)
print("Fetching log: \n" + log["url"])
try:
log_content = get_log_content(log["url"])
print(log_content)
except Exception as e:
print("Error getting log content")
print(e)
sys.stdout.flush()
build_detail = get_build(_id)
build_status = build_detail["status"]
time.sleep(30)
build_result = build_detail["result"]
print("Bulid status: " + build_status)
print("Bulid result: " + build_result)
return build_status, build_result
if __name__ == "__main__":
# Convert the branch name for Azure DevOps
match = re.search(r"pull/(\d+)", TARGET_BRANCH)
if match is not None:
pr_num = match.group(1)
SOURCE_BRANCH = f"refs/pull/{pr_num}/head"
else:
SOURCE_BRANCH = f"refs/heads/{TARGET_BRANCH}"
MAX_RETRY = 2
retry = MAX_RETRY
while retry > 0:
build_id = submit_build(PIPELINE_ID, PROJECT_ID, SOURCE_BRANCH, TARGET_COMMIT)
build_status, build_result = wait_for_build(build_id)
if build_result != "succeeded":
retry = retry - 1
if retry > 0:
print("Retrying... remaining attempt: " + str(retry))
# Wait a bit before retrying
time.sleep((MAX_RETRY - retry) * 120)
continue
else:
print("No more chance to retry. Giving up.")
sys.exit(-1)
else:
break

View File

@ -12,7 +12,9 @@ body:
description: |
Please provide a clear and concise description of what the bug is.
If relevant, add a minimal example so that we can reproduce the error by running the code. It is very important for the snippet to be as succinct (minimal) as possible, so please take time to trim down any irrelevant code to help us debug efficiently. We are going to copy-paste your code and we expect to get the same result as you did: avoid any external data, and include the relevant imports, etc. For example:
If relevant, add a minimal example so that we can reproduce the error by running the code. It is very important for the snippet to be as succinct (minimal) as possible, so please take time to trim down any irrelevant code to help us debug efficiently.
Your example should be fully self-contained and not rely on any artifact that should be downloaded.
For example:
```python
# All necessary imports at the beginning
@ -26,6 +28,7 @@ body:
If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.
Please also paste or describe the results you observe instead of the expected results. If you observe an error, please paste the error message including the **full** traceback of the exception. It may be relevant to wrap error messages in ```` ```triple quotes blocks``` ````.
If your issue is related to numerical accuracy or reproducibility, please read the [numerical accuracy](https://docs.pytorch.org/docs/stable/notes/numerical_accuracy.html) and [reproducibility](https://docs.pytorch.org/docs/stable/notes/randomness.html) notes. If the difference is not expected as described in these documents, please provide appropriate justification on why one result is wrong and the other is correct.
placeholder: |
A clear and concise description of what the bug is.

View File

@ -14,6 +14,7 @@ self-hosted-runner:
- linux.12xlarge
- linux.24xlarge
- linux.24xlarge.ephemeral
- linux.24xlarge.amd
- linux.arm64.2xlarge
- linux.arm64.2xlarge.ephemeral
- linux.arm64.m7g.4xlarge
@ -49,6 +50,7 @@ self-hosted-runner:
# Organization-wide AMD-hosted runners
# MI2xx runners
- linux.rocm.gpu
- linux.rocm.gpu.mi250
- linux.rocm.gpu.2
- linux.rocm.gpu.4
# MI300 runners

View File

@ -9,7 +9,7 @@ inputs:
arch-for-build-env:
description: |
arch to pass to build environment.
This is currently different than the arch name we use elswhere, which
This is currently different than the arch name we use elsewhere, which
should be fixed.
required: true
github-secret:

View File

@ -157,4 +157,4 @@ runs:
echo "Is keep-going label set? ${{ steps.filter.outputs.keep-going }}"
echo
echo "Renabled issues? ${{ steps.filter.outputs.reenabled-issues }}"
echo "Reenabled issues? ${{ steps.filter.outputs.reenabled-issues }}"

View File

@ -153,7 +153,7 @@ runs:
github-token: ${{ inputs.GITHUB_TOKEN }}
- name: Check for keep-going label and re-enabled test issues
# This uses the filter-test-configs action because it conviniently
# This uses the filter-test-configs action because it conveniently
# checks for labels and re-enabled test issues. It does not actually do
# any filtering. All filtering is done in the build step.
id: keep-going

View File

@ -13,6 +13,12 @@ inputs:
github-token:
description: GitHub token
required: true
job-id:
description: Job ID
required: true
job-name:
description: Job name
required: true
outputs:
reuse:
@ -30,8 +36,11 @@ runs:
continue-on-error: true
env:
GITHUB_TOKEN: ${{ inputs.github-token }}
JOB_ID: ${{ inputs.job-id }}
JOB_NAME: ${{ inputs.job-name }}
run: |
set -x
python3 -m pip install boto3==1.35.42
python3 ${GITHUB_ACTION_PATH}/reuse_old_whl.py \
--build-environment "${{ inputs.build-environment }}" \
--run-id "${{ inputs.run-id }}" \

View File

@ -1,13 +1,22 @@
import argparse
import os
import subprocess
import sys
from functools import lru_cache
from pathlib import Path
from typing import Any, cast, Optional
from typing import Any, cast, Optional, Union
import requests
REPO_ROOT = Path(__file__).resolve().parent.parent.parent.parent
sys.path.insert(0, str(REPO_ROOT))
from tools.stats.upload_metrics import emit_metric
sys.path.remove(str(REPO_ROOT)) # Clean up sys.path after import
FORCE_REBUILD_LABEL = "ci-force-rebuild"
@ -114,15 +123,43 @@ def ok_changed_file(file: str) -> bool:
return True
if file.startswith("test/") and file.endswith(".py"):
return True
if file.startswith("docs/") and file.endswith((".md", ".rst")):
return True
return False
def check_changed_files(sha: str) -> bool:
# Return true if all the changed files are in the list of allowed files to
# be changed to reuse the old whl
# Removing files in the torch folder is not allowed since rsync will not
# remove files
removed_files = (
subprocess.check_output(
[
"git",
"diff",
"--name-only",
sha,
"HEAD",
"--diff-filter=D",
"--no-renames",
],
text=True,
stderr=subprocess.DEVNULL,
)
.strip()
.split()
)
if any(file.startswith("torch/") for file in removed_files):
print(
f"Removed files between {sha} and HEAD: {removed_files}, cannot reuse old whl"
)
return False
changed_files = (
subprocess.check_output(
["git", "diff", "--name-only", sha, "HEAD"],
["git", "diff", "--name-only", sha, "HEAD", "--no-renames"],
text=True,
stderr=subprocess.DEVNULL,
)
@ -179,38 +216,83 @@ def unzip_artifact_and_replace_files() -> None:
)
os.remove("artifacts.zip")
head_sha = get_head_sha()
# Rename wheel into zip
wheel_path = Path("artifacts/dist").glob("*.whl")
for path in wheel_path:
new_path = path.with_suffix(".zip")
os.rename(path, new_path)
print(f"Renamed {path} to {new_path}")
print(new_path.stem)
# Should be of the form torch-2.0.0+git1234567-cp37-etc.whl
# Should usually be the merge base sha but for the ones that didn't do
# the replacement, it won't be. Can probably change it to just be merge
# base later
old_version = f"+git{path.stem.split('+')[1].split('-')[0][3:]}"
new_version = f"+git{head_sha[:7]}"
def rename_to_new_version(file: Union[str, Path]) -> None:
# Rename file with old_version to new_version
subprocess.check_output(
["mv", file, str(file).replace(old_version, new_version)]
)
def change_content_to_new_version(file: Union[str, Path]) -> None:
# Check if is a file
if os.path.isdir(file):
return
# Replace the old version in the file with the new version
with open(file) as f:
content = f.read()
content = content.replace(old_version, new_version)
with open(file, "w") as f:
f.write(content)
zip_path = path.with_suffix(".zip")
os.rename(path, zip_path)
old_stem = zip_path.stem
# Unzip the wheel
subprocess.check_output(
["unzip", "-o", new_path, "-d", f"artifacts/dist/{new_path.stem}"],
["unzip", "-o", zip_path, "-d", f"artifacts/dist/{old_stem}"],
)
# Remove the old wheel (which is now a zip file)
os.remove(zip_path)
# Copy python files into the artifact
subprocess.check_output(
["rsync", "-avz", "torch", f"artifacts/dist/{new_path.stem}"],
["rsync", "-avz", "torch", f"artifacts/dist/{old_stem}"],
)
change_content_to_new_version(f"artifacts/dist/{old_stem}/torch/version.py")
for file in Path(f"artifacts/dist/{old_stem}").glob(
"*.dist-info/**",
):
change_content_to_new_version(file)
rename_to_new_version(f"artifacts/dist/{old_stem}")
new_stem = old_stem.replace(old_version, new_version)
for file in Path(f"artifacts/dist/{new_stem}").glob(
"*.dist-info",
):
rename_to_new_version(file)
# Zip the wheel back
subprocess.check_output(
["zip", "-r", f"{new_path.stem}.zip", "."],
cwd=f"artifacts/dist/{new_path.stem}",
["zip", "-r", f"{new_stem}.zip", "."],
cwd=f"artifacts/dist/{new_stem}",
)
subprocess.check_output(
[
"mv",
f"artifacts/dist/{new_path.stem}/{new_path.stem}.zip",
f"artifacts/dist/{new_path.stem}.whl",
f"artifacts/dist/{new_stem}/{new_stem}.zip",
f"artifacts/dist/{new_stem}.whl",
],
)
# Remove the extracted folder
subprocess.check_output(
["rm", "-rf", f"artifacts/dist/{new_path.stem}"],
["rm", "-rf", f"artifacts/dist/{new_stem}"],
)
# Rezip the artifact
@ -244,46 +326,60 @@ def parse_args() -> argparse.Namespace:
return parser.parse_args()
def can_reuse_whl(args: argparse.Namespace) -> bool:
# if is_main_branch() or (
# args.github_ref
# and any(
# args.github_ref.startswith(x)
# for x in ["refs/heads/release", "refs/tags/v", "refs/heads/main"]
# )
# ):
# print("On main branch or release branch, rebuild whl")
# return False
if check_labels_for_pr():
print(f"Found {FORCE_REBUILD_LABEL} label on PR, rebuild whl")
return False
if check_issue_open():
print("Issue #153759 is open, rebuild whl")
return False
def can_reuse_whl(args: argparse.Namespace) -> tuple[bool, str]:
if args.github_ref and any(
args.github_ref.startswith(x)
for x in [
"refs/heads/release",
"refs/tags/v",
"refs/heads/nightly",
]
):
print("Release branch, rebuild whl")
return (False, "Release branch")
if not check_changed_files(get_merge_base()):
print("Cannot use old whl due to the changed files, rebuild whl")
return False
return (False, "Changed files not allowed")
if check_labels_for_pr():
print(f"Found {FORCE_REBUILD_LABEL} label on PR, rebuild whl")
return (False, "Found FORCE_REBUILD_LABEL on PR")
if check_issue_open():
print("Issue #153759 is open, rebuild whl")
return (False, "Issue #153759 is open")
workflow_id = get_workflow_id(args.run_id)
if workflow_id is None:
print("No workflow ID found, rebuild whl")
return False
return (False, "No workflow ID found")
if not find_old_whl(workflow_id, args.build_environment, get_merge_base()):
print("No old whl found, rebuild whl")
return (False, "No old whl found")
# TODO: go backwards from merge base to find more runs
return False
return True
return (True, "Found old whl")
if __name__ == "__main__":
args = parse_args()
if can_reuse_whl(args):
reuse_whl, reason = can_reuse_whl(args)
if reuse_whl:
print("Reusing old whl")
unzip_artifact_and_replace_files()
set_output()
emit_metric(
"reuse_old_whl",
{
"reuse_whl": reuse_whl,
"reason": reason,
"build_environment": args.build_environment,
"merge_base": get_merge_base(),
"head_sha": get_head_sha(),
},
)

View File

@ -33,14 +33,14 @@ runs:
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Start docker if docker deamon is not running
- name: Start docker if docker daemon is not running
shell: bash
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
run: |
if systemctl is-active --quiet docker; then
echo "Docker daemon is running...";
else
echo "Starting docker deamon..." && sudo systemctl start docker;
echo "Starting docker daemon..." && sudo systemctl start docker;
fi
- name: Log in to ECR

View File

@ -29,13 +29,13 @@ runs:
if: always()
shell: bash
run: |
xpu-smi discovery
timeout 30 xpu-smi discovery || true
- name: Runner health check GPU count
if: always()
shell: bash
run: |
ngpu=$(xpu-smi discovery | grep -c -E 'Device Name')
ngpu=$(timeout 30 xpu-smi discovery | grep -c -E 'Device Name' || true)
msg="Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified"
if [[ $ngpu -eq 0 ]]; then
echo "Error: Failed to detect any GPUs on the runner"

View File

@ -1 +1 @@
1a8f6213b0b61efc6a4862bc45b853551a93dbb6
4e94321c54617dd738a05bfedfc28bc0fa635b5c

View File

@ -1 +1 @@
edc1a882d872dd7f1362e4312fd045a1d81b3355
r2.8

1
.github/labeler.yml vendored
View File

@ -116,7 +116,6 @@
"release notes: inductor (aoti)":
- torch/_C/_aoti.pyi
- torch/_dynamo/repro/aoti.py
- torch/_export/serde/aoti_schema.py
- torch/_higher_order_ops/aoti_call_delegate.py
- torch/_inductor/codegen/aoti_runtime/**
- torch/_inductor/codegen/aoti_hipify_utils.py

View File

@ -123,6 +123,8 @@
- torch/*docs.py
approved_by:
- svekars
- sekyondaMeta
- AlannaBurke
mandatory_checks_name:
- EasyCLA
- Lint

View File

@ -11,6 +11,7 @@ ciflow_push_tags:
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
- ciflow/inductor-perf-test-nightly-x86-zen
- ciflow/inductor-cu126
- ciflow/linux-aarch64
- ciflow/mps
@ -28,6 +29,7 @@ ciflow_push_tags:
- ciflow/op-benchmark
- ciflow/pull
- ciflow/h100
- ciflow/h100-distributed
retryable_workflows:
- pull
- trunk

View File

@ -10,5 +10,5 @@ lintrunner==0.10.7
ninja==1.10.0.post1
nvidia-ml-py==11.525.84
pyyaml==6.0
requests==2.32.2
requests==2.32.4
rich==10.9.0

View File

@ -2,5 +2,4 @@
certifi
pip=23.2.1
pkg-config=0.29.2
setuptools=72.1.0
wheel=0.37.1

View File

@ -1,5 +1,5 @@
boto3==1.35.42
cmake==3.25.*
cmake==3.27.*
expecttest==0.3.0
fbscribelogger==0.1.7
filelock==3.6.0
@ -14,7 +14,7 @@ opt-einsum>=3.3
optree==0.13.0
packaging==23.1
parameterized==0.8.1
pillow==10.0.1
pillow==10.3.0
protobuf==5.29.4
psutil==5.9.1
pygments==2.15.0
@ -26,7 +26,9 @@ pytest-xdist==3.3.1
pytest==7.3.2
pyyaml==6.0.2
scipy==1.12.0
setuptools==72.1.0
sympy==1.13.3
tlparse==0.3.30
tensorboard==2.13.0
typing-extensions==4.12.2
unittest-xml-reporting<=3.2.0,>=2.0.0

View File

@ -78,7 +78,7 @@ for pkg in /$WHEELHOUSE_DIR/*triton*.whl; do
echo "Copied $filepath to $patchedpath"
done
# Go through all required shared objects and see if any of our other objects are dependants. If so, replace so.ver wth so
# Go through all required shared objects and see if any of our other objects are dependants. If so, replace so.ver with so
for ((i=0;i<${#deps[@]};++i)); do
echo "replacing "${deps_soname[i]} ${patched[i]}
replace_needed_sofiles $PREFIX/$ROCM_LIB ${deps_soname[i]} ${patched[i]}

View File

@ -21,8 +21,11 @@ def read_triton_pin(device: str = "cuda") -> str:
return f.read().strip()
def read_triton_version() -> str:
with open(REPO_DIR / ".ci" / "docker" / "triton_version.txt") as f:
def read_triton_version(device: str = "cuda") -> str:
triton_version_file = "triton_version.txt"
if device == "xpu":
triton_version_file = "triton_xpu_version.txt"
with open(REPO_DIR / ".ci" / "docker" / triton_version_file) as f:
return f.read().strip()
@ -65,6 +68,7 @@ def build_triton(
with TemporaryDirectory() as tmpdir:
triton_basedir = Path(tmpdir) / "triton"
triton_pythondir = triton_basedir / "python"
triton_repo = "https://github.com/openai/triton"
if device == "rocm":
triton_pkg_name = "pytorch-triton-rocm"
@ -90,7 +94,7 @@ def build_triton(
patch_init_py(
triton_pythondir / "triton" / "__init__.py",
version=f"{version}",
expected_version=None,
expected_version=read_triton_version(device),
)
if device == "rocm":
@ -101,11 +105,19 @@ def build_triton(
)
print("ROCm libraries setup for triton installation...")
check_call(
[sys.executable, "setup.py", "bdist_wheel"], cwd=triton_pythondir, env=env
# old triton versions have setup.py in the python/ dir,
# new versions have it in the root dir.
triton_setupdir = (
triton_basedir
if (triton_basedir / "setup.py").exists()
else triton_pythondir
)
whl_path = next(iter((triton_pythondir / "dist").glob("*.whl")))
check_call(
[sys.executable, "setup.py", "bdist_wheel"], cwd=triton_setupdir, env=env
)
whl_path = next(iter((triton_setupdir / "dist").glob("*.whl")))
shutil.copy(whl_path, Path.cwd())
if device == "rocm":
@ -128,15 +140,19 @@ def main() -> None:
parser.add_argument("--py-version", type=str)
parser.add_argument("--commit-hash", type=str)
parser.add_argument("--with-clang-ldd", action="store_true")
parser.add_argument("--triton-version", type=str, default=read_triton_version())
parser.add_argument("--triton-version", type=str, default=None)
args = parser.parse_args()
triton_version = read_triton_version(args.device)
if args.triton_version:
triton_version = args.triton_version
build_triton(
device=args.device,
commit_hash=(
args.commit_hash if args.commit_hash else read_triton_pin(args.device)
),
version=args.triton_version,
version=triton_version,
py_version=args.py_version,
release=args.release,
with_clang_ldd=args.with_clang_ldd,

View File

@ -40,9 +40,9 @@ SUPPORTED_PERIODICAL_MODES: dict[str, Callable[[Optional[str]], bool]] = {
}
# The link to the published list of disabled jobs
DISABLED_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json"
DISABLED_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json?versionId=HnkH0xQWnnsoeMsSIVf9291NE5c4jWSa"
# and unstable jobs
UNSTABLE_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/unstable-jobs.json"
UNSTABLE_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/unstable-jobs.json?versionId=iP_F8gBs60PfOMAJ8gnn1paVrzM1WYsK"
# Some constants used to handle disabled and unstable jobs
JOB_NAME_SEP = "/"
@ -80,7 +80,7 @@ def parse_args() -> Any:
parser.add_argument(
"--job-name",
type=str,
help="the name of the current job, i.e. linux-focal-py3.8-gcc7 / build",
help="the name of the current job, i.e. linux-jammy-py3.8-gcc7 / build",
)
parser.add_argument("--pr-number", type=str, help="the pull request number")
parser.add_argument("--tag", type=str, help="the associated tag if it exists")

View File

@ -15,21 +15,21 @@ import os
from typing import Optional
# NOTE: Also update the CUDA sources in tools/nightly.py when changing this list
CUDA_ARCHES = ["11.8", "12.6", "12.8"]
CUDA_STABLE = "12.6"
# NOTE: Please also update the CUDA sources in `PIP_SOURCES` in tools/nightly.py when changing this
CUDA_ARCHES = ["12.6", "12.8", "12.9"]
CUDA_STABLE = "12.8"
CUDA_ARCHES_FULL_VERSION = {
"11.8": "11.8.0",
"12.6": "12.6.3",
"12.8": "12.8.1",
"12.9": "12.9.1",
}
CUDA_ARCHES_CUDNN_VERSION = {
"11.8": "9",
"12.6": "9",
"12.8": "9",
"12.9": "9",
}
# NOTE: Also update the ROCm sources in tools/nightly.py when changing this list
# NOTE: Please also update the ROCm sources in `PIP_SOURCES` in tools/nightly.py when changing this
ROCM_ARCHES = ["6.3", "6.4"]
XPU_ARCHES = ["xpu"]
@ -38,35 +38,22 @@ CPU_AARCH64_ARCH = ["cpu-aarch64"]
CPU_S390X_ARCH = ["cpu-s390x"]
CUDA_AARCH64_ARCHES = ["12.8-aarch64"]
CUDA_AARCH64_ARCHES = ["12.9-aarch64"]
PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"11.8": (
"nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | " # noqa: B950
"nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.6": (
"nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'"
@ -75,25 +62,41 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.9": (
"nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"xpu": (
"intel-cmplr-lib-rt==2025.1.1 | "
"intel-cmplr-lib-ur==2025.1.1 | "
"intel-cmplr-lic-rt==2025.1.1 | "
"intel-sycl-rt==2025.1.1 | "
"oneccl-devel==2021.15.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"oneccl==2021.15.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"oneccl-devel==2021.15.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"oneccl==2021.15.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"impi-rt==2021.15.0; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"onemkl-sycl-blas==2025.1.0 | "
"onemkl-sycl-dft==2025.1.0 | "
@ -107,7 +110,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"tbb==2022.1.0 | "
"tcmlib==1.3.0 | "
"umf==0.10.0 | "
"intel-pti==0.12.0"
"intel-pti==0.12.3"
),
}
@ -311,10 +314,10 @@ def generate_wheels_matrix(
continue
if use_split_build and (
arch_version not in ["12.6", "12.8", "11.8", "cpu"] or os != "linux"
arch_version not in ["12.6", "12.8", "12.9", "cpu"] or os != "linux"
):
raise RuntimeError(
"Split build is only supported on linux with cuda 12*, 11.8, and cpu.\n"
"Split build is only supported on linux with cuda 12* and cpu.\n"
f"Currently attempting to build on arch version {arch_version} and os {os}.\n"
"Please modify the matrix generation to exclude this combination."
)
@ -322,7 +325,7 @@ def generate_wheels_matrix(
# cuda linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install
if (
arch_version in ["12.8", "12.6", "11.8"]
arch_version in ["12.9", "12.8", "12.6"]
and os == "linux"
or arch_version in CUDA_AARCH64_ARCHES
):
@ -411,6 +414,6 @@ def generate_wheels_matrix(
return ret
validate_nccl_dep_consistency("12.9")
validate_nccl_dep_consistency("12.8")
validate_nccl_dep_consistency("12.6")
validate_nccl_dep_consistency("11.8")

View File

@ -152,7 +152,7 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["11.8", "12.6", "12.8"],
arches=["12.6", "12.8", "12.9", "6.4"],
python_versions=["3.9"],
),
branches="main",

View File

@ -64,7 +64,7 @@ def fetch_url(
)
exception_message = (
"Is github alright?",
f"Recieved status code '{err.code}' when attempting to retrieve {url}:\n",
f"Received status code '{err.code}' when attempting to retrieve {url}:\n",
f"{err.reason}\n\nheaders={err.headers}",
)
raise RuntimeError(exception_message) from err

View File

@ -211,7 +211,7 @@ class GitRepo:
self, from_branch: str, to_branch: str
) -> tuple[list[str], list[str]]:
"""
Returns list of commmits that are missing in each other branch since their merge base
Returns list of commits that are missing in each other branch since their merge base
Might be slow if merge base is between two branches is pretty far off
"""
from_ref = self.rev_parse(from_branch)

Binary file not shown.

View File

@ -12,7 +12,7 @@ BASE=${BASE:-HEAD~1}
HEAD=${HEAD:-HEAD}
ancestor=$(git merge-base "${BASE}" "${HEAD}")
echo "INFO: Checking aginst the following stats"
echo "INFO: Checking against the following stats"
(
set -x
git diff --stat=10000 "$ancestor" "${HEAD}" | sed '$d' > "${TMPFILE}"

View File

@ -347,26 +347,26 @@ class TestConfigFilter(TestCase):
{
"job_name": "a-ci-job",
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", runner: "macos"}]}',
"descripion": "Replicate each periodic mode in a different config",
"description": "Replicate each periodic mode in a different config",
},
{
"job_name": "a-ci-cuda11.8-job",
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", runner: "macos"}]}',
"descripion": "Replicate each periodic mode in a different config for a CUDA job",
"description": "Replicate each periodic mode in a different config for a CUDA job",
},
{
"job_name": "a-ci-rocm-job",
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", runner: "macos"}]}',
"descripion": "Replicate each periodic mode in a different config for a ROCm job",
"description": "Replicate each periodic mode in a different config for a ROCm job",
},
{
"job_name": "",
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", runner: "macos"}]}',
"descripion": "Empty job name",
"description": "Empty job name",
},
{
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", runner: "macos"}]}',
"descripion": "Missing job name",
"description": "Missing job name",
},
]

View File

@ -19,6 +19,7 @@ from urllib.error import HTTPError
from github_utils import gh_graphql
from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo
from trymerge import (
_revlist_to_prs,
categorize_checks,
DRCI_CHECKRUN_NAME,
find_matching_merge_rule,
@ -264,7 +265,7 @@ class DummyGitRepo(GitRepo):
return ["FakeCommitSha"]
def commit_message(self, ref: str) -> str:
return "super awsome commit message"
return "super awesome commit message"
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@ -432,7 +433,7 @@ class TestTryMerge(TestCase):
)
def test_cancelled_gets_ignored(self, *args: Any) -> None:
"""Tests that cancelled workflow does not override existing successfull status"""
"""Tests that cancelled workflow does not override existing successful status"""
pr = GitHubPR("pytorch", "pytorch", 110367)
conclusions = pr.get_checkrun_conclusions()
lint_checks = [name for name in conclusions.keys() if "Lint" in name]
@ -1088,5 +1089,51 @@ class TestGitHubPRGhstackDependencies(TestCase):
)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
@mock.patch(
"trymerge.get_drci_classifications", side_effect=mocked_drci_classifications
)
@mock.patch.object(DummyGitRepo, "commit_message")
class TestRevListToPR(TestCase):
# Tests for _revlist_to_prs function
def test__revlist_to_prs_zero_matches(
self, mock_commit_message: mock.MagicMock, *args: Any
) -> None:
# If zero PRs are mentioned in the commit message, it should raise an error
pr_num = 154098
pr = GitHubPR("pytorch", "pytorch", pr_num)
repo = DummyGitRepo()
mock_commit_message.return_value = "no PRs"
self.assertRaisesRegex(
RuntimeError,
"PRs mentioned in commit dummy: 0.",
lambda: _revlist_to_prs(repo, pr, ["dummy"]),
)
def test__revlist_to_prs_two_prs(
self, mock_commit_message: mock.MagicMock, *args: Any
) -> None:
# If two PRs are mentioned in the commit message, it should raise an error
pr_num = 154394
pr = GitHubPR("pytorch", "pytorch", pr_num)
repo = DummyGitRepo()
# https://github.com/pytorch/pytorch/commit/343c56e7650f55fd030aca0b9275d6d73501d3f4
commit_message = """add sticky cache pgo
ghstack-source-id: 9bc6dee0b427819f978bfabccb72727ba8be2f81
Pull-Request-resolved: https://github.com/pytorch/pytorch/pull/154098
ghstack-source-id: 9bc6dee0b427819f978bfabccb72727ba8be2f81
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154394"""
mock_commit_message.return_value = commit_message
self.assertRaisesRegex(
RuntimeError,
"PRs mentioned in commit dummy: 2.",
lambda: _revlist_to_prs(repo, pr, ["dummy"]),
)
if __name__ == "__main__":
main()

View File

@ -628,11 +628,17 @@ def _revlist_to_prs(
rc: list[tuple[GitHubPR, str]] = []
for idx, rev in enumerate(rev_list):
msg = repo.commit_message(rev)
m = RE_PULL_REQUEST_RESOLVED.search(msg)
if m is None:
# findall doesn't return named captures, so we need to use finditer
all_matches = list(RE_PULL_REQUEST_RESOLVED.finditer(msg))
if len(all_matches) != 1:
raise RuntimeError(
f"Could not find PR-resolved string in {msg} of ghstacked PR {pr.pr_num}"
f"Found an unexpected number of PRs mentioned in commit {rev}: "
f"{len(all_matches)}. This is probably because you are using an "
"old version of ghstack. Please update ghstack and resubmit "
"your PRs"
)
m = all_matches[0]
if pr.org != m.group("owner") or pr.project != m.group("repo"):
raise RuntimeError(
f"PR {m.group('number')} resolved to wrong owner/repo pair"
@ -666,6 +672,9 @@ def get_ghstack_prs(
assert pr.is_ghstack_pr()
entire_stack = _revlist_to_prs(repo, pr, reversed(rev_list), skip_func)
print(
f"Found {len(entire_stack)} PRs in the stack for {pr.pr_num}: {[x[0].pr_num for x in entire_stack]}"
)
for stacked_pr, rev in entire_stack:
if stacked_pr.is_closed():

View File

@ -17,7 +17,6 @@ if errorlevel 1 exit /b 1
set "PATH=C:\Tools;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUVER%\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUVER%\libnvvp;%PATH%"
set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUVER%
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
mkdir magma_cuda%CUVER_NODOT%
cd magma_cuda%CUVER_NODOT%
@ -35,15 +34,15 @@ cd magma
mkdir build && cd build
set GPU_TARGET=All
if "%CUVER_NODOT%" == "129" (
set CUDA_ARCH_LIST=-gencode=arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
)
if "%CUVER_NODOT%" == "128" (
set CUDA_ARCH_LIST=-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
)
if "%CUVER_NODOT:~0,2%" == "12" if NOT "%CUVER_NODOT%" == "128" (
if "%CUVER_NODOT%" == "126" (
set CUDA_ARCH_LIST=-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90
)
if "%CUVER_NODOT%" == "118" (
set CUDA_ARCH_LIST= -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90
)
set CC=cl.exe
set CXX=cl.exe

View File

@ -32,7 +32,7 @@ concurrency:
{%- macro setup_ec2_windows() -%}
!{{ display_ec2_information() }}
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.8
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}

View File

@ -56,7 +56,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.8
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -114,12 +114,12 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
{%- elif config["gpu_arch_type"] == "rocm" %}
runs_on: linux.rocm.gpu
{%- elif config["gpu_arch_type"] == "cuda" and config["gpu_arch_version"] == "12.8" %}
{%- elif config["gpu_arch_type"] == "cuda" and config["gpu_arch_version"] in ["12.8", "12.9"] %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8 build needs sm_70+ runner
{%- elif config["gpu_arch_type"] == "cuda" and config["gpu_arch_version"] != "12.8"%}
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8 and 12.9 build need sm_70+ runner
{%- elif config["gpu_arch_type"] == "cuda" %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
runs_on: linux.4xlarge.nvidia.gpu # for other cuda versions, we use 4xlarge runner
{%- else %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
@ -150,10 +150,10 @@ jobs:
with:
name: !{{ config["build_name"] }}
path: "${{ runner.temp }}/artifacts/"
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.8
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: !{{ config["container_image"] }}
@ -161,7 +161,7 @@ jobs:
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.8
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
@ -171,7 +171,7 @@ jobs:
- name: Teardown XPU
uses: ./.github/actions/teardown-xpu
{%- else %}
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config) }}
steps:
@ -182,7 +182,7 @@ jobs:
with:
name: !{{ config["build_name"] }}
path: "${{ runner.temp }}/artifacts/"
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
@ -196,7 +196,7 @@ jobs:
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.8
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: !{{ config["container_image"] }}
@ -204,7 +204,7 @@ jobs:
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.8
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary

View File

@ -76,7 +76,7 @@ jobs:
elif [ -d "/Applications/Xcode_13.3.1.app" ]; then
echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
fi
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: Populate binary env
run: |
# shellcheck disable=SC1091

View File

@ -64,7 +64,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.8
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -135,7 +135,7 @@ jobs:
{%- else %}
!{{ set_runner_specific_vars() }}
!{{ common.setup_ec2_windows() }}
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
{%- endif %}
- name: Populate binary env
shell: bash
@ -211,7 +211,7 @@ jobs:
"pytorch/.ci/pytorch/windows/arm64/bootstrap_rust.bat"
{%- else %}
!{{ common.setup_ec2_windows() }}
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
!{{ set_runner_specific_vars() }}
{%- endif %}
- uses: !{{ common.download_artifact_action }}

View File

@ -47,7 +47,7 @@ jobs:
reenabled-issues: ${{ steps.filter.outputs.reenabled-issues }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.8
with:
fetch-depth: 1
submodules: false
@ -69,25 +69,25 @@ jobs:
runs-on: ${{ matrix.runner }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.8
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.8
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.8
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.8
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -97,7 +97,7 @@ jobs:
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
uses: pytorch/test-infra/.github/actions/setup-nvidia@release/2.8
if: ${{ inputs.cuda-version != 'cpu' && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Output disk space left
@ -209,5 +209,5 @@ jobs:
file-suffix: bazel-${{ github.job }}_${{ steps.get-job-id.outputs.job-id }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.8
if: always()

View File

@ -151,13 +151,13 @@ jobs:
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
if: inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.8
continue-on-error: true
with:
github-secret: ${{ secrets.github-token }}
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.8
with:
no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' || inputs.build_environment == 'linux-s390x-binary-manywheel' }}
@ -187,7 +187,6 @@ jobs:
- name: Checkout PyTorch to pytorch dir
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
@ -222,9 +221,9 @@ jobs:
- name: Calculate docker image
id: calculate-docker-image
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.8
with:
# If doing this in main or release branch, use docker.io. Otherwise
# If doing this in release/2.8 or release branch, use docker.io. Otherwise
# use ECR
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: ${{ inputs.DOCKER_IMAGE }}
@ -236,7 +235,7 @@ jobs:
- name: Pull Docker image
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.8
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -293,7 +292,7 @@ jobs:
- name: Teardown Linux
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.8
- name: Chown workspace
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'

View File

@ -134,14 +134,14 @@ jobs:
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
if: inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.8
continue-on-error: true
with:
github-secret: ${{ secrets.github-token }}
# Setup the environment
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.8
with:
no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' || inputs.build_environment == 'linux-s390x-binary-manywheel' }}
@ -164,7 +164,6 @@ jobs:
- name: Checkout PyTorch to pytorch dir
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
show-progress: false
path: pytorch
@ -195,7 +194,7 @@ jobs:
path: "${{ runner.temp }}/artifacts/"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
uses: pytorch/test-infra/.github/actions/setup-nvidia@release/2.8
if: ${{ inputs.GPU_ARCH_TYPE == 'cuda' && steps.filter.outputs.is-test-matrix-empty == 'False' }}
- name: configure aws credentials
@ -210,7 +209,7 @@ jobs:
- name: Calculate docker image
id: calculate-docker-image
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.8
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: ${{ inputs.DOCKER_IMAGE }}
@ -220,7 +219,7 @@ jobs:
- name: Pull Docker image
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.8
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -232,7 +231,7 @@ jobs:
- name: Teardown Linux
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.8
- name: Chown workspace
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'

Some files were not shown because too many files have changed in this diff Show More