CUDA 13.0 builds fix on Amazon Linux 2023 (#164870)
During 2.9 RC testing, I am seeing an issue on Amazon Linux 2023 with CUDA 13.0 builds.
This is related to:
https://github.com/pytorch/pytorch/issues/152756
Workflow: https://github.com/pytorch/test-infra/actions/runs/18324074610/job/52184079262
Error:
```
WARNING: There was an error checking the latest version of pip.
+ python3.11 .ci/pytorch/smoke_test/smoke_test.py --package torchonly
Traceback (most recent call last):
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 333, in _load_global_deps
ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
File "/usr/lib64/python3.11/ctypes/__init__.py", line 376, in __init__
self._handle = _dlopen(self._name, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libcudart.so.13: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/pytorch/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 12, in <module>
import torch
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 425, in <module>
_load_global_deps()
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 383, in _load_global_deps
_preload_cuda_deps(lib_folder, lib_name)
File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 317, in _preload_cuda_deps
raise ValueError(f"{lib_name} not found in the system path {sys.path}")
Traceback (most recent call last):
ValueError: libnvToolsExt.so.*[0-9] not found in the system path ['/pytorch/pytorch/.ci/pytorch/smoke_test', '/usr/lib64/python311.zip', '/usr/lib64/python3.11', '/usr/lib64/python3.11/lib-dynload', '/usr/local/lib64/python3.11/site-packages', '/usr/local/lib/python3.11/site-packages', '/usr/lib64/python3.11/site-packages', '/usr/lib/python3.11/site-packages']
File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
main()
File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
run_cmd_or_die(f"docker exec -t {container_name} /exec")
File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t 7d9c5bd403cac9a9ee824d63a1d6f6057ecce89a7daa94a81617dbf8eff0ff2e /exec failed with exit code 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164870
Approved by: https://github.com/Camyll
(cherry picked from commit 483f4e0db91166128ad8922d86dc7222338d4ecc)
Co-authored-by: atalman <atalman@fb.com>
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
fix cpp extension distributed warning spew (#162764)
With this change we only log the warning if we're running non-distributed code or if we're on rank 0. Unit testing that certain messages get printed only on certain ranks feels kind of janky, so the test plan is below instead.
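A minimal sketch of that gating, using hypothetical helper names (the actual logging call sites live in `torch/utils/cpp_extension.py`):
```python
import logging

import torch.distributed as dist

logger = logging.getLogger("torch.utils.cpp_extension")

def _warn_once_on_rank_zero(msg: str) -> None:
    # Hypothetical helper illustrating the gating described above: emit the message
    # only for non-distributed runs, or on rank 0 of an initialized process group.
    if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
        logger.warning(msg)
```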
Test plan
```python
# torchrun --nproc_per_node=2 demo_fix.py
import os
import logging
logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)
import torch
if 'RANK' in os.environ:
    torch.distributed.init_process_group('nccl')
from torch.utils.cpp_extension import _get_cuda_arch_flags
_get_cuda_arch_flags()
print(f"Rank {os.environ.get('RANK', '0')} done")
```
Logs showing how `TORCH_CUDA_ARCH_LIST` only shows up once if we explicitly set the logging level to `logging.DEBUG`. The change also improves the debug message to explain what the actual behavior will be.
```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
[rank0]:V0911 18:30:18.921000 1316753 pytorch/torch/utils/cpp_extension.py:2444] TORCH_CUDA_ARCH_LIST is not set, using TORCH_CUDA_ARCH_LIST='10.0+PTX' for visible GPU architectures. Set os.environ['TORCH_CUDA_ARCH_LIST'] to override.
Rank 0 done
Rank 1 done
```
But if we just use the default and comment out `logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)`
Then we get
```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
Rank 0 done
Rank 1 done
(source) [marksaroufim@devgpu005]~%
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162764
Approved by: https://github.com/ezyang, https://github.com/zou3519
(cherry picked from commit f7e83219619a05934a344ca699c33ee69d5a3642)
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
Reapply "Make functionalization `ViewMeta` serializable with pickle. (#143712)" (#163769)
NOTE: This is a re-export of https://github.com/pytorch/pytorch/pull/161994 ; the changes between these two PRs is exclusively to the buck/build files
(Summary from #161994 )
Attempted rebase of https://github.com/pytorch/pytorch/pull/143712.
This reverts commit 6c713ccb5e0df227dd5b630057cbccd373cbe7d6.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames Lucaskabela
imported-using-ghimport
Test Plan: Imported from OSS
Differential Revision: D81524507
Pulled By: Lucaskabela
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163769
Approved by: https://github.com/dolpm
(cherry picked from commit 7d710403b003e44bf31d367673a05468e49df75d)
Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
[Flex] Fix silent correctness w/ backpropping grads (#163677)
Fixes https://github.com/pytorch/pytorch/issues/162228
# Summary
The majority of our tests compile flex-attention in isolation. This means that for fake tensor propagation the input primals and all captured buffers don't do any intermediate computation below autograd. As a result, they happen by chance to match the `requires_grad`-ness of the eager implementation and this check will pass. However, if score_mod is the result of some other intermediate fake tensor prop, then it is not guaranteed to have accurate requires_grad-ness, which was happening here.
TL;DR: this was a belt-and-suspenders check that was actually harmful, and we should just let joint-graph tracing handle creating the correct joint graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163677
Approved by: https://github.com/ydwu4
(cherry picked from commit e2ce79e4cce5327b71fcf366fad1133030563285)
Co-authored-by: drisspg <drisspguessous@gmail.com>
[a2av] Separate in/out splits into two tensors (#163837)
Old signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor(a!) in_out_splits, str group_name)`
New signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name)`
i.e. split `in_out_splits` into IN tensor and OUT tensor so that we can define the TORCH_LIBRARY signature better.
Also to be in line with the 2D version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163837
Approved by: https://github.com/fduwjj
ghstack dependencies: #163886
(cherry picked from commit bbf8aa43efe755b9c310347b3780962fca85bf9c)
Co-authored-by: Ke Wen <kw2501@meta.com>
[Flex attention] Fix flex attention head broadcast (#163426)
Fixes part of #163314
In particular bug: **Bug 1: H=None Broadcasting Produces Incorrect Results**
This fixes a shape bug when slicing BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Because they lose that shape, even though the mask_mod remains "interpretable", the kernel's stride math reads wrong offsets, and we get silent numerical mismatches compared to regular SDPA, especially with single-position decoding / H broadcasting.
The B=None, H=None case works only by accident: with singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1`, and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn't move the pointer and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with wrong strides, which causes silent errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163426
Approved by: https://github.com/drisspg
(cherry picked from commit 1a42656d6c43a9bb7eb90c511884ce451d29422f)
Co-authored-by: Isalia20 <irakli.salia854@gmail.com>
Update Microsoft C++ Redistributable to the latest version (#161430)
Update Microsoft C++ Redistributable link to the latest version as one of the libraries used by AMD currently has a dependency on that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161430
Approved by: https://github.com/malfet
(cherry picked from commit 1330c638bef7fac64a42935b5a46ee32637ddd4d)
Co-authored-by: Saman Khatir <saman.khatir@amd.com>
[MPSHooks] Release pending command encoder (#164093)
Release the pending command encoder before returning a command buffer, as subsequent callers are very likely to allocate their own encoder, which results in the following runtime error
```
tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]:1090: failed assertion `A command encoder is already encoding to this command buffer'
```
Added regression test to `test_mps_extension`
Please note that `torch::mps::get_command_buffer()` should be called with the dispatch queue held, both before and after this change, but many implementations skip that.
Fixes https://github.com/pytorch/pytorch/issues/163721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164093
Approved by: https://github.com/atalman, https://github.com/Skylion007
(cherry picked from commit 8f32adc90a7fee83583c9ba89dbdfabb317e0452)
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
[SDPA] [MPS] Fixes regression in 2.8.0 for scaled_dot_product_attention using mps (#163598)
Fixes #163597
- Updates fast SDPA implementations to take in query tensor stride info similar to key and value instead of assuming stride.
- Updated tests with additional transpose/permutation layouts. New tests catch the regression.
### Benchmarking with script found in [implementation PR](https://github.com/pytorch/pytorch/pull/152781#:~:text=19.8%25%20speed%20improvement-,Script%20to%20get%20perf%3A,-import%20torch%0Aimport)
Times are averaged over 100000 iterations. This change should not have any significant performance difference. Tested on an M3 Pro
### Vector Fast Path (q_len=1, k_len=256)
- Before: 0.160 ms
- After: 0.157 ms
### Vector 2-pass (q_len=1, k_len=4096)
- Before: 0.342 ms
- After: 0.339 ms
### Vector Fast Path (q_len=8, k_len=256)
- Before: 0.228 ms
- After: 0.231 ms
### Vector 2-pass (q_len=8, k_len=4096)
- Before: 0.432 ms
- After: 0.436 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163598
Approved by: https://github.com/malfet
(cherry picked from commit 1c12d7416bc4f1cf0bc8a229e64169fc361b688e)
Co-authored-by: Vismai Khanderao <59114226+Vismai-Khanderao@users.noreply.github.com>
[AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1 (#163988)
See also #163972, which was intended to be this PR.
Triton (release/3.5.x) by default ships CUDA12.8 ptxas.
This PR tries to bundle a ptxas version for cuda13, so that it can help https://github.com/pytorch/pytorch/issues/163801 when users run on new devices like THOR and Spark.
Fixes https://github.com/pytorch/pytorch/issues/163801
Test Plan:
Check binary size increase against nightly or v2.9RC
Install the binary into a working THOR and GB200/GH100 machine (reproduce the original issue first on THOR), then install the binary built from this PR; we expect the issue to be gone without any additional user setting. Testing on GB200 is to ensure no regression.
Reference: https://github.com/pytorch/pytorch/pull/119750 and 5c814e2527
Note: with this PR, the pytorch world's torch.compile is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the cuda13 ptxas binary.
However, as is, the Triton world does not know about the existence of this new cuda13 ptxas. So if a user assumes there is already a pytorch/bin/ptxas and deletes the ptxas shipped with Triton, then c6ad34f7eb/python/triton/knobs.py (L216) would still complain that ptxas is not found (it won't know this new one is available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163988
Approved by: https://github.com/atalman
(cherry picked from commit 3b4ad4a17d69e2db495ecaf3bae8916282a4eb0d)
Co-authored-by: Wei Wang <weiwan@nvidia.com>
Add operator benchmarking run to CI nightly (#162530)
This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus more on different cuda hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.
**Benchmark Configuration and Coverage:**
* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward run. The configs with tag "long" for the above mentioned files are being run in CI.
* The CI benchmarking is running on various hardwares: H100, A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162530
Approved by: https://github.com/huydhn
(cherry picked from commit 54b38f3b46c33a1cc4e8f7894619358afcbd7c89)
Co-authored-by: jainapurva <apurvajain.kota@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
Update the operator benchmarking, to benchmark using torch.compile (#161394)
This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to the existing eager and JIT modes. It also adds peak memory measurement (fwd/bwd pass), improves the JSON output format to be used by the dashboard for reporting, and introduces some more CLI options. The new CLI flags introduced are:
- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name
Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser
(cherry picked from commit af60398c3a057506363e028bf328843a755b4f24)
Co-authored-by: jainapurva <apurvajain.kota@gmail.com>
[SymmMem] Barrier on team instead of world (#163298)
As titled. Avoiding a potential hang when running dispatch and combine in subgroups.
The rest is just a re-arrangement of the tests to create a sub-group test class (no substantial change).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163298
Approved by: https://github.com/fegin
(cherry picked from commit f8fb437197033c33ecc435cd5e1e6a5b2bc5bf69)
Co-authored-by: Ke Wen <kw2501@meta.com>
[SymmMem] Fix memory allocation hold-up (#162680)
Problem:
Without MemPool, it looks like the nvshmem backend never deallocates memory.
Cause:
Handles in `symm_mems_` (a map) keep references to memory allocations.
Solution:
- Remove reference to allocation from handles -- the reference is never used anyway.
- Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162680
Approved by: https://github.com/ezyang
ghstack dependencies: #163298
(cherry picked from commit 7130b174e07dbc1a708934b18dede3d88e8f779f)
Co-authored-by: Ke Wen <kw2501@meta.com>
[Inductor][Intel GPU] Save `threads_per_warp` from Triton compiled kernel for launching the kernel correctly in the cpp wrapper. (#163315)
On the Inductor XPU backend, `threads_per_warp` is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the Cpp wrapper can launch the kernel with the correct configuration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire
(cherry picked from commit 9f8a311af09586ac4026d6a56fc7c4ac7acc62ed)
Co-authored-by: xinan.lin <xinan.lin@intel.com>
[CI] Fix test_triton_wait_until hang (#163886)
I don't know why `nvshmem_barrier_all_kernel` leads the test to hang. Will investigate.
But since it is an unnecessary call here, I am removing it to unblock other PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163886
Approved by: https://github.com/fegin
(cherry picked from commit 96275dbf88372bb32a123c4ea918498128fbecb9)
Co-authored-by: Ke Wen <kw2501@meta.com>
[CD] Add statically linked windows libraries to exclude list (#163768)
Fixes: https://github.com/pytorch/pytorch/issues/159514
Seeing the following in the wheel build logs:
```
Linking CXX static library lib\kineto.lib
Linking CXX static library lib\dnnl.lib
....
```
These files are around 800MB uncompressed and 109MB compressed; excluding them provides a ~50% size reduction for Windows CPU builds.
Test Plan: Build Pytorch Windows binary. Build vision, audio and torchcodec with this binary. Smoke test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163768
Approved by: https://github.com/albanD, https://github.com/malfet
(cherry picked from commit 98c4e35f14601909c113b4fd2857b6f0fb525316)
Co-authored-by: atalman <atalman@fb.com>
[CI] Move Windows build/tests to Python-3.10 (#162862)
What was supposed to be a very simple change ended up being quite involved, as the current Windows CI framework is quite inflexible, i.e. it takes lots of arguments but later on ignores them, namely:
- `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
- With this change, the `setup-win` action will create an environment called `py_tmp` with the specific Python version + intel-openmp (which is a hard runtime requirement, but for some reason is not packaged into the wheel nor marked as such)
- Copied test type dependencies from be01a40157/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1 (L16) into `win-test.sh`, but made some adjustments to be compatible with the 3.10 runtime (scipy version update) and to make rerun-tests compatible with the rest of the deps
I think in the long run one needs to update 4432e2cacd/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1, which currently pins Miniconda Python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies every time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
ghstack dependencies: #163339, #163341
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
[BE] Introduce `CONDA_ROOT_DIR` (#163341)
Which equal to `%CONDA_PARENT_DIR%/Miniconda3`, and replace this pattern with `%CONDA_ROOT_DIR%` throughout the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163341
Approved by: https://github.com/clee2000
ghstack dependencies: #163339
(cherry picked from commit a273475b01e912f402378a522bb9c4ed37e8413a)
Co-authored-by: Nikita Shulga <nshulga@meta.com>
[export] Remove .contiguous() when saving weights to raw bytes (#163587)
Summary: `.contiguous()` will discard the original storage size of the tensor, and could lead to issues during loading.
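A minimal illustration of why this matters, using a simple sliced weight (a generic tensor/storage example, not the export serialization code path):
```python
import torch

weight = torch.randn(16, 64)
col = weight[:, 0]                                   # view sharing the full 16*64*4-byte buffer
print(col.untyped_storage().nbytes())                # 4096: the original storage size
print(col.contiguous().untyped_storage().nbytes())   # 64: .contiguous() materializes a new, smaller storage
```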
Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_1D_tensor_slicing
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_2D_tensor_slicing
Differential Revision: D83016250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163587
Approved by: https://github.com/angelayi
(cherry picked from commit 720a7b2887ca4efc8d63b32373182bc97918c76e)
Co-authored-by: Yiming Zhou <yimingzhou@meta.com>
Update pytorch_sphinx_theme2 to latest hash (#163269)
The updated theme:
- Fixes articleBody in the json+ld that caused previous Google Search issues
- Other minor fixes
- 404.html fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163269
Approved by: https://github.com/albanD
(cherry picked from commit 68e75be86ab618bb6b1dc32b603a780ff6046262)
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
[SymmMem] Fix put_signal + wait_until hang (#163194)
The test used the wrong pointers to refer to remote addresses:
```
dst_ptr = out_hdl.buffer_ptrs[peer]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[peer]
```
All three indices should be `rank` instead of `peer` because NVSHMEM APIs accept local addresses as input and perform the translation internally. Without the correct signal address, the peer would keep waiting, hence the hang.
Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensor instead of pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163194
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152
(cherry picked from commit 80f8be9840c20c3efe1274266b52ab098f4d1030)
Co-authored-by: Ke Wen <kw2501@meta.com>
[Graph Partition] improve custom op output alias (#163227)
For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1 # <--- if buf1 is not accessed in the future
```
If `buf1` is not accessed in the future, it's good to deallocate it early, so we don't delay `del` until both buf3 and buf4 are no longer used. Note that buf3 and buf4 hold references to the data, so `del buf1` does not prevent their use.
However, when there are mutating args, we don't see `del buf1` immediately.
```python
@torch.library.custom_op(
"mylib::op1",
mutates_args=["x"],
schema="(Tensor(a!)? x) -> (Tensor, Tensor)",
device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
x = x + 1
return (x + 1, x + 2)
```
<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />
Why? Because `buf3` is a MultiOutput with `buf1` as input and believes `buf1` (an output of FallbackKernel op1) has inputs that alias output.
72fedf0575/torch/_inductor/ir.py (L7976-L7982)
According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, buf1's output should NOT alias any of the inputs. This PR improves get_inputs_that_alias_output of FallbackKernel.
Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163227
Approved by: https://github.com/zou3519
(cherry picked from commit 4967ad8baa724b8b1acc123698bb1265723feb87)
Co-authored-by: Boyuan Feng <boyuan@meta.com>
Add decomp rule to assert_tensor_metadata for BatchedTensors (#163008)
Whenever there is a device move, export introduces the assert_tensor_metadata aten operator to make sure to guard for device specialization. This aten op didn't work with vmap because we didn't register an explicit decomp rule saying we just skip the BatchedTensor and call it on the underlying tensor.
Differential Revision: [D82483979](https://our.internmc.facebook.com/intern/diff/D82483979)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163008
Approved by: https://github.com/huydhn
(cherry picked from commit e28983be76aa4651e3cb69dc3a4234d75038d938)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
[SymmMem] Fix NVSHMEM plugin + Triton 3.5 (#163152)
1. The dispatch signatures defined in `core.extern_elementwise` call must match the C signature of the NVSHMEM functions, in particular the dtypes. Otherwise, there would be weird errors, such as IMA or hang. When matched, most of time the NVSHMEM device function will be inlined into the generated PTX. When not matched, it is represented as a function call in the PTX (not sure if it is the function call that goes wrong).
2. When calling the `core.extern` wrappers from the `triton.jit` kernels, the input must be cast to match the signatures defined in 1, e.g. via `nbytes.to(tl.int64)`. Otherwise, Triton will report a key error when searching for such kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163152
Approved by: https://github.com/ngimel
ghstack dependencies: #163025
(cherry picked from commit 57a54a04b6eb78e0aa7d13b48e25fb8c0c49fd60)
Co-authored-by: Ke Wen <kw2501@meta.com>
Support vmap + custom autograd function/improve DTensor constructor inefficiency (#162240)
This makes gemma3 exportable on transformers=4.55.4
In HF, there is a torch function mode called TransformGetItemToIndex which internally calls a custom autograd function. When this custom autograd function is called under vmap, it triggers CustomFunctionHigherOrderOP, which errored because there was no pre-dispatch proxy mode implementation.
Since there have been a number of requests lately to add various operators to the pre-dispatch IR, I introduce a decorator in export that works similarly to `allow_in_graph`. Basically:
1) We intercept custom_autograd_function.apply at pre-dispatch mode when this decorator is applied
2) We apply the `flat_apply` HOP to hide the pytree spec for this autograd function. Note that this adds a restriction that the custom autograd function needs to take in fx-able types.
3) The subclass constructor decorator is implemented similarly, so we just refactor it to use a similar implementation as this new decorator. Eventually we should delete the subclass constructor decorator.
4) Move some code in the subclass constructor decorator to exit early in non-export environments, which should shave off some inefficiency (around 1% according to @swolchok's benchmark)
Fixes: https://github.com/pytorch/pytorch/issues/161563#issuecomment-3246309758
Differential Revision: [D82141316](https://our.internmc.facebook.com/intern/diff/D82141316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162240
Approved by: https://github.com/ydwu4
(cherry picked from commit 463fbc8ca0537e5635236190d2ca38ce6fcef831)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
[CD] Aarch64 Fix packaging ``libarm_compute.so`` and other libraries to the aarch64 CUDA wheels (#162566)
Fixes aarch64 Linux packaging, which hits the following error:
https://github.com/pytorch/vision/actions/runs/17612462583/job/50037380487#step:15:62
```
Traceback (most recent call last):
File "/__w/vision/vision/pytorch/vision/setup.py", line 13, in <module>
import torch
File "/__w/_temp/conda_environment_17612462583/lib/python3.11/site-packages/torch/__init__.py", line 415, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: libarm_compute.so: cannot open shared object file: No such file or directory
```
Due to missing dependencies.
Current Error:
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl renamed as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
Hence the repackaging has no effect.
This PR does following
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl deleted
File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
It looks like after migrating from zipping the wheel to `wheel pack`, renaming the wheel is no longer necessary. Hence this removes the renaming and deletes the old file.
```
2025-09-10T10:10:05.9652454Z Using nvidia libs from pypi - skipping CUDA library bundling
2025-09-10T10:10:05.9656595Z Copying to /pytorch/dist/tmp/torch/lib/libgomp.so.1
2025-09-10T10:10:05.9873843Z Copying to /pytorch/dist/tmp/torch/lib/libgfortran.so.5
2025-09-10T10:10:06.0410041Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute.so
2025-09-10T10:10:06.2869242Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute_graph.so
2025-09-10T10:10:06.4385740Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_lp64_gomp.so.0
2025-09-10T10:10:06.5461372Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_lp64_gomp.so.0
2025-09-10T10:10:06.5728970Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_core.so.0
2025-09-10T10:10:06.6231872Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_core.so.0
2025-09-10T10:10:14.1503110Z Updated tag from Tag: cp310-cp310-linux_aarch64
2025-09-10T10:10:14.1503482Z to Tag: cp310-cp310-manylinux_2_28_aarch64
2025-09-10T10:10:14.1503682Z
2025-09-10T10:10:41.6498892Z Repacking wheel as /pytorch/dist/torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
2025-09-10T10:10:41.9394460Z Renaming torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl wheel to torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
```
Test Plan, Executed on local file:
```
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/WHEEL
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/entry_points.txt
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/top_level.txt
inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/RECORD
Bundling CUDA libraries with wheel
Updated tag from Tag: cp310-cp310-manylinux_2_28_aarch64
to Tag: cp310-cp310-manylinux_2_28_aarch64
Repacking wheel as ubuntu/dist/torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
Copying torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl to artifacts
Build Complete. Created torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl..
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162566
Approved by: https://github.com/jeanschmidt, https://github.com/NicolasHug
(cherry picked from commit 3d32bb114bf0d5bd0193dc40f20253635dddf080)
Co-authored-by: atalman <atalman@fb.com>
## Summary
This PR adds a missing `#include <fstream>` to fix a compilation error that occurred with the clang compiler on the standard *Google internal compile setup* (built with bazel).
## Details
The `std::ofstream` type was implicitly instantiated, which can cause compilation to fail with certain compilers. In this case, the clang compiler within the Google internal compile setup failed with an implicit instantiation error of `std::basic_ofstream<char>`. By explicitly including the `<fstream>` header, this PR resolves the error and ensures proper compilation in a wider range of setups and compilers.
## Error message:
```
torch/csrc/distributed/c10d/FlightRecorder.cpp:8:17: error: implicit instantiation of undefined template 'std::basic_ofstream<char>'
8 | std::ofstream file(filename_, std::ios::binary);
| ^
libcxx/include/__fwd/fstream.h:26:7: note: template is declared here
26 | class basic_ofstream;
| ^
1 error generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162421
Approved by: https://github.com/ezyang
Fixes #159590
This is similar to the reverted commit #156868, except it resolves an issue with two caches becoming misaligned, leading to incorrect objects for stateful placements (i.e. `_MaskPartial`) as in issue #159601. This adds little to no overhead in eager ([see past benchmarks](https://github.com/pytorch/pytorch/pull/156868#issuecomment-3047831149)).
This also handles cases such as #159590 where dynamo is disabled during tracing, by entering the Python Dispatcher ahead of the sharding propagation during compile. Tests are added/modified to handle these cases, and the list/tuple inputs with the cat op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160798
Approved by: https://github.com/bdhirsh
This PR is quite large in that it covers most of the rough edges in the new strict export flow:
1. Handle nn_module_stack correctly now that we are tracing wrapper module
2. module_call_spec needs to get queried from source directly because we are not running the bytecode anymore.
3. Correct input and output handling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162167
Summary:
When compiled code has a generator, code.co_firstlineno will be inconsistent with the result from inspect.getsource, which returns the toplevel enclosing code's source rather than the inner code location.
In this case, it seems simpler to just use the toplevel enclosing code location rather than the co_firstlineno field.
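A minimal illustration of the general mismatch, using a plain generator expression (the actual repro involves compiled code in torch.package; see the test below):
```python
import inspect

def outer():
    return (x * x for x in range(3))  # the genexpr compiles to a nested code object

gen_code = next(c for c in outer.__code__.co_consts if inspect.iscode(c))
print(outer.__code__.co_firstlineno)  # line of "def outer" -- matches the toplevel enclosing source
print(gen_code.co_firstlineno)        # line of the generator expression -- the inner code location
print(inspect.getsource(outer))       # returns the enclosing function's source
```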
Test Plan:
test_package.py -k test_code_with_generator
Rollback Plan:
Differential Revision: D81929751
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162389
Approved by: https://github.com/dolpm, https://github.com/hrithick-codes
[relanding again after fixing internal build]
Summary:
This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous(), but we want to find those call sites and handle them properly by explicitly calling is_contiguous_or_false() instead of is_contiguous() when appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context:
In https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous to be the function computing contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE.
When people call is_contiguous we do sym_is_contiguous().guard_bool().
When people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false().
One path that was not handled well was this one:
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
at::MemoryFormat memory_format) const {
if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
this, memory_format);
}
return sym_is_contiguous_default(memory_format);
}
```
Namely, if we call sym_is_contiguous_custom and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, then we used to call is_contiguous(this, memory_format).
This used to go through load_pyobj_interpreter and end up calling the Python is_contiguous, which used implicit size-oblivious reasoning.
Once we removed that implicit size-oblivious reasoning, the right thing is to call
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get a DDE even if the caller is doing sym_is_contiguous.
So I had to define it for the PyInterpreter, and then override it for nested tensors.
Approved by: https://github.com/ezyang
Test Plan:
contbuild & OSS CI, see e444cd24d4
Rollback Plan:
Differential Revision: D80435179
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869
Approved by: https://github.com/ezyang
# Summary
### Update
API
```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False


class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None


out_only = flex_attention(query, key, value, score_mod)
out_max, aux_max = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(max_scores=True),
)
out_both, aux_both = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(lse=True, max_scores=True),
)
```
Returns the max post mod scores from flex attention.
Not being able to break BC is kind of annoying here, since we end up with a combinatorial problem: if we need to add any more return values, we need new kwargs that gate whether they get returned by the function, and we need to support the 2**N possible return groups from the additional args.
Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I added a kwarg-only argument for now.
Maybe we make an `ExtraReturns`-type kwarg that can grow, so we don't need to keep adding new top-level args.
We could also return a struct that holds all the extra tensors and start a deprecation cycle for logsumexp, eventually returning just one `ExtraReturns`-like struct with the tensors.
### Req Grad
I currently don't return a max_scores that supports backpropagating grads. I think this might be feasible, but since max is essentially one-hot on the inputs and a reduction, we would either need to save another `max_location` from the forward, or find the max_score but only apply it to the first occurrence if there are multiple equivalent scores (need to check whether that's what we define for the vanilla max op in torch).
For now no grad, we can re-visit if needed.
## Perf
I am going to disable this for flex_decode, since at least initially the motivation is for training. It is also harder than it should be to have ops return None or optional tensors. If returning max is disabled, we should probably just create a tensor of size zero so that we don't slow down the hot path.
```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │
│ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │
│ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │
│ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │
│ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │
│ causal ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 161.031318 ┆ 158.597808 ┆ 2.43351 ┆ 1.534391 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘
🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, ┆ 175.546923 ┆ 177.81205 ┆ -2.265127 ┆ -1.273888 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4, ┆ 156.282597 ┆ 158.209134 ┆ -1.926537 ┆ -1.217715 │
│ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16, ┆ 232.542929 ┆ 235.140136 ┆ -2.597207 ┆ -1.104536 │
│ ┆ ┆ 2048, 128) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 169.652791 ┆ 171.475986 ┆ -1.823195 ┆ -1.063236 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
Summary: This PR introduces shape guards to export. Previously only value ranges, equalities, and specializations would be tracked for symbolic expressions, and we had a forward hook to check them. Instead now we create a function to check shape guards and call it in the exported program.
Test Plan:
updated several tests
Rollback Plan:
Differential Revision: D80713603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161178
Approved by: https://github.com/tugsbayasgalan
Summary:
A tool to track events in graph split, specifically how nodes end up in the acc or cpu subgraphs.
Usage: use env var to specify a mode and necessary arguments.
FX_NET_ACC_SPLITTER_TRACKER_MODE: Tracker mode.
```
Different modes of the event tracker:
"0": Tracker not enabled (by default)
"1": Tracker enabled but no dumps. Information available by setting breakpoints and visually inspect in pdb.
"2": Tracker enabled and dumps all events to DUMP_PREFIX_all.txt
"3": In addition to events dump, track nodes specified by ENV_FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES recusrively and dump to DUMP_PREFIX_nodex.txt
"4:: In addition to events dump, track all nodes with more than 1 event recusrively and dump to DUMP_PREFIX_nodex.txt
```
FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH: overriding dump path. Leave empty for `~`.
FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES: Nodes to track for mode "3".
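A hedged usage sketch based on the env vars documented above (the dump path is a placeholder):
```python
import os

# Enable the tracker before running the acc/cpu graph split.
os.environ["FX_NET_ACC_SPLITTER_TRACKER_MODE"] = "2"                 # dump all events
os.environ["FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH"] = "/tmp/split"   # override the dump prefix
# ... run the splitter as usual; events are written to /tmp/split_all.txt
```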
Test Plan: New unit test
Reviewed By: georgiaphillips
Differential Revision: D79203595
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159795
Approved by: https://github.com/ezyang
Fixes a few bugs introduced in CUDNN 1.11 which affect all our CUDA 13 builds. Also adds support for new CUDNN features whenever we choose to update. @eqy pretty sure this addresses the concern you had over the previous upgrade since that bugfix is now merged. This is a simple header-only update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162347
Approved by: https://github.com/eqy, https://github.com/atalman
F.one_hot(dtensor) used to run into a mixed DTensor-Tensor operation due
to an arange call creating a new Tensor (not DTensor). This PR fixes it
by allowing implicit replication of Tensors for the arange call and the
one consumer of the arange call (the at::eq call).
Test Plan:
- new test. Also, F.one_hot(num_classes=-1) is broken so we skip that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162307
Approved by: https://github.com/ezyang
ghstack dependencies: #162117
LOAF previously could skip these fusion opportunities and cause some tests to fail.
Test:
- TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162311
Approved by: https://github.com/jansel
Update PyTorch to the latest Triton release candidate branch (release/3.5.x in triton-lang/triton)
Notably:
* this does *not* include the version number bump from 3.4 -> 3.5 (we'll do that in a follow-up PR)
* sam_fast is still failing, so we've disabled it temporarily https://github.com/pytorch/pytorch/issues/162282 and we are committed to fixing it, ideally before the branch cut but possibly as a cherry-pick into the release branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162278
Approved by: https://github.com/atalman
ghstack dependencies: #162244, #162309
The original implementation set beta to 1, which caused the out tensor (C) to be added to the output. Thus, if the output is not initialized to zero beforehand, the result can be incorrect.
Removing the alpha and beta fixes the issue.
Thanks @ngimel for figuring out the root cause.
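For reference, the general gemm semantics behind the fix can be seen with `torch.addmm` (a hedged illustration, not the grouped-gemm kernel touched by this PR): out = beta * C + alpha * (A @ B), so beta=1 folds whatever the output buffer already holds into the result.
```python
import torch

A, B = torch.randn(4, 8), torch.randn(8, 4)
C = torch.empty(4, 4)                             # uninitialized output buffer
bad = torch.addmm(C, A, B, beta=1.0, alpha=1.0)   # whatever garbage C held leaks into the result
good = torch.addmm(C, A, B, beta=0.0, alpha=1.0)  # beta=0 ignores C, so this equals A @ B
print(torch.allclose(good, A @ B))                # True
```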
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162040
Approved by: https://github.com/danielvegamyhre
Fixes static cuda launcher after https://github.com/triton-lang/triton/pull/7866.
Static cuda launcher checks to make sure that no hook knobs are set (and if they are, it throws an error). But Triton has changed the semantics of hooks so that "empty hooks" are now represented by empty `HookChain`s instead of being represented by `None`. This PR changes the way we define "empty hooks" to account for HookChains.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162309
Approved by: https://github.com/aakhundov
ghstack dependencies: #162244
Follow-up to #161768.
Context: ProcessPool pickles the outputs before sending them back to the main process. Triton kernels have some un-pickleable fields, so `prepare_for_pickle()` is used to strip out those fields. Previously, in the standard case (without triton_bundler.py), `prepare_for_pickle()` would strip out the un-pickleable fields and they would never be added back after unpickling, because the un-pickleable fields were not actually needed after compilation finished.
In #161768 we updated `prepare_for_pickle` to also strip out the `fn._hash_lock` field, a newly added field in JITCallable instances which is a `threading.RLock()` and is not pickleable.
It turns out that we do need to restore the `fn._hash_lock` field, even in the non-triton_bundler case - the MultiKernel case uses the hash lock.
To do this, we add `restore_after_unpickle()`, which will restore the fields (or, if the old fields are not provided, initialize just the hash_lock).
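A generic sketch of the strip/restore pattern described above, using illustrative names (not the actual Inductor/Triton classes):
```python
import pickle
import threading

class KernelLike:
    def __init__(self):
        self._hash_lock = threading.RLock()  # an RLock is not pickleable

def prepare_for_pickle(kernel):
    # Strip the unpicklable field before crossing the ProcessPool boundary,
    # returning it so the caller can restore it later.
    old_lock, kernel._hash_lock = kernel._hash_lock, None
    return old_lock

def restore_after_unpickle(kernel, old_lock=None):
    # Restore the stripped field, or re-initialize a fresh lock if none was kept.
    kernel._hash_lock = old_lock if old_lock is not None else threading.RLock()

k = KernelLike()
prepare_for_pickle(k)
k2 = pickle.loads(pickle.dumps(k))
restore_after_unpickle(k2)
```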
Compile time benchmarks look good, maybe a very minor regression (see the comment below on the PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162244
Approved by: https://github.com/atalman
This PR hooks up the python wrapper inductor backend to aot_compile. This is *not* the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now.
In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162170
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162169
The goal of this PR stack is to be able to implement `aot_compile_module`, which AOT precompiles a torch.nn.Module.
Step 1 is a simple refactor to make CompileArtifacts itself the callable, which makes it easier to use directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162169
Approved by: https://github.com/zhxchen17
## Summary
This PR improves typing in ONNX-related modules by replacing TypeVar bound to Callable[..., Any] with ParamSpec to preserve parameter types and avoid type erasure in decorator functions.
## Changes
- `torch/onnx/_internal/exporter/_flags.py`: Replace TCallable TypeVar with ParamSpec
- `torch/onnx/ops/_impl.py`: Replace _T TypeVar with ParamSpec for _onnx_op decorator
- `torch/onnx/_internal/exporter/_torchlib/_torchlib_registry.py`: Replace _T TypeVar with ParamSpec
## Motivation
The previous implementation used TypeVar bound to Callable which erased parameter type information to Any. ParamSpec preserves the exact parameter types and return types, providing better type safety and IDE support.
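A minimal sketch of the pattern (not the exact decorators in these modules), requiring Python 3.10+ for `typing.ParamSpec`:
```python
from typing import Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")

def set_flag(fn: Callable[P, R]) -> Callable[P, R]:
    # With ParamSpec, the wrapper keeps fn's exact parameter and return types,
    # instead of collapsing them to Callable[..., Any].
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        return fn(*args, **kwargs)
    return wrapper
```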
## Testing
- Verified all changes compile and import correctly
- Created comprehensive test suite to validate ParamSpec functionality
- No linting errors introduced
- Maintains backward compatibility
Fixes #142306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162332
Approved by: https://github.com/Skylion007
This uses the same approach as building triton wheel where we publish a nightly wheel for vLLM whenever its pinned commit is updated. The key change is to use `pytorch/manylinux2_28-builder` as the base image to build vLLM, so there are a couple of changes on the vLLM Dockerfile used by lumen_cli
1. `pytorch/manylinux2_28-builder` is RedHat instead of Debian-based, so no apt-get
2. Fix a bug in `.github/actions/build-external-packages/action.yml` where `CUDA_VERSION` is not set correctly, preventing CUDA 12.9 build
3. Fix a bug in `.github/actions/build-external-packages/action.yml` where `TORCH_WHEELS_PATH` is not set correctly and always defaulted to `dist`
4. In vLLM Dockerfile, use the correct index for the selected CUDA version, i.e. https://download.pytorch.org/whl/nightly/cu12[89] for CUDA 12.[89]
5. Install torch, vision, audio in one command. Unlike the CI image `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm`, `pytorch/manylinux2_28-builder` doesn't have any torch dependencies preinstalled
6. Bump xformers version to 0.0.32.post2 now that PyTorch 2.8.0 has been landed on vLLM
We need to prepare 3 wheels for vLLM, xformers, and flashinfer-python. And I rename them in the same convention as PyTorch nightlies `MAJOR.MINOR.PATCH.devYYYYMMDD` so that vLLM nightlies will work with torch nightlies on the same date.
### Usage
* Install latest nightlies
```
pip install --pre torch torchvision torchaudio vllm xformers flashinfer_python \
--index-url https://download.pytorch.org/whl/nightly/cu129
```
* Install a specific version
```
pip install --pre torch==2.9.0.dev20250903 torchvision torchaudio \
vllm==1.0.0.dev20250903 \
xformers==0.0.33.dev20250903 \
flashinfer_python==0.2.14.dev20250903 \
--index-url https://download.pytorch.org/whl/nightly/cu129
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162000
Approved by: https://github.com/atalman
Summary:
A demo for creating AOTI delegate for NativeRT in OSS.
- It supports full graph lowering only.
- It leverages `executorch_call_delegate` HOP but doesn't rely on `executorch`.
- The delegate graph is obtained by tracing a `LoweredBackendModule` whose forward function calls `executorch_call_delegate`.
- The main difference between `executorch_call_delegate` and `aoti_call_delegate` is that the delegate graph from `executorch_call_delegate` doesn't have weights lifted as inputs.
- original_ep and delegate_ep are treated as flat EP dictionary and there is no nested structure.
- The naming contract is enforced by `model_name` and `backend_id`
Test Plan:
CI
Rollback Plan:
Differential Revision: D81641157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162285
Approved by: https://github.com/dolpm
I am unable to write a test that would fail here. The reason is that when we do _dynamo.disable(fn) in the compiled frame, the id of the disabled function changes, but currently we guard on the original function `fn`, whose id is not changing. This PR still guards on `fn.__code__`, just to be more precise.
Thanks to @thenumberouscode for pointing this out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162247
Approved by: https://github.com/StrongerXi, https://github.com/jansel
Summary:
If I have an EP that's exported on CPU and want to AOTI compile it for CUDA, I need to use `move_to_device_pass`.
But in `torch._inductor.aoti_compile_and_package()`, it directly uses the `example_inputs` attached to the EP, so we should move the example inputs as well if applicable.
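A hedged sketch of the intended flow (the toy module is a placeholder; `move_to_device_pass` lives in `torch.export.passes`):
```python
import torch
from torch.export import export
from torch.export.passes import move_to_device_pass

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

ep = export(M(), (torch.randn(4),))        # EP exported on CPU
ep_cuda = move_to_device_pass(ep, "cuda")  # retarget the EP to CUDA
# With this change, the example inputs attached to the EP end up on the target device as well,
# so torch._inductor.aoti_compile_and_package(ep_cuda) compiles against CUDA inputs.
```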
Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_move_device_example_inputs
Rollback Plan:
Differential Revision: D81812366
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162301
Approved by: https://github.com/angelayi
Skipping renaming causes wrong dependencies when mutations are involved.
Test:
CUDA_VISIBLE_DEVICES=4,5,6 TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/distributed/test_compute_comm_reordering.py TestComputeCommReorderingMultiProc.test_reorder_compute_for_overlap
Both the all-reduce and wait-tensor IR nodes contain a MutationBuffer for this test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162303
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162028, #162221
## Summary
- We just landed 2d-2d support for mxfp8 grouped gemm in FBGEMM: https://github.com/pytorch/FBGEMM/pull/4816
- This is needed for backward pass of mxfp8 MoE training with grouped gemms
- Changes:
- Add dispatching + input validation for mxfp8 grouped gemm in `torch._scaled_grouped_mm`
- Add meta registration input validation for mxfp8 grouped gemm, for composability with compile
- Add unit tests exercising torch._scaled_grouped_mm with mxfp8 inputs
- Bump FBGEMM third party submodule to include:
- https://github.com/pytorch/FBGEMM/pull/4816
- https://github.com/pytorch/FBGEMM/pull/4820
- https://github.com/pytorch/FBGEMM/pull/4821
- https://github.com/pytorch/FBGEMM/pull/4823
#### How fbgemm dependency was bumped
Documenting this since I haven't found it documented elsewhere:
- `cd ~/pytorch/third_party/fbgemm`
- `git fetch`
- `git checkout <hash>`
- `cd ~/pytorch`
- `git add third_party/fbgemm`
## Test plan
#### Test build
```
USE_FBGEMM_GENAI=1 python -m pip install --no-build-isolation -v -e .
...
Successfully installed torch-2.9.0a0+gitf5070f3
```
[full build log](https://www.internalfb.com/phabricator/paste/view/P1933787581)
#### Unit tests
```
pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm_
...
test/test_matmul_cuda.py ......... [100%]
============================================================== 9 passed, 1668 deselected in 5.34s ===============================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162209
Approved by: https://github.com/ngimel
# Feature
Currently, `torch._inductor.compile_aot` always uses the `WrapperFxCodegen` class. In contrast, Python and C++ codegen allow users to register custom backends. This PR brings that feature to FX codegen.
# Test plan
Added a CI test registering a custom FX backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162317
Approved by: https://github.com/jansel
When running a bazel build, we (Google) ran into the error below.
The `-Wctad-maybe-unsupported` warning is raised to an error and breaks the build in certain cases.
So we propose suppressing the warning to make the bazel build smoother.
This is the error message we got:
```
c10/util/IntrusiveList.h:166:12: error: 'std::reverse_iterator' may not intend to support class template argument deduction [-Werror,-Wctad-maybe-unsupported]
166 | return std::reverse_iterator{end()};
| ^
c10/test/util/IntrusiveList_test.cpp:24:18: note: in instantiation of member function 'c10::IntrusiveList<(anonymous namespace)::ListItem>::rbegin' requested here
24 | auto it = c1.rbegin();
| ^
c10/test/util/IntrusiveList_test.cpp:43:5: note: in instantiation of function template specialization '(anonymous namespace)::check_containers_equal<(anonymous namespace)::ListItem>' requested here
43 | check_containers_equal(l, v);
| ^
libcxx/include/__iterator/reverse_iterator.h:51:7: note: add a deduction guide to suppress this warning
51 | class reverse_iterator
| ^
1 error generated.
```
@haifeng-jin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162223
Approved by: https://github.com/ezyang
Fix the `DeviceMesh._flatten` docstring example of use. Alternative fix would be to replace `mesh_3d["dp", "cp"]` with `mesh_3d["cp", "tp"]`.
(I verified the fix using the `gloo` backend)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162277
Approved by: https://github.com/ezyang
# Summary
### Update
API
```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False


class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None


out_only = flex_attention(query, key, value, score_mod)

out_max, aux_max = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(max_scores=True),
)

out_both, aux_both = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(lse=True, max_scores=True),
)
```
Returns the max post mod scores from flex attention.
Not being able to break BC is kind of annoying here, since we end up with a combinatorial problem: whenever we need to add more return values, we need new kwargs that gate whether they get returned by the function, and we need to support the 2**N possible return groups.
Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I added a kwarg-only argument for now.
Maybe we make an `ExtraReturns`-type kwarg that can grow, so we don't need to keep adding new top-level args.
We could also return a struct that holds all the extra tensors and start a deprecation cycle for logsumexp, eventually returning just one `ExtraReturns`-like struct with the tensors.
### Req Grad
I currently don't return a max_scores that supports backpropagating grads. I think this might be feasible, but since max is essentially one-hot on the inputs plus a reduction, we would either need to save another `max_location` from the forward, or find the max_score and apply it only to the first occurrence when there are multiple equivalent scores (need to check whether that's what we define for the vanilla max op in torch).
For now no grad, we can re-visit if needed.
## Perf
I am going to disable this for flex_decode, since at least initially the motivation is for training. It is also harder than it should be to have ops return None or optional tensors; if returning max is disabled, we should probably just create a tensor of size zero so that we don't slow down the hot path.
```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │
│ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │
│ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │
│ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │
│ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │
│ causal ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 161.031318 ┆ 158.597808 ┆ 2.43351 ┆ 1.534391 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘
🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │
│ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, ┆ 175.546923 ┆ 177.81205 ┆ -2.265127 ┆ -1.273888 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4, ┆ 156.282597 ┆ 158.209134 ┆ -1.926537 ┆ -1.217715 │
│ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16, ┆ 232.542929 ┆ 235.140136 ┆ -2.597207 ┆ -1.104536 │
│ ┆ ┆ 2048, 128) ┆ ┆ ┆ ┆ │
│ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 169.652791 ┆ 171.475986 ┆ -1.823195 ┆ -1.063236 │
│ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
`vmap(F.embedding)(DTensor, DTensor)` was failing because F.embedding's
batching rule generates a new tensor via at::arange, at::arange
generates a regular tensor, and DTensor rightfully errors on mixed
DTensor-regular Tensor operations.
This PR fixes the problem by activating DTensor implicit replication on
just the at::arange and the subsequent add operation.
In order to accomplish this I move the DTensor implicit replication flag
to C++ (most batching rules are in C++).
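A minimal repro sketch of the now-working pattern (assumes a multi-GPU launch via torchrun with NCCL; shapes are illustrative):
```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, distribute_tensor

dist.init_process_group("nccl")
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

B = 2  # vmap batch dimension shared by both inputs
weight = distribute_tensor(torch.randn(B, 10, 4, device="cuda"), mesh, [Replicate()])
indices = distribute_tensor(torch.randint(0, 10, (B, 5), device="cuda"), mesh, [Replicate()])

# Previously errored with a mixed DTensor/Tensor message from the internal at::arange.
out = torch.vmap(F.embedding)(indices, weight)
```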
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162117
Approved by: https://github.com/bdhirsh
- Enable communication of tensors with Complex datatype in ProcessGroupGloo, similar to how ProcessGroupNCCL handles it.
- Move a function, which checks if Complex datatype is supported by a reduce operation, from ProcessGroupNCCL.cpp into a new file to be shared with ProcessGroupGloo.
Fixes #156632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156633
Approved by: https://github.com/d4l3k
Adding a test that is closer to real use case. Thanks @mlazos for fixing a few issues so this test works for most cases.
We still have to skip the AOTI and dynamic case due to accuracy issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160782
Approved by: https://github.com/mlazos
# why
- gather everything up to make choices, without running
potentially expensive generators
- enables overrides where we toss the entire list of configs
from inductor, without having to enumerate it (expensive)
# what
- add a holding class that just gets all the components necessary
to generate a ChoiceCaller
- use that class to generate ChoiceCallers
- this does not (yet) add the override function, but just prepares
the scene
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161347
Approved by: https://github.com/eellison
ghstack dependencies: #162075, #161340, #161341, #161342, #161343, #161344, #161345, #161346
# why
- heuristics providers now decide whether to add choices (and which ones)
in the max-autotune case
- enables an eventual override point to gracefully fallback to the
standard behavior
# what
- max-autotune is determined inside V.choices.get_mm_configs
because it's mm only right now, we can just do
`config.max_autotune or config.max_autotune_gemm`
a TODO indicates that this can change in the future when this
expands to more templates
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520573](https://our.internmc.facebook.com/intern/diff/D81520573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161344
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341, #161342, #161343
# why
- central point to analyze and override all generated choices
# what
- add a pseudo heuristic for aten that just yields a single, empty
kwargs
- add a pseudo heuristic with the bias_addmm logic for it
- add an addmm specific heuristic that yields a single choice, but
also expands it with alpha and beta kwargs
- replace all the aten.bind calls with V.choices.get_mm_configs
using the now matching API for aten
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520580](https://our.internmc.facebook.com/intern/diff/D81520580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161342
Approved by: https://github.com/jansel
ghstack dependencies: #162075, #161340, #161341
# why
- to have a central registry of templates/externkernelchoice
to match them to heuristics etc, they need unique names
- mm is both the triton template name and the aten_mm name
# what
- add a uid() to KernelTemplate/ExternKernelChoice that returns name
- override in ExternKernel to prepend "aten::"
- override in TritonTemplate to prepend "triton::"
This id is just used to find template heuristics, so it has no other
impact
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520579](https://our.internmc.facebook.com/intern/diff/D81520579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161341
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162075, #161340
# why
- a step towards a unified interface for all choices, where any
adjustment to nodes (e.g. unsqueezing) happens as part of
choice specific preprocessing, behind a common point
# what
- move the unsqueeze logic for triton nodes for scaled_mm inside
the new hookup for adjusting the kernel inputs for template
heuristics
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "scale"
```
Differential Revision: [D81520582](https://our.internmc.facebook.com/intern/diff/D81520582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161340
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162075
Summary:
Weight vector needs to be upcasted since some FP8 formats (like Float8_e4m3fn) don't have CPU implementations in PyTorch. Reference: https://docs.pytorch.org/docs/stable/tensors.html#id13
We will use FP32 for the scale vector multiplication and convert to the target dtype.
Upcasting helps with the following:
1. **Full CPU support**: `float32` has complete CPU kernel implementations for all operations
2. **Numerical stability**: `float32` provides more precision during intermediate calculations
3. **Compatibility**: Works across all devices (CPU/GPU) and PyTorch versions
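A minimal sketch of the upcast-then-downcast pattern described above (names are illustrative, not the PR's code):
```python
import torch

def apply_scale(weight: torch.Tensor, scale: torch.Tensor, target_dtype: torch.dtype) -> torch.Tensor:
    # Multiply in float32 (full CPU kernel coverage, extra precision), then
    # cast down to the target dtype, e.g. torch.float8_e4m3fn.
    return (weight.to(torch.float32) * scale.to(torch.float32)).to(target_dtype)

w = torch.randn(16, dtype=torch.bfloat16)
s = torch.full((16,), 0.5)
print(apply_scale(w, s, torch.float8_e4m3fn).dtype)  # torch.float8_e4m3fn
```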
Test Plan:
UTs
Rollback Plan:
Differential Revision: D81711093
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162202
Approved by: https://github.com/wwwjn
Avoid merges from the extra PGO key if the same source has a different rank. Unlikely to happen (needs a code hash match and the source variable type to change), but being safe.
Differential Revision: D81299840
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162097
Approved by: https://github.com/bobrenjc93
Summary: When running coordinate descent tuning, the logging is difficult to parse if the results are parallelized at all. This change includes the kernel name in each step so post-processing can unify the results, even if run in parallel.
Test Plan:
NFC. Just a logging change.
Rollback Plan:
Differential Revision: D80942794
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161409
Approved by: https://github.com/PaulZhang12
Summary:
The binary that torch is running inside of can be larger than needed, and in certain
situations this can cause a loss of memory.
Test Plan:
We've manually run tests via
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_WORKER_SUPPRESS_LOGGING=0
make mc8-train-publish-cint-datafm-toy -C
minimal_viable_ai/models/ifr_mtml/main_v1/ 2>&1 | tee ~/run_out
```
and overriding the binary used to be the built fbpkg in /packages.
We've also kicked off manual runs at
```
fire-feid-20250903-1051-ae8c6827
```
Which do show the binary running - https://fburl.com/scuba/procprint/e6lwv32m
Rollback Plan:
steps:
- jk.update:
jk: pytorch/compiler:subproc_worker_binary
constant_bool: null
consistent_pass_rate: null
fractional_host_rollout: null
sampling_rate: null
- manual.note:
content: ''
Differential Revision: D81616624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162093
Approved by: https://github.com/masnesral
This is a reposting of PR #128519.
This change is important to how we maintain PyTorch at Google.
From the previous PR:
"
This will make the script more flexible for the directory where it is executed.
...
We plan to use the deprecated_yaml from a blaze genrule that invokes pyi.py. As the input to pyi.py, genrule requires the input file to be explicitly listed out. When we feed the value of tools/autograd/deprecated.yaml to genrule, it fails to resolve, since tools/autograd is a package from blaze's perspective. Any file under a blaze package needs a proper blaze target to be accessed.
"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161772
Approved by: https://github.com/albanD
Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com>
Summary:
Fix memory leak in AOTI when calling `aoti_torch_as_strided`
If you have something like `AtenTensorHandle buf_handle` and you have allocated memory to it, you have to make it a `RAIIAtenTensorHandle` to release the ownership. Otherwise you have leaked the memory, because even when the program ends there is still a pointer pointing to the underlying storage of `buf_handle_restrided`, and the storage is never freed.
Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_pad_non_zero_memory_leak
```
Also verified by looking at `print(f"Allocated memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")`
Differential Revision: D81640339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162118
Approved by: https://github.com/angelayi
This should fix https://x.com/wightmanr/status/1953147089518772254?t=ng_R4t0-tRhO_qQE8NqOhw&s=19. Still working on adding a reasonable test.
You can see more of a description of the problem in the code comments. But the TLDR is that:
* When using DDPOptimizer, we partition the graph and compile several subgraphs. So 1 dynamo graphs becomes N AOT/inductor artifacts
* We have some existing logic to stash graph metadata (`fw_metadata`) in dynamo's TracingContext. When using DDPOptimizer, we generate one `fw_metadata` per **AOT** graph, and we stash it on the 1 TracingContext from dynamo. So we end up clobbering the `fw_metadata` for graph i-1 when AOT and inductor start compiling graph i
* This is normally ok, but it becomes a problem if inductor ever wants to read from this `fw_metadata` during **backward compilation**. Why? We (by default) compile the backwards lazily. So when using DDPOptimizer, we will compile backward graph N, then bw graph N-1, etc. But... at the time that we have started compiling bw graph N-1, its corresponding fw_metadata has already been clobbered! So we end up reusing graph N's metadata for all of our backward graph compilations. With donated buffer metadata, that means we end up donating and writing into incorrect input buffers
The fix that I added was to add more dedicated DDPOptimizer metadata into the TracingContext, so we can properly switch between these N different `fw_metadata` objects in the backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160745
Approved by: https://github.com/ezyang, https://github.com/zou3519
Summary:
X-link: https://github.com/pytorch/FBGEMM/pull/4775
Without this change, Arm64 OSS pytorch build with FBGEMM failed with the following error.
```
Undefined symbols for architecture arm64:
"fbgemm::FindMinMax(float const*, float*, float*, long long)", referenced from:
at::native::fbgemm_linear_int8_weight_fp32_activation(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&, at::Tensor const&) in QuantizedLinear.cpp.o
at::native::fbgemm_linear_quantize_weight(at::Tensor const&) in QuantizedLinear.cpp.o
PackedConvWeight<2>::apply_dynamic(at::Tensor const&, bool) in qconv_dynamic.cpp.o
PackedConvWeight<3>::apply_dynamic(at::Tensor const&, bool) in qconv_dynamic.cpp.o
at::Tensor PackedLinearWeight::apply_dynamic_impl<false>(at::Tensor, bool) in qlinear_dynamic.cpp.o
at::Tensor PackedLinearWeight::apply_dynamic_impl<true>(at::Tensor, bool) in qlinear_dynamic.cpp.o
ld: symbol(s) not found for architecture arm64
```
This change fixed the issue by moving FindMinMax's implementation from QuantUtilsAvx2.cc to QuantUtils.cc. FindMinMax is a platform-agnostic function with AVX2-specific optimizations so conceptually it can be put in QuantUtils.cc.
Test Plan:
With this change, Arm64 OSS pytorch built successfully with FBGEMM enabled.
Rollback Plan:
Reviewed By: q10
Differential Revision: D81052327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161527
Approved by: https://github.com/q10
…h.is_complex.
The PR proposes adding a simple, self-explanatory example to the documentation page. The example demonstrates the function's output for tensors with various data types, showing both True and False return values.
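A short example of the kind added (output comments reflect default dtypes):
```python
import torch

print(torch.is_complex(torch.tensor([1.0, 2.0])))                 # False (float32)
print(torch.is_complex(torch.tensor([1 + 2j, 3 + 4j])))           # True  (complex64)
print(torch.is_complex(torch.tensor([1, 2], dtype=torch.int64)))  # False (int64)
```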
Fixes #161859
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161951
Approved by: https://github.com/zou3519
Fixes#161080
torch.export.export fails with `TypeError: expand() got an unexpected keyword argument 'implicit'` when calling `torch.expand_copy(..., implicit=True)`. This happened because `expand_copy = _make_copy_from_view(aten.expand)` registers `aten.expand` as the decomposition path for `aten.expand_copy`, which doesn't accept the `implicit` argument.
I have added an explicit decomposition for `aten.expand_copy` in `torch/_decomp/decompositions.py` that ignores the `implicit` argument, and a simple unit test to demonstrate the bug being fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161688
Approved by: https://github.com/angelayi, https://github.com/can-gaa-hou
Summary:
Enables `torch.float32` and `torch.float16` options in
`torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`,
`mat_b`, and `out_dtype` are `torch.bfloat16`.
Saving for future PRs:
1. enabling testing on more platforms
2. supporting out_dtype != mat_a.dtype
3. opinfo
4. better compile support
Test Plan:
```bash
// on A100 and H100
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
// on H100
pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162059
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #161407, #161717
Summary:
Moves the `torch._grouped_mm` fallback from cuda-only code to a place
where it can be used by multiple backends. Specifically:
1. make the fallback path and util functions reusable and move them to
`ATen/native/GroupedMMUtils.h`
2. register a backend-agnostic kernel to composite explicit autograd key
3. refactor the grouped_mm tests to their own test case and enable CPU
At the end of this PR, here is the support matrix:
* CUDA SM90+: fast path with test coverage (no change)
* CUDA SM80+: fallback with test coverage (no change)
* CPU: fallback works, but without test coverage (new in this PR)
* other SM versions and other backends: will probably already work, but
let's leave this to future PRs
* float32/float16: will probably already work, but let's leave this to
future PRs
Test Plan:
```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161717
Approved by: https://github.com/ngimel, https://github.com/drisspg
ghstack dependencies: #161407
Summary:
Creates a fallback path for `torch._grouped_mm`, using the naive for
loop implementation (or bmm).
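For intuition, a naive 2D-3D grouped mm in Python looks roughly like this (illustrative only, not the actual fallback code, which may also use bmm):
```python
import torch

def naive_grouped_mm_2d_3d(mat_a: torch.Tensor, mat_b: torch.Tensor, offs: torch.Tensor) -> torch.Tensor:
    # mat_a: [total_M, K]; mat_b: [G, K, N]; offs: cumulative row offsets into mat_a, one per group
    outs, start = [], 0
    for g, end in enumerate(offs.tolist()):
        outs.append(mat_a[start:end] @ mat_b[g])
        start = end
    return torch.cat(outs, dim=0)

a, b = torch.randn(6, 4), torch.randn(2, 4, 3)
print(naive_grouped_mm_2d_3d(a, b, torch.tensor([2, 6])).shape)  # torch.Size([6, 3])
```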
For the sake of keeping the PR small, this PR only enables SM80+ (CUDA
capability 8.0 and up), since I am testing this on an A100 machine. In
future PRs, we can increase the coverage of the fallback to:
1. float32 and float16, which will extend the GPU coverage
2. cpu
Test Plan:
```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_3d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_3d -x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161407
Approved by: https://github.com/drisspg, https://github.com/eqy
## Introduction
During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG (we call it the **capturing graph**) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture.
This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.
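A sketch of opting in, assuming the flag is exposed through the standard CUDA allocator settings string like other caching-allocator options (and a build that contains this PR):
```python
import os

# Must be set before the CUDA caching allocator is initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "graph_capture_record_stream_reuse:True"

import torch  # noqa: E402
```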
## Terms
* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.
## When can we reuse a block during capture?
### Strong Rule (Graph-Wide Safety)
This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.
> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.
Why it's safe:
This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.
### Per-stream Rule (A Practical Optimization)
The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.
In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.
> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.
In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.
## Implementation
* On `free(block)` during capture
* For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
* If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
* Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
* Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
* For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
* If yes, hand the block to S for immediate reuse within the same capture.
* If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
* Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.
## Examples (2 streams)
<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />
* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.
## Edge Case: Freeing after a join
Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join; see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198).
In case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that its actual last use may have occurred much earlier, long before the join, so we must wait for the subsequent join before the block can be reused.
## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel, https://github.com/eqy
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
This PR implements the semantics change to `torch._dynamo.error_on_graph_break`:
- ~`torch.compile` now has a new `error_on_graph_break` kwarg that serves as a lower-priority toggle for erroring/continuing on graph breaks~
- `error_on_graph_break` is a new internal `torch.compile `setting that is lower-priority than `fullgraph`. It allows the user to toggle erroring/continuing on graph breaks.
- `error_on_graph_break` does nothing when `fullgraph=True`
- `error_on_graph_break` does NOT guarantee a single graph
Followup [DONE]: need to change the programming model docs to reflect the 3 graph break modes for compilation:
- `fullgraph=True`: enforce one graph, no graph breaks, cannot be toggled
- `fullgraph=False, error_on_graph_break=True`: errors on graph breaks, latter can be toggled during compile time
- `fullgraph=False, error_on_graph_break=False`: resumes tracing on graph breaks, latter can be toggled during compile time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161747
Approved by: https://github.com/mlazos
ghstack dependencies: #161739
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@83c5a5](83c5a5a551), includes:
- Revert "Disable xccl timer avoid drlm hang" because XPU time event issue has been fixed
- Fallback lu_factor kernel to CPU for single batch
- Enable aten::linalg_inv and aten::linalg_inv_ex on XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162062
Approved by: https://github.com/EikanWang
`CMAKE_PREFIX_PATH` is a list of paths used to find dependencies. The test overwrites that with a single path causing dependencies such as protobuf or Abseil not being found.
Instead prepend the path to the existing value.
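A sketch of the prepend pattern (the `extra_path` value is hypothetical):
```python
import os

extra_path = "/path/to/libtorch/cmake"  # hypothetical
existing = os.environ.get("CMAKE_PREFIX_PATH", "")
os.environ["CMAKE_PREFIX_PATH"] = os.pathsep.join(p for p in (extra_path, existing) if p)
```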
This fixes a test failure:
> pytorch-v2.7.1/test/inductor/test_aot_inductor_package.py", line 242, in test_compile_after_package
> self.assertTrue(so_path.exists())
> AssertionError: False is not true
Caused by:
```
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::utility: No such file or directory
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::variant: No such file or directory
collect2: error: ld returned 1 exit status
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161907
Approved by: https://github.com/Skylion007
Many users want a config that forces all CUDA ops to be captured by cudagraphs; when that is not possible, PT2 should error.
This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default: False). It also adds an environment variable `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR` to control it.
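A minimal sketch of enabling it via the environment variable named above (set before inductor's config is loaded):
```python
import os

os.environ["TORCHINDUCTOR_CUDAGRAPH_OR_ERROR"] = "1"

import torch  # noqa: E402
```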
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161862
Approved by: https://github.com/ezyang, https://github.com/mlazos
On Zen 2 (AMD EPYC) and Intel Sapphire Rapids this fails with small differences when compiled with native targeted optimizations. I.e. it fails with `-march=znver2` but succeeds with `-march=znver1`.
I assume some operator fusing is being used by GCC. Small differences like using `vmovdqa` can be seen in the minimized code of the baddbmm kernel: https://godbolt.org/z/jsxMa91Wb
The greatest differences are consistent and the same on both CPU architectures:
```
Greatest absolute difference: 3.43852152582258e-05 at index (1, 2, 1) (up to 1e-05 allowed)
Greatest relative difference: 3.6034286949870875e-06 at index (1, 2, 1) (up to 1.3e-06 allowed)
```
Hence I assume this is within the expected tolerances, especially as `complex128` and all other types pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152424
Approved by: https://github.com/malfet
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR works on some test files under test/distributed. We could enable Intel GPU with the following methods, trying our best to keep the original code style:
- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use requires_accelerator_dist_backend to allow both nccl and xccl test
- enabled XPU for some test path
- Change the hardcoded world_size according to device_count.
- Unify some common code under torch/testing/_internal for multiple backends, for example:
added xpu to Backend.backend_capability and dist.Backend.register_backend()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Summary:
When we have a user defined triton kernel, it marks the mutated outputs as `MutationOutput` with a NoneLayout. This MutationOutput may later be used as input to another inductor-generated triton kernel.
When we determine whether to use int32 or int64 for the inductor-generated triton kernel, we need to look at the number of elements for all buffers involved. If one of the buffers is a MutationOutput, we should still consider its number of elements, instead of skipping it.
To get a hint on the MutationOutput size, we look at the buffers corresponding to `mutation_names` in MutationOutput.
Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_autotune_int64_user_defined_triton_kernel
```
Differential Revision: D81530083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162020
Approved by: https://github.com/davidberard98, https://github.com/eellison
Summary: This is a reland of D80285441, fixed the unit test.
Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
will succeed after this diff.
Rollback Plan:
Differential Revision: D80971224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161521
Approved by: https://github.com/frank-wei
Print out the amp target dtype so custom backends can more easily find the expected dtype during integration.
## Test Result
### Before
```python
In [1]: import torch
...: import torch_openreg
...:
...: a = torch.randn(3, 4)
...: b = torch.randn(4, 2)
...: with torch.autocast("openreg", dtype=torch.float16):
...: torch.mm(a, b)
...:
/home/coder/code/pytorch/torch/amp/autocast_mode.py:332: UserWarning: In openreg autocast, but the target dtype is not supported. Disabling autocast.
openreg Autocast only supports dtypes of torch.float32 currently.
warnings.warn(error_message
```
### After
```python
In [1]: import torch
...: import torch_openreg
...:
...: a = torch.randn(3, 4)
...: b = torch.randn(4, 2)
...: with torch.autocast("openreg", dtype=torch.float16):
...: torch.mm(a, b)
...:
/home/coder/code/pytorch/torch/amp/autocast_mode.py:332: UserWarning: In openreg autocast, but the target dtype torch.float16 is not supported. Disabling autocast.
openreg Autocast only supports dtypes of torch.float32 currently.
warnings.warn(error_message)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162037
Approved by: https://github.com/zou3519
In this PR, we port test/distributed/tensor test files to Intel GPU.
We could enable Intel GPU with the following methods, trying our best to keep the original code style:
- Use torch.accelerator for general GPU support
- Skip cases with known issues when running on XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161604
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Previously, gh-83069 introduced a normalization step in the toDLPack converter that changes the strides to 1 when shape[i] == 1.
This step, however, calls as_strided during toDLPack and can slow toDLPack down by about 3x, raising the overhead of PyTorch's DLPack conversion to around 0.6 us per call from under 0.2 us.
This PR updates the logic by adding a need_normalize_strides check to first confirm whether the strides normalization is necessary. In the most common case, when the tensor is contiguous, such normalization is not necessary.
We confirmed that this additional check brings the speed of toDLPack back below 0.2 us and can significantly speed up eager-mode integration of DLPack with PyTorch.
If we detect that normalization is needed, the older path is invoked.
Fixes #162113
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162111
Approved by: https://github.com/msaroufim
## Summary
Adds a subgraph decomposition for addmm and mm that performs well when `K` is large compared to `M` and `N`, and functions well as an alternative to `split-k` (which currently does not support AMD) for the transposed case on AMD.
## Background
On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower.
For example:
```
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176]))
))
```
is a lot slower than:
```
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1]))
))
```
This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels.
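Conceptually, the decomposition does something like the following (a sketch of the idea, not the inductor implementation), and autotuning then decides whether it beats the strided kernels:
```python
import torch

def contiguous_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Copy the non-contiguous operand to a contiguous layout before the matmul.
    return torch.mm(a, b.contiguous())

def contiguous_addmm(bias: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.addmm(bias, a, b.contiguous())
```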
## Data
I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups:
```
Parsed 420 unique shapes from benchmark output
addmm improvements when best:
addmm_16448x512x2048: +0.14%
addmm_128x2048x2048: +0.01%
addmm_128x768x1000: +0.75%
addmm_12672x3072x768: +1.08%
addmm_512x768x32000: +0.62%
addmm_12608x384x384: +0.00%
addmm_4160x1024x4096: +0.90%
addmm_16x768x2: +0.56%
addmm_12608x3072x768: +0.09%
addmm_64x4096x1000: +2.77%
addmm_256x1024x512: +1.99%
addmm_30x256x256: +1.12%
addmm_100480x128x384: +0.91%
addmm_6400x2048x512: +0.25%
addmm_61568x1024x256: +0.08%
addmm_1x768x768: +0.93%
addmm_12544x384x384: +0.19%
addmm_128x512x1000: +0.77%
addmm_2048x128x128: +1.32%
addmm_128x3072x1000: +0.24%
addmm_7936x512x2048: +0.07%
addmm_8192x512x2048: +0.33%
addmm_64x1024x1000: +1.43%
addmm_128x2304x1000: +0.01%
addmm_32768x256x2: +0.75%
addmm_64x384x1152: +0.79%
addmm_64x640x1000: +0.01%
addmm_100480x128x128: +0.87%
addmm_1152x3072x768: +1.13%
addmm_8192x256x2048: +1.40%
addmm_4096x128x768: +0.01%
addmm_128x2560x1000: +0.01%
addmm_12544x2048x512: +0.43%
addmm_200704x24x96: +0.14%
addmm_8448x512x2048: +0.96%
addmm_50176x256x1024: +0.62%
addmm_4160x4096x1024: +0.22%
addmm_4096x768x768: +0.32%
addmm_220x2048x512: +0.56%
addmm_8x2048x1000: +1.12%
addmm_256x197951x512: +26.99%
addmm_401536x64x192: +0.60%
addmm_2040x2048x512: +0.47%
addmm_512x1024x256: +1.32%
addmm_128x4096x1000: +1.67%
addmm_12672x768x768: +0.34%
addmm_128x368x1000: +0.77%
addmm_96x1280x1000: +0.01%
addmm_12544x512x2048: +0.41%
addmm_6272x320x1280: +0.76%
addmm_12544x3072x768: +0.09%
addmm_64x384x1000: +0.39%
mm improvements when best:
mm_200704x128x512: +1.29%
mm_663552x16x16: +0.80%
mm_4096x768x768: +0.51%
mm_131072x64x31: +0.24%
mm_12544x1152x384: +0.11%
mm_128x2048x2: +0.46%
mm_262144x16x23: +0.62%
mm_50176x576x192: +0.37%
mm_131072x16x31: +0.26%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 247
Average Subgraph placement: 3.38
Median Subgraph placement: 2.0
Subgraph is best choice: 52/247 shapes (21.1%)
Average improvement when best: 1.15%
Median improvement when best: 0.58%
Largest improvement when best: +26.99%
Operation: bmm
----------------------------------------
Total shapes analyzed: 85
Average Subgraph placement: 24.00
Median Subgraph placement: 21.0
Subgraph is best choice: 0/85 shapes (0.0%)
Average improvement when best: N/A (never best)
Median improvement when best: N/A (never best)
Largest improvement when best: N/A (never best)
Operation: mm
----------------------------------------
Total shapes analyzed: 88
Average Subgraph placement: 15.08
Median Subgraph placement: 4.0
Subgraph is best choice: 9/88 shapes (10.2%)
Average improvement when best: 0.52%
Median improvement when best: 0.46%
Largest improvement when best: +1.29%
```
## Results
The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune:
```
addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436
addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702
addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834
addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105
...
```
Compared to the non-transposed autotune:
```
addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421
addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246
addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547
addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895
addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916
```
It seems to perform really well for high values of `K` vs `N` and `M`.
Testing this hypothesis with some custom shapes:
```
Parsed 64 unique shapes from benchmark output
addmm improvements when best:
addmm_128x16384x128: +0.18%
addmm_128x262144x256: +38.24%
addmm_128x200000x512: +14.76%
addmm_256x800000x128: +0.06%
addmm_131072x128x256: +0.27%
addmm_128x256x131072: +0.25%
addmm_2048x200000x64: +12.45%
mm improvements when best:
mm_128x16384x128: +0.18%
mm_128x262144x256: +38.05%
mm_128x200000x512: +9.47%
mm_256x800000x128: +0.99%
mm_512x6400000x256: +3.17%
mm_524288x64x64: +0.29%
mm_2048x200000x64: +11.19%
mm_8192x1000000x256: +34.14%
mm_128x4096x100000: +0.40%
mm_128x3072x150000: +0.27%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 33
Average Subgraph placement: 4.39
Median Subgraph placement: 2.0
Subgraph is best choice: 7/33 shapes (21.2%)
Average improvement when best: 9.46%
Median improvement when best: 0.27%
Largest improvement when best: +38.24%
Operation: mm
----------------------------------------
Total shapes analyzed: 30
Average Subgraph placement: 7.63
Median Subgraph placement: 2.0
Subgraph is best choice: 10/30 shapes (33.3%)
Average improvement when best: 9.81%
Median improvement when best: 2.08%
Largest improvement when best: +38.05%
```
## Conclusion
Contiguous subgraph decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and shows a very large improvement on low-`M`, low-`N`, high-`K` shapes.
Data gathering scripts:
https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866
## Test Plan:
New unit tests.
Differential Revision: D80771648
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161241
Approved by: https://github.com/eellison
This only covers user outputs, which is what we want. I spoke to @zhxchen17, though, and it seems like nativeRT might have some bugs in propagating updates to things like input mutation or buffer mutation. Something to take a look at in a follow-up.
Also I have no idea where the nativeRT tests are. Any pointers @zhxchen17 @SherlockNoMad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161723
Approved by: https://github.com/zhxchen17
Renaming `set_fullgraph` to `error_on_graph_break` for now. There are no semantic differences yet. In a followup PR, we will introduce a new `torch.compile` option `error_on_graph_break` that has lower priority than `fullgraph` so that `fullgraph` really returns 1 graph.
I could keep `set_fullgraph` as a deprecated alias for `error_on_graph_break` for now, but I'm hoping that won't be necessary since it's still private API (there are no internal callsites yet, and there are no significant OSS callsites yet).
cc @albanD @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @Lucaskabela @mlazos @guilhermeleobas @xmfan as primary users for `set_fullgraph`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161739
Approved by: https://github.com/xmfan, https://github.com/Lucaskabela, https://github.com/anijain2305, https://github.com/mlazos
This PR introduces the QuantizedHuggingFaceReader component, which enables reading and dequantizing the quantized tensors in a SafeTensors checkpoint (a sketch of the dequantization idea follows below). The following capabilities are introduced:
- Configuration of the target dtype and the block size.
- Multi-threaded dequantization for efficiency.
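An illustrative sketch of block-wise dequantization (names and tensor layout are assumptions, not the QuantizedHuggingFaceReader API):
```python
import torch

def dequantize_blockwise(q: torch.Tensor, scales: torch.Tensor, block_size: int,
                         target_dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    # q: [N] quantized values; scales: [N // block_size] per-block scales.
    blocks = q.to(torch.float32).reshape(-1, block_size)
    return (blocks * scales.reshape(-1, 1)).reshape(-1).to(target_dtype)

q = torch.randint(-8, 8, (16,), dtype=torch.int8)
print(dequantize_blockwise(q, torch.tensor([0.5, 0.25]), block_size=8).dtype)  # torch.bfloat16
```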
Test Plan:
buck test //caffe2/test/distributed/checkpoint\:test_quantized_hf_storage
```
Time elapsed: 2:34.1s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D80174674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160682
Approved by: https://github.com/ankitageorge
Summary:
[reland]
Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept).
Test Plan:
updated tests
Rollback Plan:
Differential Revision: D81334984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161794
Approved by: https://github.com/zhxchen17
pytest test/dynamo/test_aot_compile.py -k test_aot_compile_graph_break_error_fmt
before
```
Traceback (most recent call last):
File "/data/users/$USER/vllm-tests/graph-break.py", line 15, in <module>
aot_compiled_fn = compiled.aot_compile((example_inputs, {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 717, in aot_compile
return aot_compile_fullgraph(
^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/aot_compile.py", line 132, in aot_compile_fullgraph
capture_output = convert_frame.fullgraph_capture(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 947, in fullgraph_capture
dynamo_output = compile_frame(
^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 1020, in compile_frame
bytecode, tracer_output = transform_code_object(code, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/bytecode_transformation.py", line 1592, in transform_code_object
tracer_output = transformations(instructions, code_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 992, in transform
tracer_output = trace_frame(
^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 312, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 821, in trace_frame
run_tracer()
File "/data/users/$USER/pytorch/torch/_dynamo/convert_frame.py", line 803, in run_tracer
tracer.run()
File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1472, in run
while self.step():
^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1342, in step
self.dispatch_table[inst.opcode](self, inst)
File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 902, in wrapper
return inner_fn(self, inst)
^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 3364, in CALL
self._call(inst)
File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 3358, in _call
self.call_function(fn, args, kwargs)
File "/data/users/$USER/pytorch/torch/_dynamo/symbolic_convert.py", line 1260, in call_function
self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/variables/lazy.py", line 212, in realize_and_forward
return getattr(self.realize(), name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/variables/functions.py", line 1513, in call_function
unimplemented_v2(
File "/data/users/$USER/pytorch/torch/_dynamo/exc.py", line 596, in unimplemented_v2
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
Explanation: User-inserted graph break. Message: None
Hint: Remove the `torch._dynamo.graph_break()` call.
Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`
For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html
```
after
```
Traceback (most recent call last):
File "/data/users/$USER/vllm-tests/graph-break.py", line 15, in <module>
aot_compiled_fn = compiled.aot_compile((example_inputs, {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 737, in aot_compile
raise e.with_traceback(None) from e.__cause__ # User compiler error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
Explanation: User-inserted graph break. Message: None
Hint: Remove the `torch._dynamo.graph_break()` call.
Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`
For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html
from user code:
File "/data/users/$USER/vllm-tests/graph-break.py", line 5, in foo
torch._dynamo.graph_break()
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
consistent w/ std torch.compile
```
Traceback (most recent call last):
File "/data/users/$USER/vllm-tests/graph-break.py", line 16, in <module>
res = compiled(*example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/$USER/pytorch/torch/_dynamo/eval_frame.py", line 850, in compile_wrapper
raise e.with_traceback(None) from e.__cause__ # User compiler error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.Unsupported: Call to `torch._dynamo.graph_break()`
Explanation: User-inserted graph break. Message: None
Hint: Remove the `torch._dynamo.graph_break()` call.
Developer debug context: Called `torch._dynamo.graph_break()` with args `[]`, kwargs `{}`
For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0025.html
from user code:
File "/data/users/$USER/vllm-tests/graph-break.py", line 5, in foo
torch._dynamo.graph_break()
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162005
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
Reason:
ROCm binary builds should not block the viable/strict upgrade. They are queuing/getting canceled, so viable/strict is 1.2 days old.
Tested by mangling the workflow file to get to the actual call of the python script `python ../test-infra/tools/scripts/fetch_latest_green_commit.py --required-checks '["pull", "trunk", "lint", "^linux-binary-manywheel$", "^linux-binary-libtorch-release$", "linux-aarch64"]' --viable-strict-branch viable/strict --main-branch master`, which I then ran locally where I have credentials. It returned d64718503728001a1e78168fd7f2d4ff23e57285 which is green. Without this change, it returns 5e5870e858f60ff4bf87d03f3592097e934a9580, which is pretty old
The other solution would have been to mark it as unstable I think
Side note, why is it master and how is it working like that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162100
Approved by: https://github.com/huydhn
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b:
* Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements
- AITER ASM kernels deliver over 500TFLOPS training performance. See
[AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more
details.
* Now returns a natural-base `logsumexp` tensor, matching CUDA's behavior
- PR #156903 is reverted in this PR as well since it is not needed anymore.
* Enables `CausalVariant.LOWER_RIGHT`
The build system changes drastically along with new packaging scheme of
AOTriton 0.11
* AOTriton 0.11 packs GPU images separately from AOTriton runtime
* `aotriton.cmake` now selectively downloads image packs according to
`PYTORCH_ROCM_ARCH`
* `aotriton.cmake` now only uses a pre-compiled runtime library that exactly
matches the ROCM in the build environment. For PyTorch builds with ROCm
versions not listed in the file, the build process will build AOTriton
runtime without GPU images from source
- This avoids any further ABI breaks like ROCM 6.4 -> 7.0
- recursive git clone is disabled since building AOTriton runtime does not
require submodules.
Bug fixes:
* Fix a kernel bug introduced when implementing SWA
Known Problems:
* gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status
due to accuracy issues. Triton compiler fixes are needed to restore the
support status.
* Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCM 7.0.
This issue is under investigation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161754
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Summary:
The saving (serialization) part of PT2 archive weight refactoring.
The loading (deserialization) part has been landed in D80035490.
Test Plan:
CI
Rollback Plan:
Differential Revision: D80970931
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161520
Approved by: https://github.com/SherlockNoMad
**Summary:** Previously, we called `assert_and_get_unqiue_device` once per node in both prepare and convert. This is expensive and unnecessary since the model device is the same across all nodes, so we should just call it once at the beginning and reuse the same model device across all the nodes.
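As a rough illustration of the caching pattern (a minimal sketch with a hypothetical helper, not the actual prepare/convert code):
```python
# Minimal sketch of the idea with a hypothetical helper: derive the model
# device once up front, then reuse it for every node instead of re-querying.
import torch

def get_unique_device(model: torch.nn.Module) -> torch.device:
    devices = {p.device for p in model.parameters()} | {b.device for b in model.buffers()}
    assert len(devices) <= 1, f"expected at most one device, got {devices}"
    return devices.pop() if devices else torch.device("cpu")

model = torch.nn.Linear(4, 4)
model_device = get_unique_device(model)  # called once
# ... every node-level transformation now reuses `model_device` ...
```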
**Test Plan:**
python test/test_quantization.py -k TestQuantizePT2E
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159901
Approved by: https://github.com/jerryzh168
Summary:
Our strategy for detecting fake tensor leakage in non-strict export for the outside scope (side effects happening outside of model.forward) is:
1. We do gc.collect() before export and get the alive fake tensors
2. We dump the proxy to fake tensor map from make_fx tracer
3. We query gc again to get alive fake tensors
4. We take the delta between (1) and (3)
5. Filter out fake tensors that are:
1. Associated with `TrackedFake` (input tracking thing in symbolic_shapes)
2. Associated with `gm.meta`
6. Do ID match with the proxies and emit their stacktraces.
We rely on (https://github.com/pytorch/pytorch/pull/159923) for other sources of leakages such as:
1. We failed to proxy an operator (like param.data)
2. We cache some tensor in model.forward (https://github.com/pytorch/pytorch/issues/155114)
In general, we notice that `gc.collect()` and querying gc for live objects are fairly slow, so we only turn on this feature under an env variable. We should document in export's public-facing docs that users who run into weird fake tensor errors should look into turning on this env variable for further analysis.
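A minimal sketch of the gc-based bookkeeping in steps (1)-(4) above (the real implementation also filters `TrackedFake`/`gm.meta` references and ID-matches survivors back to proxies):
```python
# Minimal sketch of steps (1)-(4): snapshot alive FakeTensors before and after
# tracing and take the delta; the real filtering and proxy matching is more involved.
import gc
from torch._subclasses.fake_tensor import FakeTensor

def alive_fake_tensor_ids() -> set[int]:
    gc.collect()
    return {id(obj) for obj in gc.get_objects() if isinstance(obj, FakeTensor)}

before = alive_fake_tensor_ids()
# ... run non-strict export / make_fx tracing here ...
leaked = alive_fake_tensor_ids() - before  # candidates for leakage analysis
```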
Test Plan:
Test plan
Rollback Plan:
Differential Revision: D80003204
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160456
Approved by: https://github.com/pianpwk
### Motivation
* MI250 Cirrascale runners are currently having network timeout leading to huge queueing of binary smoke test jobs:
<img width="483" height="133" alt="image" src="https://github.com/user-attachments/assets/17293002-78ad-4fc9-954f-ddd518bf0a43" />
* MI210 Hollywood runners (with runner names such as `pytorch-rocm-hw-*`) are not suitable for these jobs, because they seem to take much longer to download artifacts: https://github.com/pytorch/pytorch/pull/153287#issuecomment-2918420345 (this is why these jobs were specifically targeting Cirrascale runners). However, it doesn't seem like Cirrascale runners are necessarily doing much better either e.g. [this recent build](https://github.com/pytorch/pytorch/actions/runs/17332256791/job/49231006755).
* Moving to MI325 runners should address the stability part at least, while also reducing load on limited MI2xx runner capacity.
* However, I'm not sure if the MI325 runners will do any better on the artifact download part (this may need to be investigated more) cc @amdfaa
* Also removing `ciflow/binaries` and `ciflow/binaries_wheel` label/tag triggers for `generated-linux-binary-manywheel-rocm-main.yml` because we already trigger ROCm binary build/test jobs via these labels/tags in `generated-linux-binary-manywheel-nightly.yml`. And for developers who want to trigger ROCm binary build/test jobs on their PRs, they can use the `ciflow/rocm-mi300` label/tag as per this PR.
### TODOs (cc @amdfaa):
* Check that the workflow runs successfully on the MI325 runners in this PR. Note how long the test jobs take esp. the "Download Build Artifacts" step
* Once this PR is merged, clear the queue of jobs targeting `linux.rocm.gpu.mi250`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162044
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
# why
- some templates e.g. scale_mm need to unsqueeze/squeeze the nodes
for codegen and heuristics
- unified place where we can just adjust them for the template
# what
- inside get_mm_configs, return not the passed in kernel inputs,
but allow the template heuristic to adjust them if necessary
- the default implementation right now just passes them back
this diff just adds the functionality, but does not exercise it
other than the default (passthrough)
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520572](https://our.internmc.facebook.com/intern/diff/D81520572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161339
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125, #161126, #161336, #161338
## Summary
Adds a subgraph decomposition for addmm and mm that performs well when `K` is large compared to `M` and `N`, and functions well as an alternative to `split-k` on AMD (transposed only), since `split-k` is not currently supported on AMD.
## Background
On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower.
For example:
```
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176]))
))
```
is a lot slower than:
```
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1]))
))
```
This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels.
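Conceptually, the decomposition autotunes the following kind of rewrite (a hand-written sketch; the real subgraph is generated by Inductor, not written by hand):
```python
# Hand-written sketch of the rewrite being autotuned: copy B to a contiguous
# layout first, trading an extra copy for a faster contiguous GEMM at large K.
import torch

def contiguous_addmm(bias, a, b):
    return torch.addmm(bias, a, b.contiguous())

a = torch.randn(1024, 178176, device="cuda", dtype=torch.float16)
b = torch.randn(6144, 178176, device="cuda", dtype=torch.float16).t()  # stride [1, 178176]
bias = torch.randn(6144, device="cuda", dtype=torch.float16)
out = contiguous_addmm(bias, a, b)  # same result as torch.addmm(bias, a, b)
```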
## Data
I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups:
```
Parsed 420 unique shapes from benchmark output
addmm improvements when best:
addmm_16448x512x2048: +0.14%
addmm_128x2048x2048: +0.01%
addmm_128x768x1000: +0.75%
addmm_12672x3072x768: +1.08%
addmm_512x768x32000: +0.62%
addmm_12608x384x384: +0.00%
addmm_4160x1024x4096: +0.90%
addmm_16x768x2: +0.56%
addmm_12608x3072x768: +0.09%
addmm_64x4096x1000: +2.77%
addmm_256x1024x512: +1.99%
addmm_30x256x256: +1.12%
addmm_100480x128x384: +0.91%
addmm_6400x2048x512: +0.25%
addmm_61568x1024x256: +0.08%
addmm_1x768x768: +0.93%
addmm_12544x384x384: +0.19%
addmm_128x512x1000: +0.77%
addmm_2048x128x128: +1.32%
addmm_128x3072x1000: +0.24%
addmm_7936x512x2048: +0.07%
addmm_8192x512x2048: +0.33%
addmm_64x1024x1000: +1.43%
addmm_128x2304x1000: +0.01%
addmm_32768x256x2: +0.75%
addmm_64x384x1152: +0.79%
addmm_64x640x1000: +0.01%
addmm_100480x128x128: +0.87%
addmm_1152x3072x768: +1.13%
addmm_8192x256x2048: +1.40%
addmm_4096x128x768: +0.01%
addmm_128x2560x1000: +0.01%
addmm_12544x2048x512: +0.43%
addmm_200704x24x96: +0.14%
addmm_8448x512x2048: +0.96%
addmm_50176x256x1024: +0.62%
addmm_4160x4096x1024: +0.22%
addmm_4096x768x768: +0.32%
addmm_220x2048x512: +0.56%
addmm_8x2048x1000: +1.12%
addmm_256x197951x512: +26.99%
addmm_401536x64x192: +0.60%
addmm_2040x2048x512: +0.47%
addmm_512x1024x256: +1.32%
addmm_128x4096x1000: +1.67%
addmm_12672x768x768: +0.34%
addmm_128x368x1000: +0.77%
addmm_96x1280x1000: +0.01%
addmm_12544x512x2048: +0.41%
addmm_6272x320x1280: +0.76%
addmm_12544x3072x768: +0.09%
addmm_64x384x1000: +0.39%
mm improvements when best:
mm_200704x128x512: +1.29%
mm_663552x16x16: +0.80%
mm_4096x768x768: +0.51%
mm_131072x64x31: +0.24%
mm_12544x1152x384: +0.11%
mm_128x2048x2: +0.46%
mm_262144x16x23: +0.62%
mm_50176x576x192: +0.37%
mm_131072x16x31: +0.26%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 247
Average Subgraph placement: 3.38
Median Subgraph placement: 2.0
Subgraph is best choice: 52/247 shapes (21.1%)
Average improvement when best: 1.15%
Median improvement when best: 0.58%
Largest improvement when best: +26.99%
Operation: bmm
----------------------------------------
Total shapes analyzed: 85
Average Subgraph placement: 24.00
Median Subgraph placement: 21.0
Subgraph is best choice: 0/85 shapes (0.0%)
Average improvement when best: N/A (never best)
Median improvement when best: N/A (never best)
Largest improvement when best: N/A (never best)
Operation: mm
----------------------------------------
Total shapes analyzed: 88
Average Subgraph placement: 15.08
Median Subgraph placement: 4.0
Subgraph is best choice: 9/88 shapes (10.2%)
Average improvement when best: 0.52%
Median improvement when best: 0.46%
Largest improvement when best: +1.29%
```
## Results
The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune:
```
addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436
addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702
addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834
addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105
...
```
Compared to the non-transposed autotune:
```
addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421
addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246
addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547
addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895
addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916
```
It seems to perform really well for high values of `K` vs `N` and `M`.
Testing this hypothesis with some custom shapes:
```
Parsed 64 unique shapes from benchmark output
addmm improvements when best:
addmm_128x16384x128: +0.18%
addmm_128x262144x256: +38.24%
addmm_128x200000x512: +14.76%
addmm_256x800000x128: +0.06%
addmm_131072x128x256: +0.27%
addmm_128x256x131072: +0.25%
addmm_2048x200000x64: +12.45%
mm improvements when best:
mm_128x16384x128: +0.18%
mm_128x262144x256: +38.05%
mm_128x200000x512: +9.47%
mm_256x800000x128: +0.99%
mm_512x6400000x256: +3.17%
mm_524288x64x64: +0.29%
mm_2048x200000x64: +11.19%
mm_8192x1000000x256: +34.14%
mm_128x4096x100000: +0.40%
mm_128x3072x150000: +0.27%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 33
Average Subgraph placement: 4.39
Median Subgraph placement: 2.0
Subgraph is best choice: 7/33 shapes (21.2%)
Average improvement when best: 9.46%
Median improvement when best: 0.27%
Largest improvement when best: +38.24%
Operation: mm
----------------------------------------
Total shapes analyzed: 30
Average Subgraph placement: 7.63
Median Subgraph placement: 2.0
Subgraph is best choice: 10/30 shapes (33.3%)
Average improvement when best: 9.81%
Median improvement when best: 2.08%
Largest improvement when best: +38.05%
```
## Conclusion
Contiguous subgraph decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and has a very large improvement on low `M`, low `N`, and high `K` shapes.
Data gathering scripts:
https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866
## Test Plan:
New unit tests.
Differential Revision: D80771648
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161241
Approved by: https://github.com/eellison
Summary:
Inductor has the following configurations:
config.comprehensive_padding
config.padding_alignment_bytes
config.padding_stride_threshold
In the static-shape case, enabling these three options makes Inductor generate code for flexible-layout tensors that pads every stride dimension up to a multiple of config.padding_alignment_bytes for strides above config.padding_stride_threshold. When dynamic shapes are enabled, no padding is done today.
This PR introduces the following configuration, which allows the user to request padded strides even for dynamic-shape operations. A separate flag is used so we don't break the previous behaviour of not padding dynamic-shape use cases. config.padding_stride_threshold does not apply here since the stride values are dynamic.
config.pad_dynamic_shapes
In addition, a new mode "python_slow" has been added to the launch-grid calculation, which implements the usual ceildiv behaviour for integer division. This is done to prevent test regressions and to make wrapper_fxir codegen more generic.
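A hedged usage sketch (the first two knobs are the existing Inductor configs listed above; `pad_dynamic_shapes` is the new flag, and its defaults/interactions are assumed here):
```python
# Hedged sketch: opt dynamic-shape compilations into stride padding as well.
# `pad_dynamic_shapes` is the new flag; the other knobs are existing configs.
import torch
import torch._inductor.config as inductor_config

inductor_config.comprehensive_padding = True
inductor_config.padding_alignment_bytes = 128
inductor_config.pad_dynamic_shapes = True

@torch.compile(dynamic=True)
def f(x):
    return (x @ x.t()).relu()

f(torch.randn(64, 96, device="cuda"))
```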
Test Plan:
CI
Rollback Plan:
Differential Revision: D80468808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160997
Approved by: https://github.com/blaine-rister, https://github.com/jansel
In this PR, we port 4 test files under test/distributed/parallel and 1 test file under test/distributed/debug for Intel GPU.
We enable Intel GPU with the following methods while trying our best to keep the original code style:
1. Use torch.accelerator for general gpu
2. Skip the cases that have known issues when running on XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161261
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Summary: Save the config args that Inductor burns into `inductor_metadata` so we can optionally pass them to any Jit Hooks that are set. This allows us to pass them to Tritonparse.
Reviewed By: davidberard98, FindHao
Differential Revision: D80994791
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161953
Approved by: https://github.com/FindHao
Summary:
we have
```
std::vector<size_t> constants_internal_offset(
num_constants - num_folded_constants);
```
but the for loop does not consider it
```
for (size_t i = 0; i < num_constants; i++) {
...
constants_internal_offset[i]
...
```
even in the for loop, it does
```
bool from_folded = this->constant_from_folded(i);
if (from_folded) {
continue;
}
```
but `i` can still be wrong: `constants_internal_offset` is sized for the non-folded constants only, yet it is indexed with the raw loop index rather than a compacted one
Rollback Plan:
Differential Revision: D81425007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161887
Approved by: https://github.com/angelayi
If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.
Observed that the time spent in SymInt functions in computeStorageNBytes drops
(and does not cost-shift elsewhere in the function) after this change, by
profiling detach() using code similar to the benchmark from #160580
and Linux perf.
Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161586
Approved by: https://github.com/ezyang
ghstack dependencies: #161466
Summary:
This is a re-land of [PR161040](https://github.com/pytorch/pytorch/pull/161040), which had previously caused test failures on AMD GPUs. The tests are now configured to target only NVIDIA GPUs.
This diff removes configurations that exceed the hardware shared memory limit, which causes the following compilation error:
```
No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
```
Test Plan:
```
pytest test/inductor/test_max_autotune.py
pytest test/inductor/test_triton_heuristics.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161996
Approved by: https://github.com/coconutruben
As the title stated.
As the document grows, its content keeps increasing, so to make it easier for users to read and easier for developers to maintain, we have split this file into several separate files and placed them in a dedicated directory called "accelerator".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161845
Approved by: https://github.com/albanD
Adds fallback commands for the following:
* python setup.py install
* python setup.py develop
Ideally these should just work and should provide backwards compat.
The thought process here is that many people rely on these commands, and just because setuptools wants to drop support for them doesn't mean our downstream users who build from source expect them to be gone.
This should provide some room for developers to move away from these commands until we have a unified frontend for doing all of these commands that should abstract most of these away.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162009
Approved by: https://github.com/clee2000, https://github.com/atalman
# why
- another step towards get_mm_configs providing
all the kwargs needed to add a choice from
a template. This in turn will allow us to send
all templates through one single call, and handle modifications
# what
- use the infrastructure for template heuristics to provide extra kwargs
that are fixed for a template/op pair to provide the suffix args
and epilogue function/fn for scaled_mm
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D80670914](https://our.internmc.facebook.com/intern/diff/D80670914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161126
Approved by: https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125
# why
- another step towards get_mm_configs providing
all the kwargs needed to add a choice from
a template. This in turn will allow us to send
all templates through one single call, and handle modifications
# what
- use the infrastructure for template heuristics to provide extra kwargs
that are fixed for a template/op pair to provide the prefix args
and epilogue function/fn for addmm/baddbmm
- expand kernelinputs to also be able to shuttle around non tensor
inputs (scalars) as is needed for alpha and beta
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k addmm
```
Differential Revision: [D80670912](https://our.internmc.facebook.com/intern/diff/D80670912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161125
Approved by: https://github.com/jansel
ghstack dependencies: #161123, #161124
# why
- another step towards get_mm_configs providing
all the kwargs needed to add a choice from
a template. This in turn will allow us to send
all templates through one single call, and handle modifications
# what
use the infrastructure for template heuristics to provide extra kwargs
that are fixed for a template/op pair to provide the workspace_arg for
all the tma templates
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k tma
```
Differential Revision: [D80670915](https://our.internmc.facebook.com/intern/diff/D80670915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161124
Approved by: https://github.com/jansel
ghstack dependencies: #161123
# why
- some kwargs are choice independent but rather
always the same for a specific op or template
- this enables us to track those differently than the
choice ones, and thus enables interception of them
cleaner
- maybe_append_choices can then be simplified to
just pass through the kwargs
# what
- hookup for template heuristics to have per template/op extra
kwargs that are always the same, for all choices
- hookup for the caller of get_mm_configs to provide template/op
kwargs to override some of the template/choice kwargs
this pr does not use the new machinery, and everything is empty
for now. subsequent prs start using it to simplify ops
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D80670916](https://our.internmc.facebook.com/intern/diff/D80670916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161123
Approved by: https://github.com/jansel
The functions here expect to return pointers, but currently aren't returning anything. Make them return NULL.
The properties array wants an extra set of braces. One pair for the array, another for the first item in the array.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161792
Approved by: https://github.com/Skylion007
As titled: make the `getmem` calls in the loop non-blocking so that we max out the issuance rate.
Also add a single `nvshmem_quiet()` at the end to make sure all the getmem calls complete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162006
Approved by: https://github.com/ngimel
Previously, without calling `torch.empty` before NVSHMEM init, we saw the error below:
```
src/host/init/init.cu:nvshmemi_check_state_and_init:1117: nvshmem initialization failed, exiting
src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
```
Fix it by calling `cudaFree(nullptr)` to make sure the CUDA runtime is initialized before NVSHMEM init.
Remove all `torch.empty(1)` calls from the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161232
Approved by: https://github.com/ngimel
ghstack dependencies: #161214
We previously assumed aot precompile should only work on non-closures. This is hard to enforce in practice because we see a lot of cases with decorators (e.g. Hugging Face models):
```
def check_inputs(fn):
def _fn(*args, **kwargs):
for arg in args:
assert arg.shape[0] > 1
return fn(*args, **kwargs)
return _fn
@check_inputs
def foo(x, y):
a = x + x
b = y + y
c = a + b
return c
```
It doesn't make sense not to support these cases since they are straightforward to handle.
This PR adds the logic to handle closure and make sure they can be precompiled properly.
Differential Revision: [D81509535](https://our.internmc.facebook.com/intern/diff/D81509535/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161990
Approved by: https://github.com/angelayi
This patch enables hipblaslt backend tests for test_mm_bmm_dtype_overload and test_addmm_baddmm_dtype_overload.
Tests were disabled as part of #150812
Rocblas backend tests are not enabled yet, WIP.
Test command
PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_mm_bmm_dtype_overload' -v PYTORCH_TEST_WITH_ROCM=1 pytest test/test_matmul_cuda.py -k 'test_addmm_baddmm_dtype_overload' -v
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161540
Approved by: https://github.com/jeffdaily
We basically follow the same pattern we use for tensor arguments. The major downside is that we now have to traverse the entirety of the int list, etc., where previously we didn't have to. Benchmarks suggest a 2% regression for the relevant cases.
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160256
Approved by: https://github.com/albanD
Refactor torchscript based exporter logic to move them to a single (private) location for better code management. Original public module and method apis are preserved.
- Updated module paths in `torch/csrc/autograd/python_function.cpp` accordingly
- Removed `check_onnx_broadcast` from `torch/autograd/_functions/utils.py` because it is private&unused
@albanD / @soulitzer could you review changes in `torch/csrc/autograd/python_function.cpp` and
`torch/autograd/_functions/utils.py`? Thanks!
## BC Breaking
- **Deprecated members in `torch.onnx.verification` are removed**
Differential Revision: [D81236421](https://our.internmc.facebook.com/intern/diff/D81236421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161323
Approved by: https://github.com/titaiwangms, https://github.com/angelayi
Many users want a config that forces all CUDA ops to be captured by cudagraphs; when that is not possible, PT2 should error.
This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default: False). An environment variable `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR` is also added to control it.
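A hedged sketch of how a user might opt in (the exact config location is assumed from the names given above):
```python
# Hedged sketch: force cudagraph capture or fail loudly; the config path below
# is assumed from the flag name in the description.
import torch
import torch._inductor.config as inductor_config

inductor_config.triton.cudagraph_or_error = True
# or equivalently, per the description: TORCHINDUCTOR_CUDAGRAPH_OR_ERROR=1

@torch.compile(mode="reduce-overhead")   # reduce-overhead enables cudagraphs
def step(x):
    return x.sin() + 1

step(torch.randn(8, device="cuda"))
```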
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161862
Approved by: https://github.com/ezyang
Fixes #160053
The previous error message, `Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now`, was not clear.
Now we have:
```
python3
Python 3.13.5 | packaged by conda-forge | (main, Jun 16 2025, 08:27:50) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
... import torch.nn.functional as F
... a = torch.empty(2,2,2,2)
... F.pad(a, (1,1), mode="circular")
...
Traceback (most recent call last):
File "<python-input-0>", line 4, in <module>
F.pad(a, (1,1), mode="circular")
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rrathaur/Desktop/pytorch/torch/nn/functional.py", line 5294, in pad
return torch._C._nn.pad(input, pad, mode, value)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: Padding size 2 is not supported for 4D input tensor.
Supported combinations for non-constant padding:
- 2D or 3D input: padding size = 2 (pads last dimension)
- 3D or 4D input: padding size = 4 (pads last 2 dimensions)
- 4D or 5D input: padding size = 6 (pads last 3 dimensions)
>>>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160866
Approved by: https://github.com/mikaylagawarecki
Before this PR, `_select_strategy` always selected the first strategy with minimum redistribute cost. This caused unexpected behavior when
- multiple strategies have 0 redistribute costs
- the first one with 0 redistribute cost may perform local chunking
E.g. in memory efficient SDPA, the default orders of candidate strategies have a `Shard(2)` one before the `Replicate()` one. https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_matrix_ops.py#L500-L512
When the input is `Replicate()`, `_select_strategy` will pick the `Shard(2)` strategy and do local chunking first, before local computation. This is clearly unexpected to users.
In this PR, we improve `_select_strategy` so that when multiple strategies have 0 redistribute cost, we prioritize the one which keeps input unchanged.
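A minimal sketch of the new tie-breaking rule (hypothetical plain-dict strategies; DTensor's real OpStrategy objects carry more structure than shown):
```python
# Minimal sketch with hypothetical dict-based strategies: among strategies with
# the minimum redistribute cost, prefer the one that keeps the input placement.
def select_strategy(strategies, current_placement):
    min_cost = min(s["redistribute_cost"] for s in strategies)
    cheapest = [s for s in strategies if s["redistribute_cost"] == min_cost]
    for s in cheapest:
        if s["input_placement"] == current_placement:
            return s
    return cheapest[0]

strategies = [
    {"input_placement": "Shard(2)", "redistribute_cost": 0.0},   # would chunk locally
    {"input_placement": "Replicate()", "redistribute_cost": 0.0},
]
assert select_strategy(strategies, "Replicate()")["input_placement"] == "Replicate()"
```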
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161882
Approved by: https://github.com/ezyang
Fixes #161375
The "Using the visualizer" section in torch_cuda_memory.md had a link to https://pytorch.org/memory_viz written in inline Markdown link form. Strangely, the same syntax worked earlier on the page, as mentioned in the issue, but in this spot it was rendered as a broken link.
I wasn't able to pinpoint why the second occurrence was treated differently, but switching it to the Markdown autolink form fixes the problem consistently. I tested this by rebuilding the docs locally with make html and serving the HTML with a local http.server. With the autolink, the link resolves correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161426
Approved by: https://github.com/soulitzer
The error message here implies that we can only call `self.persistent_load(...)` for ints or tuples, but due to the second part of the type check being inverted, weights-only unpickler will throw an exception iff `pid` is an int.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161661
Approved by: https://github.com/Skylion007
We have 4 different versions of inductor benchmark Docker images used in CI at the moment:
1. `pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks` is used by almost all inductor jobs including nightly benchmark
2. `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks` runs inductor unit tests with python 3.12
3. `pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks` runs inductor unit tests with python 3.13
4. `pytorch-linux-jammy-py3-gcc11-inductor-benchmarks` runs inductor unit tests on CPU
My proposal here is to clean up (2) and (3) and to keep (1) under the same setup from https://ghcr.io/pytorch/torchbench. Simplicity is the key here as inductor workflows are getting more and more complex:
1. Unit tests for Python variants like 3.12 and 3.13 were useful when they were first added to CI. They are much less useful now. [Flambeau](https://hud.pytorch.org/flambeau/s/3876ec7b-43f0-42c6-bfbf-899035e5bb77) shows a 0.97 correlation between them. And we are also moving to 3.14 nowadays. I want to choose 3.12 for (1), but will do this separately. This is also what TorchBench and vLLM are using on CI.
1. We are gradually cleaning up 3.9 on CI https://github.com/pytorch/pytorch/issues/161167
Another BE change here is to rename the jobs in various inductor workflows, because names like `linux-jammy-cuda12_8-py3_10-gcc9-inductor-build` are too long and confusing to look at; better to just use human-friendly names like `inductor-build`. Other information is already spelled out in the build environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161536
Approved by: https://github.com/zou3519
## Introduction
During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG (we call it the **capturing graph**) of work. We can use the capturing graph to determine when a block's old lifetime is fully before future work, and safely reuse it within the same capture.
This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.
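A hedged sketch of opting in (the key name comes from the flag above; whether it is exposed through `PYTORCH_CUDA_ALLOC_CONF` exactly like other caching-allocator options is an assumption):
```python
# Hedged sketch: the key mirrors the experimental flag named above; exposing it
# via PYTORCH_CUDA_ALLOC_CONF like other caching-allocator options is assumed.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "graph_capture_record_stream_reuse:True"

import torch

g = torch.cuda.CUDAGraph()
x = torch.randn(1 << 20, device="cuda")
with torch.cuda.graph(g):          # during capture, freed blocks may now be reused in-capture
    y = (x * 2.0).relu()
g.replay()
```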
## Terms
* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.
## When can we reuse a block during capture?
### Strong Rule (Graph-Wide Safety)
This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.
> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.
Why it's safe:
This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.
### Per-stream Rule (A Practical Optimization)
The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.
In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.
> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.
In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.
## Implementation
* On `free(block)` during capture
* For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
* If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
* Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
* Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
* For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
* If yes, hand the block to S for immediate reuse within the same capture.
* If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
* Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.
## Examples (2 streams)
<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />
* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.
## Edge Case: Freeing after a join
Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198).
In case 4, we have a missed opportunity: because the block's usage is not explicitly marked, we cannot determine that its actual last use may have occurred much earlier, long before the join. So we must wait for the subsequent join before the block can be reused.
## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Fixes #160437
Summary:
This PR avoids compiling empty FX graphs generated during graph breaks. If there are no calls in the graph, we can just return the empty list of instructions.
More precisely:
In compile_and_call_fx_graph, if the FX graph contains no calls (count_calls(self.graph) == 0) and the return-value list is empty, we now return an empty instruction list immediately.
Impact:
module: dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160536
Approved by: https://github.com/Lucaskabela
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
`get_remote_tensor`: returns a symmetric tensor given a peer rank.
The differences between the `get_buffer` API and the `get_remote_tensor` API:
- the former accepts an offset, whereas the latter doesn't
- the latter returns a symmetric tensor at `hdl.offset` on `peer`.
As a refactorization, this PR also moves the implementation of `get_buffer` and `get_signal_pad` to the `SymmetricMemory` level as their code is common to all backends.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161533
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471, #161532
(Porting most of #161008)
Hook the SymmetricMemory allocator up to MemPool so that users can create symmetric tensors with regular `torch.zeros`, `torch.arange`, etc. factories, and so that our ops can have functional variants that create `out` tensors on symmetric memory.
For end users, this PR supports a Python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
tensor = torch.arange(numel, dtype=dtype, device=device)
```
Added tests for both use cases above.
Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
**Summary**
This PR improves the performance of A16W8 GEMM template by
- Removing the config with block_n=48 & block_m=16 as it is not very efficient.
- Using AMX microkernel when M >= 5 so that we use AMX instead of AVX512 for M=5~31.
- Converting int8 values to bf16 with intrinsics instead of `at::vec::convert` as the latter does not have optimized implementation for this case.
We saw up to >10% performance gain in various cases of running Llama-3.1-8b-instruct.
**Test plan**
Already covered by UT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161148
Approved by: https://github.com/CaoE, https://github.com/jansel
The way AsyncAllreduceCUDADeviceWork is currently implemented,
using it will force a copy of `shared_ptr<gloo::Context>`
because `std::move` does nothing for a const ref.
This PR changes the param type to shared_ptr<> instead of the
const ref. This allows more efficient parameter passing.
Here's an example that demonstrates the issue:
```cpp
#include <memory>
#include <iostream>
struct Foo {};
void useFoo_ref(const std::shared_ptr<Foo>& f) {
std::shared_ptr<Foo> internal = std::move(f);
std::cout << "use_count: " << internal.use_count() << '\n';
}
void useFoo_val(std::shared_ptr<Foo> f) {
std::shared_ptr<Foo> internal = std::move(f);
std::cout << "use_count: " << internal.use_count() << '\n';
}
int main() {
std::shared_ptr<Foo> f1 = std::make_shared<Foo>();
useFoo_ref(std::move(f1)); // prints "use_count: 2"
std::shared_ptr<Foo> f2 = std::make_shared<Foo>();
useFoo_val(std::move(f2)); // prints "use_count: 1"
}
```
This also aligns well with [C++ Core Guidelines][1] for handling
smart pointers.
[1]: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines?utm_source=chatgpt.com#Rr-summary-smartptrs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161834
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501
Following the signatures and semantics of `Stream` and `Event` in CUDA, we use CPU multithreading
and condition variables to implement equivalent capabilities as the underlying foundation of torch_openreg.
**Changes:**
- Add stream capabilities for OpenReg
- Add event capabilities for OpenReg
- Add kernel launch entrypoint for OpenReg
- Add testcases about stream and event for OpenReg
- Add example for OpenReg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160099
Approved by: https://github.com/albanD
ghstack dependencies: #161603
This will allow some more cases to use tensor descriptors. E.g. before, the following block params would not match
because the innermost dimension does not have stride 1:
```python
block_params=BlockParameters(shape=[64, 4, 1, 1], block_shape=[((XBLOCK + 3)//4), Min(4, XBLOCK), 1, 1], strides=[0, 1, 0, 0], offsets=[(xoffset//4), ModularIndexing(xoffset, 1, 4), 0, 0])
```
After broadcasting dimensions and singleton dimensions are removed:
```python
block_params=BlockParameters(shape=[4], block_shape=[Min(4, XBLOCK)], strides=[1], offsets=[ModularIndexing(xoffset, 1, 4)])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161602
Approved by: https://github.com/jansel
Update the torch-xpu-ops commit to [8b58040ee32689487f660462f655085f31506dab](8b58040ee3), which includes:
- Add vectorization path on maxpool forward channel last
- Add FlightRecorder support for ProcessGroupXCCL
- Fix random build failure on codegen
- Suppress dllexport warning on Windows
- Make torch-xpu-ops build depend on ATen XPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161152
Approved by: https://github.com/EikanWang
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
- Empty containers are Falsey
- Hoist cheap checks first
- Microbenchmarked single-element set access method
Benchmark code:
```
import timeit
to_test = [
('list(x)', 'x = set([3])'),
('x[0]', 'x = [3]'),
('list(x)[0]', 'x = set([3])'),
('next(iter(x))', 'x = set([3])'),
]
for (stmt, setup) in to_test:
res = timeit.timeit(stmt=stmt, setup=setup)
print(f"Time for `{stmt}`: {res}")
```
Result with Python 3.13 on Mac (with excess digits manually trimmed; directionally matches result on Linux)
```
Time for `list(x)`: 0.03418
Time for `x[0]`: 0.00852
Time for `list(x)[0]`: 0.03561
Time for `next(iter(x))`: 0.02278
```
FWIW, I was surprised by this result, so I guess I'm glad I wrote the benchmark!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161308
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #161301, #161292, #161304
This took me a bit to figure out and I'm pretty sure I've looked at
this code before. Pybind uses
`return_value_policy::reference_internal` for `def_property`, which
[causes the owning object to be kept alive for the lifespan of the
return
value](https://pybind11.readthedocs.io/en/stable/advanced/functions.html),
allowing the getter to safely avoid copying the property
value. However, lambdas act like they return `auto`, not
`decltype(auto)`, so our lambdas themselves were forcing copies!
Testing: observed std::vector<Argument> copying disappear in Linux
perf profile of someOpInfo._schema.arguments/returns (in
_python_dispatch.correct_storage_aliasing).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161301
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/wconstab
If the number of mxn blocks cannot occupy all the threads, using a smaller register block size gives better performance, since the compute size per thread is smaller.
It may give ~20% performance improvement for the real case `m1_n512_k4096`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161144
Approved by: https://github.com/leslie-fang-intel
Prior to this PR, we have:
```
[Default Behavior] uses `tl.math.exp({x})`:
eager diff: tensor(2.6935e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(9.2757e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013996509159580942, compile_latency:0.0013981951951980592
TORCHINDUCTOR_USE_FAST_MATH=1 uses `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)`:
eager diff: tensor(2.2315e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(3.5329e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013982331859319662, compile_latency:0.0013824134564199367
Update inductor to use `tl.extra.libdevice.exp(tmp0)`:
eager diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0014109122834153282, compile_latency:0.0014062877025520593
```
Since `tl.extra.libdevice.exp` leads to both better precision and on-par latency, we use it by default now.
Note that `tl.extra.libdevice.exp` used to have a perf issue in [January 2025](https://github.com/triton-lang/triton/issues/5735) because it used `ex2.approx.f32` instead of `ex2.approx.ftz.f32`, so `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)` was used as a workaround. I double checked that the issue is resolved and `tl.extra.libdevice.exp` also uses [ex2.approx.ftz.f32](https://github.com/triton-lang/triton/issues/5735#issuecomment-3238421293) today.
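As a quick sanity check of the identity behind the old workaround, `exp(x) == exp2(x * log2(e))`, both paths agree to float32 precision:
```python
# Numerical check of the identity used by the old workaround:
# exp(x) == exp2(x * log2(e)), with log2(e) ~ 1.4426950408889634.
import math
import torch

x = torch.randn(1 << 16)
log2e = 1.4426950408889634
assert math.isclose(log2e, math.log2(math.e))
torch.testing.assert_close(torch.exp(x), torch.exp2(x * log2e), rtol=1e-5, atol=1e-6)
```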
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161829
Approved by: https://github.com/jansel
Fixes #159247
Issue 1: Accuracy Problem with Non-Divisible KV Sequences
---------------------------------------------------------
### Background
Paged attention in flex decoding produced inaccurate results when KV sequence length is not divisible by block size. For example, when `KV_S = 64` and `block_size = 128`, the output didn't match standard attention accuracy.
### Root Cause
The current paged attention does not apply an upper mask mod when converting from the logical to the physical mask mod. Instead, it uses a noop_mask by default, which leaves all values unmasked and leads to an accuracy mismatch. Adding an upper mask mod based on the original actual kv_len (64 in this test case) resolves the issue.
### Solution
* **Applied proper upper bound masking**: Updated all calls to `convert_logical_block_mask` to pass `kv_len` as a tensor with proper shape `[B, KV_S]` to provide information of actual batched KV sequence length. The function now correctly applies upper bound checks using the actual KV sequence lengths for each batch
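A simplified sketch of the bound being applied (hypothetical wrapper; the real change lives inside `convert_logical_block_mask` / `get_mask_mod`):
```python
# Simplified sketch with a hypothetical wrapper: AND the converted mask_mod with
# an upper-bound check against each batch's actual KV length.
import torch

def add_upper_bound(mask_mod, kv_len):            # kv_len: int tensor of shape [B]
    def bounded_mask_mod(b, h, q_idx, kv_idx):
        return (kv_idx < kv_len[b]) & mask_mod(b, h, q_idx, kv_idx)
    return bounded_mask_mod

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

kv_len = torch.tensor([64])                       # actual KV_S, even if block_size is 128
bounded = add_upper_bound(causal, kv_len)
b, h = torch.tensor(0), torch.tensor(0)
print(bounded(b, h, torch.tensor(100), torch.tensor(80)))  # tensor(False): 80 >= kv_len[0]
```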
### Files Modified
* `torch/nn/attention/experimental/_paged_attention.py`: Added `kv_len` parameter as a tensor to `get_mask_mod` and applied upper mask to the new mask mod.
* `test/inductor/test_flex_attention.py`: Fixed all related `kv_len` parameter call in the tests
* `test/inductor/test_flex_decoding.py`: Fixed all related `kv_len` parameter call in the tests
Issue 2: Invalid Memory Access (IMA) in Triton Kernels
------------------------------------------------------
### Background
The Triton kernel for flex attention was experiencing invalid memory access errors when running with compute sanitizers, particularly with short KV sequences and small batch sizes.
### Root Cause
* Kernel launches CTAs (Cooperative Thread Arrays) proportional to GPU's multi-processor count (108 via `SPLIT_KV`)
* With small workloads, many CTAs remain idle but still attempt to access `kv_indices` with invalid `indices_idx` values
* This caused out-of-bounds memory access violations
### Solution
Implemented boundary checks with early exit:
1. **Added `MAX_VALID_KV_IDX` parameter** in `torch/_inductor/kernel/flex/flex_decoding.py`
* Calculate maximum valid KV index based on actual `kv_indices` tensor size and pass it to Triton template
2. **Added early exit logic** in `torch/_inductor/kernel/flex/templates/flex_decode.py.jinja`
* Boundary checks before accessing `kv_indices` in both normal and full blocks
* Idle CTAs with invalid `indices_idx` skip computation entirely
This prevents invalid memory access while reducing wasted computation on idle thread blocks.
Testing & Validation
--------------------
### Accuracy Tests
* Added comprehensive test cases covering KV sequences not divisible by block sizes
* Verified output matches standard attention for various sequence length combinations
### Sanitizer Results
`========= COMPUTE-SANITIZER Starting standalone test_max_autotune... Running test_max_autotune on device: cuda max_autotune config: True test_max_autotune completed successfully! Test passed! ========= ERROR SUMMARY: 0 errors`
**Before**: More than 13720 invalid memory access errors with sanitizers
**After**: Clean execution with 0 errors
Both fixes work together to ensure paged attention produces accurate results while running safely without memory access violations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160861
Approved by: https://github.com/BoyuanFeng
`send_object_list` and `recv_object_list` use regular `send`/`recv` P2P ops which means that they will create 2-rank NCCL communicators between ranks if the communicators have not been initialized.
This adds an option `use_batch` which will call the send/recv with `batch_isend_irecv` which will re-use the communicators already initialized for collectives in the group.
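A hedged usage sketch (the keyword name is taken from this description; assumes a NCCL process group spanning ranks 0 and 1 is already initialized):
```python
# Hedged sketch: pass use_batch=True so the object send/recv goes through
# batch_isend_irecv and reuses the group's existing communicator.
import torch.distributed as dist

rank = dist.get_rank()
objs = [{"step": 1, "loss": 0.25}] if rank == 0 else [None]

if rank == 0:
    dist.send_object_list(objs, dst=1, use_batch=True)
elif rank == 1:
    dist.recv_object_list(objs, src=0, use_batch=True)
    print(objs[0])
```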
---
BatchP2P ops create (or use an existing) communicator keyed by device index.
Regular P2P ops create (or use existing) dedicated 2-rank communicators keyed by “rank1:rank2”.
See:
c8205cb354/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L3980-L4008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160342
Approved by: https://github.com/wconstab
Fixes https://github.com/pytorch/pytorch/issues/161510
Test plan:
```
% cd third_party/kineto
% git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive
M libkineto/third_party/dynolog
M libkineto/third_party/fmt
M libkineto/third_party/googletest
Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104)
HEAD is now at fe80f93 Fix MSVC Error (#1134)
Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21'
Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
% git checkout 5e75018; git submodule update --init --recursive
M libkineto/third_party/dynolog
M libkineto/third_party/fmt
M libkineto/third_party/googletest
Previous HEAD position was fe80f93 Fix MSVC Error (#1134)
HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104)
warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty
Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850'
Submodule path 'libkineto/third_party/fmt': checked out '0041a40c1350ba702d475b9c4ad62da77caea164'
Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347'
% cd ../..
% git status
HEAD detached from 649e397c6de
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
(commit or discard the untracked or modified content in submodules)
modified: third_party/kineto (untracked content)
% time git submodule foreach --recursive git clean -ffdx
...
git submodule foreach --recursive git clean -ffdx 0.47s user 0.96s system 88% cpu 1.625 total
% git status
HEAD detached from 649e397c6de
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748
Approved by: https://github.com/atalman
Gives 18% speedup on rms norm (2048, 32768). And we have seen other instances where inductor is not aggressive enough about codegening persistent reductions - e.g. 39% on [this kernel from torch ao](https://github.com/pytorch/pytorch/issues/159769#issuecomment-3188568335).
Codegen-ing persistent reductions can be risky if you run out of registers. Here, I'm effectively making persistent reductions a configuration of looped reductions by setting RBLOCK == rnumel, so that we can still fall back to looped reductions as needed.
As criteria:
- there needs to be significant memory savings from doing a persistent reduction (by keeping memory in register and avoiding another iteration over input)
- we should not be coalescing on x dimension, otherwise large rblock will inhibit coalescing
- we should not be especially register or arithmetic intensive (this last part uses mem_ops_per_thread, but could be improved).
Still need to do dashboard run, although I'm not sure we get a lot of large rblock in our benchmarks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161055
Approved by: https://github.com/jansel
This heavily borrows implementation logic from `topk`.
As this method is non-deterministic, the logic for comparing indices against the CPU op was modified to just an equality statement, since by default the random numbers picked for the input tensor allow for quite a lot of overlaps.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161817
Approved by: https://github.com/dcci
We move AsyncTP tests to a separate test suite because 1) Async TP ops are not the core symmetric memory APIs, they are more like applications, and 2) MultiProcContinuousTest will skip all the following tests if a test fails (we should fix this too). We still want to get the test signals for the core
symmetric memory APIs when Async TP ops fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161820
Approved by: https://github.com/kwen2501
[#RFC153024](https://github.com/pytorch/pytorch/issues/153024)
**Motivation**
1. Attention has been the critical performance bottleneck in current LLM models, and FlexAttention is a good choice for covering the broad attention variants in the transformers series of models. With FlexAttention, it is easy for us to enable paged attention and fused SDPA in the transformers repo on the XPU device. Besides, it also provides a candidate for processing attention in LLM ecosystem libraries, e.g., vLLM and SGLang, on the XPU device.
2. FlexAttention is a good starting point to push the Intel Triton-based GEMM kernels toward maturity. FlexAttention provides both a flexattention kernel and a flexdecoding kernel to cover compute-bound and memory-bound GEMM computation, and different shapes should also be supported to serve LLM inference, e.g. head_dim=64, 96, 128, 256.
**What does this PR do?**
1. Enable the device type for Flexattention kernel and UTs to ensure all important UTs pass on XPU device.
2. For E2E model inference, ensure that LLM model inference with FlexAttention is functional and ready.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143553
Approved by: https://github.com/EikanWang, https://github.com/drisspg
Co-authored-by: Mao Yunfei <yunfei.mao@intel.com>
Co-authored-by: Xingyuan Li <xingyuan.li@intel.com>
Co-authored-by: majing <jing1.ma@intel.com>
Co-authored-by: Xiao, Wang <wang.xiao@intel.com>
# why
- untested so far
# what
- add an empty config heuristic for all devices for decompose k
- the cuda heuristic, because it is more specific, will still be picked
up
- add notes explaining how to enable on other devices
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k "decompose_k"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161795
Approved by: https://github.com/PaulZhang12
ghstack dependencies: #161767
# why
- not having a heuristic is an error but should not crash, just provide 0 configs
- some heuristics are cross device type
- cleaner to be explicit about being cross device type than having to
enumerate every possible device type
# what
- on registration, supply device_type=None (explicitly) to say this
heuristic is cross device
- test to guard the heuristics hierarchies
# testing
```
python3 -bb -m pytest test/inductor/test_template_heuristics_registry.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161767
Approved by: https://github.com/PaulZhang12
Fixes #161729
Written by codex
This won't produce contiguous inputs for all einsum applications: because we flatten all right-only and left-only dimensions, if right and left operand dimensions are interleaved in the output we cannot (with the current algorithm) produce a contiguous output. However, it works for common cases like the one in the linked issue. Let's see what CI says.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161755
Approved by: https://github.com/malfet, https://github.com/albanD
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.
https://github.com/pytorch/pytorch/pull/161668 should also fix the issue but we can land this PR for a safer test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161677
Approved by: https://github.com/kwen2501
ghstack dependencies: #161676
Fixes https://github.com/pytorch/pytorch/issues/161510
Test plan:
```
% cd third_party/kineto
% git checkout fe80f9319479265f7a208e615e16a363b993d50c; git submodule update --init --recursive
M libkineto/third_party/dynolog
M libkineto/third_party/fmt
M libkineto/third_party/googletest
Previous HEAD position was 5e75018 Fix Local Time on Windows Builds (#1104)
HEAD is now at fe80f93 Fix MSVC Error (#1134)
Submodule path 'libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159'
Submodule path 'libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21'
Submodule path 'libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723'
% git checkout 5e75018; git submodule update --init --recursive
M libkineto/third_party/dynolog
M libkineto/third_party/fmt
M libkineto/third_party/googletest
Previous HEAD position was fe80f93 Fix MSVC Error (#1134)
HEAD is now at 5e75018 Fix Local Time on Windows Builds (#1104)
warning: unable to rmdir 'third_party/prometheus-cpp': Directory not empty
Submodule path 'libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e'
Submodule path 'libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850'
Submodule path 'libkineto/third_party/fmt': checked out '0041a40c1350ba702d475b9c4ad62da77caea164'
Submodule path 'libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347'
% cd ../..
% git status
HEAD detached from 649e397c6de
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
(commit or discard the untracked or modified content in submodules)
modified: third_party/kineto (untracked content)
% time git submodule foreach --recursive git clean -ffdx
...
git submodule foreach --recursive git clean -ffdx 0.47s user 0.96s system 88% cpu 1.625 total
% git status
HEAD detached from 649e397c6de
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161748
Approved by: https://github.com/atalman
**Summary**
When `device_id` is not None, barrier() will choose the accelerator with the highest
priority, which means that if the test specifies CPU for testing while CUDA is
available on the host, barrier() will use CUDA. To avoid this and better respect
`self.device_type`, we add this branch to force barrier() to use CPU when
`self.device_type` is CPU and another accelerator is also available.
**Test**
`pytest test/distributed/tensor/test_dtensor_testbase.py`
**Debugging Output**
```
# from init_process_group()
init pg: backend=gloo, device_id = None
default_pg has backend: gloo, device_types: [device(type='cuda'), device(type='cpu')]
# from barrier()
barrier: device_ids = [10], devices = [], device = None, PG=[device(type='cuda'), device(type='cpu')]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161015
Approved by: https://github.com/tianyu-l
This PR refactors the XPU quantization ops to align their code structure with the CPU implementation for consistency. It also adds necessary header files to enable future integration with AOTI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157430
Approved by: https://github.com/angelayi
Summary:
Fixes #160749
For a model like
```
class M(torch.nn.Module):
    def forward(self, x):
        s = torch.sin(x)
        z = 1j * s
        return z
```
Its graph will be
```
graph():
    %x : [num_users=1] = placeholder[target=x]
    %sin : [num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%x,), kwargs = {})
    %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%sin, 1j), kwargs = {})
    return (mul,)
```
`1j` will appear as a constant complex argument in the `aten.mul`
Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_complex_constant
Rollback Plan:
Differential Revision: D80672323
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161517
Approved by: https://github.com/angelayi
Summary: this will only trigger it in the event that we are serializing a triton HOP. There are a few tests that do weird mocking that this function doesn't like, so this will prevent it from being called there.
Test Plan:
att
Rollback Plan:
Differential Revision: D81261486
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161737
Approved by: https://github.com/angelayi
SAC interaction with triton kernel:
- In eager, triton ops are not dispatchable, and so it is always ignored by SAC, i.e., always recomputed.
- In compile, although we wrap triton kernels into HOPs, allowing us to intercept them, we still recompute by default rather than save by default, so that compile maintains the invariant of using less memory than eager.
- If you want to do something else (e.g. save the output of your triton kernel) you should wrap it in a custom op.
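A hedged sketch of that last point, using the public custom-op and selective-checkpoint APIs (the op name `mylib::fused_scale` and its body are stand-ins for a real Triton kernel, not part of this PR):
```python
import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

# Stand-in for a Triton kernel wrapped as a dispatchable custom op so SAC can see it.
@torch.library.custom_op("mylib::fused_scale", mutates_args=())
def fused_scale(x: torch.Tensor) -> torch.Tensor:
    return x * 2.0

@fused_scale.register_fake
def _(x):
    return torch.empty_like(x)

# d(2x)/dx = 2; no saved values are needed for this toy backward.
fused_scale.register_autograd(
    lambda ctx, grad: grad * 2.0,
    setup_context=lambda ctx, inputs, output: None,
)

def policy(ctx, op, *args, **kwargs):
    # Save the custom op's output instead of recomputing it; recompute everything else.
    if op is torch.ops.mylib.fused_scale.default:
        return CheckpointPolicy.MUST_SAVE
    return CheckpointPolicy.PREFER_RECOMPUTE

def fn(x):
    return fused_scale(x).sin().sum()

x = torch.randn(8, requires_grad=True)
out = checkpoint(
    fn, x, use_reentrant=False,
    context_fn=lambda: create_selective_checkpoint_contexts(policy),
)
out.backward()
```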
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161541
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/xmfan
# why
- enable it to go through commont template heuristics point
- make easier to use in common extension point e.g. lookup table
# what
- break template heuristic into base + triton
- move k_split generation logic into a templateheuristic for decompose k
- register through normal mechanism
- to make testing work, add a context manager to temporarily set
template heuristics for a template/op to empty (effectively skipping
it). This is used for decompose k test to disable triton choices
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D80670918](https://our.internmc.facebook.com/intern/diff/D80670918)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161098
Approved by: https://github.com/jansel
ghstack dependencies: #161026, #161097
# why
- make it easier to integrate into lookup table later
# what
- current version generates templates on the fly and uses them
to generate a single choice
- lookup table and performance model work best when there is a
stable set of templates (with predictable names) and those
are then parametrized
- this change makes it so that there is a single DecomposeK template
with a stable name, and the k split is the only parametrization we do
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_decompose_k_dynamic_False_bfloat16_sizes1 -v
```
Differential Revision: [D80670913](https://our.internmc.facebook.com/intern/diff/D80670913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161026
Approved by: https://github.com/PaulZhang12, https://github.com/jansel
Summary:
Add fastResizeToZero whenever we are reusing output tensors. Otherwise it keeps throwing this warning:
```
Warning: An output with one or more elements was resized since it had shape [10], which does not match the required output shape [181]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (function _resize_output_check)
```
Test Plan:
Run local replayer.
```
MODEL_TYPE=ads_mtml_offsite_cvr_oba_optout_dedicated_model
MODEL_ENTITY_ID=786096203
SNAPSHOT_ID=11
HARDWARE_TYPE=1 ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 3443 2>&1 | tee ~/logs/${MODEL_TYPE}/predictor_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}
sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 1000 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_ads_mtml_offsite_cvr_oba_optout_dedicated_model_100 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} false 3443 false 2>&1 | tee ~/logs/${MODEL_TYPE}/replayer_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}
```
Before: P1921177565
After: P1921178087
Rollback Plan:
Differential Revision: D81177596
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161679
Approved by: https://github.com/henryoier
Summary: Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept).
Test Plan:
updated tests
Rollback Plan:
Differential Revision: D79903317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160198
Approved by: https://github.com/ezyang
This PR fixes:
- Numpy >= 2.1 version detection (instead of python 3.13 version detection) to skip some tests (numpy 2.1 can be installed for older python versions)
```
test_quantization.py::TestDynamicQuantizedOps::test_qlinear
test_quantization.py::TestDynamicQuantizedOps::test_qlinear_legacy
test_quantization.py::TestQuantizedLinear::test_qlinear
test_quantization.py::TestQuantizedLinear::test_qlinear_leaky_relu
test_quantization.py::TestQuantizedLinear::test_qlinear_relu
test_quantization.py::TestQuantizedLinear::test_qlinear_tanh
test_quantization.py::TestQuantizedLinear::test_qlinear_with_input_q_dq_qweight_dq_output_fp32
```
- A couple of SDPA tests on MI355 by adjusting fudge_factors:
```
test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32
test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161429
Approved by: https://github.com/jeffdaily
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).
Changes Included
- Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.
Fixes #147282
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos, https://github.com/atalman
And actually use the right function, as [`torch.round`](https://docs.pytorch.org/docs/stable/generated/torch.round.html) doesn't use `std::round`, but rather `std::rint`, which can be easily seen by running something like
```python
import torch
print(torch.arange(-3., 3., step=.5, device='mps').round())
print(torch.arange(-3., 3., step=.5, device='mps').cpu().round())
```
Before this change it printed
```
tensor([-3., -3., -2., -2., -1., -1., 0., 1., 1., 2., 2., 3.], device='mps:0')
tensor([-3., -2., -2., -2., -1., -0., 0., 0., 1., 2., 2., 2.])
```
But after this change the results match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161712
Approved by: https://github.com/dcci
Mainly, this helps tell the user more info about the operator that
failed to run if it fails during sharding propagation.
Previously, only this exception would be raised:
```
RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.')
```
Now you get both the above exception as well as
```
The above exception was the direct cause of the following exception:
RuntimeError: Sharding propagation failed for Op(op=aten.view.default, args_schema=Spec((Replicate(), Shard(dim=0), Shard(dim=1), Shard(dim=2)) on (8, 8, 4)), [64, 4] @ mesh: (1, 2, 2, 2))
```
<stacktrace omitted>
<details><summary>detailed error</summary>
```
======================================================================
ERROR: test_linear (__main__.TestDTensor)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 668, in wrapper
self._join_processes(fn)
File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 932, in _join_processes
self._check_return_codes(fn, elapsed_time)
File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 972, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Process 4 exited with error code 10 and exception:
Traceback (most recent call last):
File "/data/users/whc/pytorch/torch/distributed/tensor/_dispatch.py", line 150, in dispatch
self.sharding_propagator.propagate(op_info)
File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 309, in propagate
OutputSharding, self.propagate_op_sharding(op_info.schema)
File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 45, in __call__
return self.cache(*args, **kwargs)
File "/data/users/whc/pytorch/torch/distributed/tensor/_sharding_prop.py", line 329, in propagate_op_sharding_non_cached
op_strategy = self.op_strategy_funcs[op_schema.op](strategy_schema)
File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 673, in reshape_strategy
input_tgt_placements, output_placements = propagate_shape_and_sharding(
File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 601, in propagate_shape_and_sharding
in_dim = get_in_dim_to_shard(cmd)
File "/data/users/whc/pytorch/torch/distributed/tensor/_ops/_view_ops.py", line 537, in get_in_dim_to_shard
raise RuntimeError(
RuntimeError: ('Attempted to flatten sharded dimension 1, ', 'but only the leftmost dim of a Flatten can be sharded.')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 816, in run_test
getattr(self, test_name)()
File "/data/users/whc/pytorch/torch/testing/_internal/common_distributed.py", line 670, in wrapper
fn()
File "/data/users/whc/pytorch/torch/testing/_internal/common_utils.py", line 3224, in wrapper
method(*args, **kwargs)
File "/data/users/whc/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 490, in wrapper
raise e
File "/data/users/whc/pytorch/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 487, in wrapper
func(self, *args, **kwargs) # type: ignore[misc]
File "/data/users/whc/pytorch/test.py", line 60, in test_linear
print("results: ", distributed_linear(distributed_input))
File "/data/users/whc/pytorch/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/users/whc/pytorch/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/whc/pytorch/torch/nn/modules/linear.py", line 134, in forward
return F.linear(input, self.weight, self.bias)
File "/data/users/whc/pytorch/torch/_compile.py", line 53, in inner
return disable_fn(*args, **kwargs)
File "/data/users/whc/pytorch/torch/_dynamo/eval_frame.py", line 1005, in _fn
return fn(*args, **kwargs)
File "/data/users/whc/pytorch/torch/distributed/tensor/_api.py", line 358, in __torch_dispatch__
return DTensor._op_dispatcher.dispatch(
File "/data/users/whc/pytorch/torch/distributed/tensor/_dispatch.py", line 163, in dispatch
raise RuntimeError(
RuntimeError: Sharding propagation failed for Op(op=aten.view.default, args_schema=Spec((Replicate(), Shard(dim=0), Shard(dim=1), Shard(dim=2)) on (8, 8, 4)), [64, 4] @ mesh: (1, 2, 2, 2))
```
</details>
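The chaining shown above is just the standard `raise ... from` pattern; a minimal illustrative sketch (the `propagate`/`OpInfo` names are stand-ins, not the actual dispatch code):
```python
class OpInfo:
    schema = "Op(op=aten.view.default, ...)"   # illustrative placeholder

def propagate(op_info):
    raise RuntimeError("Attempted to flatten sharded dimension 1, "
                       "but only the leftmost dim of a Flatten can be sharded.")

op_info = OpInfo()
try:
    propagate(op_info)
except Exception as e:
    # "from e" preserves the original error as __cause__, which is what produces the
    # "The above exception was the direct cause of the following exception" output.
    raise RuntimeError(f"Sharding propagation failed for {op_info.schema}") from e
```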
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161574
Approved by: https://github.com/zpcore, https://github.com/XilunWu
Reland of https://github.com/pytorch/pytorch/pull/159923
Couple of fixes:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and warn using the FQN of the lifted constant. We warn because some internal users complained it was regressing their exportability.
2. The previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes a silent incorrectness issue when exporting esm and qwen in non-strict.
3. We modify yolov3 to fix the previously silently incorrect behaviour.
4. We use strict export for levit_128 because it errors in non-strict due to more strict side effect checking.
When upgrading the torchbench pin, opacus_cifar10 seems to not run on eager anymore. I verified this by pushing a temporary PR on master with the new pin, so I added it to the expect_fail list.
Differential Revision: [D81133908](https://our.internmc.facebook.com/intern/diff/D81133908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161589
Approved by: https://github.com/avikchaudhuri
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
`get_remote_tensor`: returns a symmetric tensor given a peer rank.
The difference between `get_buffer` API and `get_remote_tensor` API:
- the former accepts an offset, whereas the latter doesn't
- the latter returns a symmetric tensor at `hdl.offset` on `peer`.
As a refactorization, this PR also moves the implementation of `get_buffer` and `get_signal_pad` to the `SymmetricMemory` level as their code is common to all backends.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161533
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471, #161532
(Porting most of #161008)
Hooking SymmetricMemory Allocator to MemPool so that user can create symmetric tensors with regular `torch.zeros`, `torch.arange` etc factories. Also so that our ops can have functional variants that create `out` tensors on symmetric memory.
To end users, this PR supports a python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
    tensor = torch.arange(numel, dtype=dtype, device=device)
```
Added tests for both use cases above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
`test_symmetric_memory.py` hangs like this:
```
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_False PASSED [5.6364s]
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_True ...
```
This set of tests parameterizes whether the user sets the device before calling `symm_mem.empty`.
However, such parametrization does not work well with `MultiProcContinuousTest` because the set device will "contaminate" the next test function.
Solution is to move the "set device" tests to a separate test suite using the traditional `MultiProcessTestCase`, which would respawn processes every time.
Hang is gone now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161668
Approved by: https://github.com/fegin
SAC interaction with triton kernel:
- In eager, triton ops are not dispatchable, and so it is always ignored by SAC, i.e., always recomputed.
- In compile, although we wrap triton kernels into HOPs, allowing us to intercept them, we still recompute by default rather than save by default, so that compile maintains the invariant of using less memory than eager.
- If you want to do something else (e.g. save the output of your triton kernel) you should wrap it in a custom op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161541
Approved by: https://github.com/drisspg, https://github.com/zou3519
ghstack dependencies: #160781
The PR #160222 replaced @skipCUDAIf with @requires_cuda_and_triton in test_torchinductor_opinfo.py, which caused the CI jobs for other devices to skip this large test suite. We attempted to revert #160222 but ran into conflicts. I then opened #160936 to revert the changes from #160222, but that resulted in CPU CI job timeouts. I also filed issue #161132 for assistance, but haven’t received a response yet.
To minimize the impact, this PR re-enables the test suite on XPU first. I will continue to seek help on re-enabling it for CPU afterwards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161477
Approved by: https://github.com/jansel
If TorchDispatchMode.ignore_compile_internals() is True, then we turn
off the TorchDispatchMode during the compilation process, instead
turning it back on during runtime of the compiled artifact.
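A hedged sketch of how a mode might opt in (the `LoggingMode` body is illustrative; `ignore_compile_internals` is the hook named above, assumed here to be overridable on the subclass):
```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class LoggingMode(TorchDispatchMode):
    # Assumed override point per the description above: returning True means the mode
    # is turned off while torch.compile compiles, then re-enabled when the compiled
    # artifact actually runs.
    @classmethod
    def ignore_compile_internals(cls):
        return True

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        print(f"dispatching {func}")
        return func(*args, **(kwargs or {}))

compiled = torch.compile(lambda x: x.sin() + 1)
with LoggingMode():
    compiled(torch.randn(4))   # runtime ops are logged; compilation internals are not
```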
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161648
Approved by: https://github.com/bdhirsh
Adds the pre-dispatch handling for the AC hop. This lets the HOP pre-dispatch export without actually pre-dispatch tracing into it. However, this is not sufficient to support AC in export:
- because the HOP body will still be in torch IR, so it will fail export verifiers
- the exported module also can't be ran in eager because the AC HOP relies on partitioner to embed RNG state saving/restoring
So it must be lowered by AOT Autograd into post-dispatch first before being executed. It suffices for my purposes, though.
If users had checkpoint API use in their exported model, the behavior goes from silently incorrect to a validation error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161479
Approved by: https://github.com/ydwu4
ghstack dependencies: #161353
# Context
In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`.
However, we would raise an exception if the subprocesses would be spawned in parallel via `ThreadPoolExecutor`, which is an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff).
The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start:
> Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted.
But on further reading, the Linux docs say [`sched_setaffinity` is per *thread*.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python doc is misleading.
I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07)
The upshot is that we actually *can* safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread.
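A minimal, Linux-only re-sketch of that verification (not the linked gist verbatim): the worker thread restricts its own affinity, and the main thread's mask is unaffected.
```python
import os
import threading

main_before = os.sched_getaffinity(0)   # pid 0 == the calling thread

def worker():
    cpus = sorted(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {cpus[0]})   # restrict only this worker thread
    print("worker affinity:", os.sched_getaffinity(0))

t = threading.Thread(target=worker)
t.start()
t.join()

# The main thread's mask is unchanged, so a parallel start can rely on each spawning
# thread setting affinity for the child it creates.
assert os.sched_getaffinity(0) == main_before
```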
# This PR
Remove restrictions against parallel start for NUMA binding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161576
Approved by: https://github.com/d4l3k
Prints ranges of ranks succinctly.
e.g.
For a strided list of ranks, summarizes down to start:stop:step
```
0:4096:512
```
Omits step if it's 1
```
0:8
```
Note: endpoints are exclusive. This may not be intuitive to everyone,
but in the first example above the last rank is 3584, and in the second it is
7.
Currently, this does not support combinations of striding _and_ ranges (e.g. it
cannot generate a representation like "0:2, 4:6, ..., 12:14"). Is this
needed / useful? If so it could be added.
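A minimal sketch of the summarization rule described above (illustrative, not the PyTorch implementation):
```python
def summarize_ranks(ranks: list[int]) -> str:
    """Collapse an arithmetic progression of ranks into "start:stop:step" (exclusive stop)."""
    if len(ranks) == 1:
        return str(ranks[0])
    step = ranks[1] - ranks[0]
    if any(b - a != step for a, b in zip(ranks, ranks[1:])):
        return ",".join(map(str, ranks))      # mixed striding/ranges are not summarized
    stop = ranks[-1] + step                   # exclusive endpoint, per the note above
    return f"{ranks[0]}:{stop}" if step == 1 else f"{ranks[0]}:{stop}:{step}"

print(summarize_ranks(list(range(0, 4096, 512))))  # 0:4096:512  (last rank is 3584)
print(summarize_ranks(list(range(8))))             # 0:8         (last rank is 7)
```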
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160284
Approved by: https://github.com/XilunWu
ProfilingGraphExecutor works like this:
1. do some unrelated JIT optimizations
2. Add profiling nodes to collect JIT information like tensor dtypes and shapes
3. Do some more unrelated JIT optimizations
4. Remove the profiling nodes and extract the tensor info, and then use the JIT tensor info to do optimizations.
This PR is intended to fix a bug in Step 4, where the profiling nodes were removed. It was previously assumed that all the things that were profiled were either Tensors or Optional[Tensor]s - otherwise, step 2 would not have introduced a profiling node.
However, we saw a case where step 3 would replace Optional[Tensor] inputs with `None` inputs (e.g. if a conditional that returned a Tensor or a None could be statically known to only follow the `None` branch).
To fix this, we essentially just modify the RemoveProfileNodesAndSpecializeTypes assert so that it accepts Tensors, Optional[Tensor]s, or None (the new part).
Note that this issue is probably somewhat uncommon (maybe why we didn't see it for the first 4 years that this code existed). I expect that, typically, any time that step 3 would convert `Optional[Tensor] -> None`, step 1 would have already done that. So it's difficult to reproduce in an end-to-end TorchScript workload.
Differential Revision: [D81068172](https://our.internmc.facebook.com/intern/diff/D81068172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161538
Approved by: https://github.com/nmacchioni
Summary:
In preparation for checking shape guards in export, this PR effectively switches `prefer_deferred_runtime_asserts_over_guards` to `False`, matching Dynamo.
Actually that's a lie: we switch it to `allow_complex_guards_as_runtime_asserts`, which is `False` by default but can be controlled via an internal API to be `True`. This makes the two flags synchronized, so we should be able to kill `allow_complex_guards_as_runtime_asserts` at this point.
Test Plan:
updated tests
Rollback Plan:
Differential Revision: D79734206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160111
Approved by: https://github.com/tugsbayasgalan
Fixes #161640
Check if tensors are contiguous before using the no-graph implementation. Using the script in the issue above with this change, I get the expected results.
```
MPS contiguous result sample: tensor([ 1.3600, -2.9516, 1.3207, -3.5132, 1.7061], device='mps:0')
MPS non-contig result sample: tensor([ 1.3600, -2.9516, 1.3207, -3.5132, 1.7061], device='mps:0')
CPU non-contig result sample: tensor([ 1.3600, -2.9516, 1.3207, -3.5132, 1.7061])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161641
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Summary: When performing constant folding, we must skip over operators that have symbolic `fill_value`.
Test Plan:
CI
Rollback Plan:
Reviewed By: kalpit-meta-1
Differential Revision: D80965936
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161437
Approved by: https://github.com/StellarrZ
Nested continuation function code objects are now unique w.r.t. stack trace below (and including) the current code object.
Without this change, e.g. in the added test, `f3` would be recompiled on the second graph break.
Followup: we can skip guards on continuation functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159786
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329, #159678, #159817, #160138
Fix comments to reflect that we no longer codegen cells to be sent to resume function as inputs - they are instead codegen'd after the unsupported instruction in order to build resume functions that are closures.
Also simplify some codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160138
Approved by: https://github.com/anijain2305
ghstack dependencies: #159329, #159678, #159817
This method never triggered. It's used in 2 tests and they pass, so no serious
concern.
Note that I did introduce and fix a latent bug: if we called
shutdown_compile_workers, jobs would crash with this change because ready_future
would already be finished when we called wait.
However, we only call wait in tests, so that bug is fine.
The other behaviour is that if you called shutdown, I believe we may
potentially block on your first triton compile after that, until the pool was
ready. This should correctly switch to direct mode until the pool is ready on
later warmups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161561
Approved by: https://github.com/masnesral
ghstack dependencies: #161452
Added an optional name argument to SubprocPool.submit.
We record this in a dictionary, and when raising exceptions, add the name.
We manage the lifecycle the same as the pending futures.
Added a specific testcase to make sure this logs correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161452
Approved by: https://github.com/masnesral
# Feature
2d launch grids with dynamic shapes can contain sympy expressions like `floor(x / 128 + y / 128)`. This breaks the dynamic shapes tracer which only supports `FloorDiv`, and not `floor`. To handle this case, call `sympy.together` prior to pattern matching to convert this to `floor((x + y) / 128)`. Then, we can recognize the pattern and map it to `FloorDiv(x + y, 128)`.
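A small standalone illustration of the normalization step (plain sympy, not the Inductor pattern matcher itself):
```python
import sympy

x, y = sympy.symbols("x y", integer=True, positive=True)

# Shape of the launch-grid expression inside floor(...) under dynamic shapes.
inner = x / 128 + y / 128

combined = sympy.together(inner)        # (x + y)/128
numer, denom = sympy.fraction(combined)
print(combined, numer, denom)           # (x + y)/128  x + y  128
# After this normalization, floor((x + y)/128) can be pattern-matched to FloorDiv(x + y, 128).
```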
# Test plan
Added a custom Triton test exposing this. The test calls a 2d autotuned kernel with dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161582
Approved by: https://github.com/nandesuka
Adding a new feature to torch.compile(fullgraph=True) which AOT-compiles a function with given example inputs.
On the user side it should look like:
```
def foo(x, y):
    return x + y
compiled_fn = torch.compile(fullgraph=True).aot_compile(((torch.randn(3, 4), torch.randn(3, 4)), {}))
```
This is different from the traditional `torch.compile` workflow, where the compiled object will be a drop-in replacement for the original eager model:
```
tensor input -> torch.compile() -> tensor output (and populates the cache entry)
```
`aot_compile` will instead return a compiled function as the result; it's purely functional and doesn't populate the compile cache entry in dynamo:
```
tensor input -> aot_compile() -> compiled function
```
The aot compiled function will be savable and loadable on disk as well:
```
torch.compile(fullgraph=True).aot_compile(...).save_compiled_function('my/path')
compiled_fn = torch.compiler.load_compiled_function("my/path")
```
Right now we treat the compiler backend as a black box and it needs to implement the following interface to make compile artifacts serializable:
```
class SerializableCallable:
    def save_compile_artifacts(): ....
    def load_compile_artifacts(): ....
```
We haven't implemented this for inductor yet, but this shouldn't be an issue since we gate this feature through `torch._dynamo.config.aot_compile` (which defaults to False), and this will be left as a follow-up PR to the current PR.
Differential Revision: [D80914270](https://our.internmc.facebook.com/intern/diff/D80914270/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161383
Approved by: https://github.com/tugsbayasgalan
Including the changes below:
- Add XPU support package 2025.2 build and test in CI for both Linux and Windows
- Keep XPU support package 2025.1 build in CI to ensure no break issue until PyTorch 2.9 release
- Upgrade XPU support package from 2025.1 to 2025.2 in CD for both Linux and Windows
- Rename Linux CI job name & image name to n & n-1
- Update XPU runtime pypi packages dependencies of CD wheels
- Remove deprecated support package version docker image build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158733
Approved by: https://github.com/EikanWang, https://github.com/atalman
Summary:
It's hard to understand how it's working in most of our models, but in general it looks like `aten::copy_` is replaced incorrectly.
There are two schemas for `aten::copy_`:
1. `aten::copy_.Tensor(Tensor(a!) self, Tensor other) -> Tensor(a!)`
2. `aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)`
According to the logic in the comments we don't need one of the parameters for `aten::index_put_`.
It seems the logic was inferred from the ordinary `aten::copy`, where there can be a third parameter, the `non_blocking` flag.
Depending on the execution environment, the sliced copy can be replaced either by the first schema or by the second schema with the default parameter explicitly set to `False`.
If the first schema is selected, it will lead to a crash (which is easy to catch in our prod env). If the second schema is selected, there is no crash, but the third parameter is treated as the `accumulate` parameter of the `index_put_` function, which doesn't make sense.
So, in any case, usage of the third parameter must be removed from the `aten::copy_` replacement.
For more details and check this post:
https://fb.workplace.com/groups/1405155842844877/permalink/25337687649165028/
Test Plan:
The test fails in the production environment only.
In the test env, the `non_blocking` flag is mapped as `False` to the `accumulate` flag, which doesn't cause the test to fail but makes no sense in terms of flag mapping.
The export works without errors; before the fix it was failing with an out-of-bounds vector index access, like this:
```
1095 _C._jit_onnx_log("Torch IR graph at exception: ", graph)
File ~/.bento/kernels/bento_kernel_gaia_ml/1578/bento_kernel_gaia_ml_binary-inplace#link-tree/torch/onnx/utils.py:636, in _optimize_graph(graph, operator_export_type, _disable_torch_constant_prop, fixed_batch_size, params_dict, dynamic_axes, input_names, module)
629 _C._jit_pass_lower_all_tuples(graph)
630 # in _jit_pass_onnx, symbolic functions are called for each node for conversion.
631 # However, there are nodes that cannot be converted without additional context.
632 # For example, the number of outputs from split (and whether it is static or dynamic) is unknown
633 # until the point where it is unpacked by listUnpack node.
634 # This pass does a preprocess, and prepares the nodes such that enough context can be received
635 # by the symbolic function.
--> 636 _C._jit_pass_onnx_remove_inplace_ops_for_onnx(graph, module)
637 _C._jit_pass_onnx_preprocess(graph)
639 # onnx does not support tuples, so try to remove them
RuntimeError: vector::_M_range_check: __n (which is 2) >= this->size() (which is 2)
```
The test script:
```
import torch as th
import tempfile
class CopyTest(th.nn.Module):
    def forward(
        self,
        input_th: th.Tensor
    ):
        to_fill = th.ones((3, 3))
        to_fill[:, 0] = input_th[:, 0]
        return to_fill
m = CopyTest()
test_tensor = th.zeros((3, 3))
with tempfile.NamedTemporaryFile() as f:
    th.onnx.export(
        m,
        (test_tensor,),
        f,
        export_params=True,
        opset_version=17,
        do_constant_folding=True,
        input_names=["input"],
        output_names=["features"],
        dynamo=False,
    )
```
The exported model test:
```
import torch
import onnx
import onnxruntime
model_name = '/home/ironsided/test_model.onnx'
onnx_model = onnx.load(model_name)
onnx.checker.check_model(onnx_model)
example_inputs = (torch.zeros(3, 3),)
onnx_inputs = [tensor.numpy(force=True) for tensor in example_inputs]
print(f"Input length: {len(onnx_inputs)}")
print(f"Sample input: {onnx_inputs}")
ort_session = onnxruntime.InferenceSession(
    model_name, providers=["CPUExecutionProvider"]
)
onnxruntime_input = {input_arg.name: input_value for input_arg, input_value in zip(ort_session.get_inputs(), onnx_inputs)}
# ONNX Runtime returns a list of outputs
onnxruntime_outputs = ort_session.run(None, onnxruntime_input)[0]
print(onnxruntime_outputs)
```
The produced result is correct:
```
Input length: 1
Sample input: [array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]], dtype=float32)]
[[0. 1. 1.]
[0. 1. 1.]
[0. 1. 1.]]
```
Rollback Plan:
Differential Revision: D80797028
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161263
Approved by: https://github.com/justinchuby, https://github.com/jermenkoo
Summary:
- Emit a structured trace per compiled graph execution to reconstruct execution order in TLParse.
- Adds debug.log_graph_execution(name) called from `CompiledFxGraph.__call__`, producing an artifact named inductor_graph_execution with payload {"graph": "graph_<id>"}.
Testing:
- Add inline test to verify structure and output
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160448
Approved by: https://github.com/xmfan
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).
Changes Included
- Implemented device_assert op and overrides has_side_effect to return True to avoid removal by dead code elimination.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler.
- Implemented the codegen method for the device_assert op. This supports generating C++ and Triton code.
- Added test cases to verify both "should throw" and "should not throw" scenarios.
Fixes #147282
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos
To facilitate the integration of a new backend, we plan to publish a new development note that details all the key components, hoping to speed up the development of other accelerators.
This PR is the beginning of this note and covers the registration of operators; we will gradually improve it and keep it in sync with OpenReg's code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158644
Approved by: https://github.com/albanD
Summary:
Use debug handle on kernel names to distinguish different calls to the same kernel.
Previous kernel name: kernel_name
New kernel name: kernel_name:debug_handle
We add the debug handle to the tlparse artifacts: `inductor_provenance_tracking_node_mappings` and `inductor_provenance_tracking_kernel_stack_traces`.
We also add debug handles in the comments of the generated code so we can map to them in the provenance tracking highlighter tool: https://github.com/pytorch/tlparse/pull/134
Example output code is below. If a kernel doesn't have a debug handle, the `[Provenance debug handles]` comment line will not be written.
```
# Topologically Sorted Source Nodes: [y, z], Original ATen: [aten.addmm, aten.gelu]
# [Provenance debug handles] triton_poi_fused_addmm_gelu_2:3
stream0 = get_raw_stream(0)
triton_poi_fused_addmm_gelu_2.run(buf4, primals_5, 300, stream=stream0)
```
The debug handles will also be used by downstream profilers such as zoomer.
Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing
```
Rollback Plan:
Differential Revision: D78994959
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161110
Approved by: https://github.com/angelayi
For kernels that need peer pointers directly, the rendezvous handle should allow the user to get the offset of the tensor w.r.t. the base allocation address. Thus the need to add an `offset` field to the SymmMem handle.
But we don't want to cache all the handles just because they have different offsets, hence the search and cache logic below:
(i) At rendezvous, the search key is still `x.storage().data_ptr()`, like now, but it should do search in 2 parts - one is just dictionary lookup, like today, if that failed, it needs to search `allocations_` to see if the storage ptr falls in one of the segments. This is possible as we have all segments recorded during alloc.
(ii) If this segment hasn't been rendezvoused, we rendezvous it, cache it in the `symm_mem_` map with its base address as key.
(iii) We still need to return a handle for the current tensor, with a corresponding offset. This handle will be a shallow copy of the base handle, with the offset adjusted.
Some impl details:
(i.1) If we find a matching allocation, we can immediately use the allocation base address to do a re-search in `symm_mem_`.
(iii.1) To make the handle copy shallow, we move the common information -- base ptrs, base signal pad, etc -- to a structure referenced by both handles. The structure is called `NVSHMEMPeerAllocInfo`. A copy of handle just adds one more `intrusive_ptr` to it. The handle copy constructor accepts an `offset` argument.
Test:
Existing tests should not fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161470
Approved by: https://github.com/ngimel
# Feature
Add support for custom Triton kernels to the FX backend. This turned out not to require any new features, except for a minor change to handle `tl.constexpr` arguments which are not part of the autotuning config.
# Caveat
This may not cover every possible case. For example, we might need more features for autotuning custom Triton code. This PR entirely skips the [custom codegen ](https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/triton_kernel_wrap.py#L1034-L1039) for user-defined grid functions, but there may be edge cases requiring this logic. However, this PR seems to do a reasonable job as many of the grids end up being written into Inductor/Triton metadata and don't require special codegen.
As a follow up, I'm planning to test this against all of AOTI's custom Triton kernel tests.
# Test plan
Added a CI test using a custom Triton kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161474
Approved by: https://github.com/angelayi
We already have a context manager "set_checkpoint_early_stop". This PR adds a kwarg that toggles the same setting.
It is also useful to have a kwarg version of the setting in addition to the context manager because is annoying to apply a context manager when the AC is being applied via CheckpointWrapper.
Similar to the "debug" kwarg and the corresponding "set_checkpoint_debug_enabled" context manager, the context manager defaults to None and overrides the local setting when non-None.
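A hedged usage sketch (assuming the kwarg is named `early_stop`, mirroring the existing `set_checkpoint_early_stop` context manager):
```python
import torch
from torch.utils.checkpoint import checkpoint, set_checkpoint_early_stop

x = torch.randn(4, requires_grad=True)

# Existing spelling: a context manager around the checkpointed call.
with set_checkpoint_early_stop(False):
    out = checkpoint(torch.sin, x, use_reentrant=False)

# New spelling assumed by this sketch: a per-call kwarg toggling the same setting.
# As with "debug", the context manager (when non-None) overrides the local kwarg.
out = checkpoint(torch.sin, x, use_reentrant=False, early_stop=False)
```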
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160781
Approved by: https://github.com/tianyu-l
If the onnx exporter falls back to draft_export with big models, this takes forever for users and possibly spams the printout, which keeps users from getting to their stack trace with strict=False.
We could consider making another API for draft_export as a debugging tool, or combining it with report=True when the "model is small"?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161454
Approved by: https://github.com/justinchuby
This change removes the need for fences in global_reduce by converting the stores to reduce_buffer[] into atomics + return. This is crucial for perf on architectures with split caches (e.g. MI300), where fences are inherently costly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161180
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary:
convert_frame.compile_frame used to take a callback transform function which would capture the frame object it has, but the frame information was not passed directly into the compile_frame function.
This PR changes the signature of compile_frame so that the frame information is passed directly into the function without taking a callback. This makes it easier to build a fullgraph capture API on top of compile_frame.
Test Plan:
CI
Rollback Plan:
Differential Revision: D81041296
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161514
Approved by: https://github.com/tugsbayasgalan
```python
import torch
torch._dynamo.config.capture_scalar_outputs = True
class M(torch.nn.Module):
    def forward(self, idx, x):
        u0 = idx.item()
        x0 = x.select(0, u0)
        def fn():
            return x0.sin()
        return torch.cond(x0.sum() > 0, fn, fn)
m = M()
out = torch.compile(m, fullgraph=True)(torch.tensor(0, dtype=torch.int64, device="cuda"), torch.randn(3, 3, device="cuda"))
print(out)
```
Before the PR, we didn't track the storage_offset symbol of a tensor. After https://github.com/pytorch/pytorch/pull/157605, we create an unbacked_symint for the storage_offset of the result of select. So when we try to lift the free basic symbols of x0 while speculating fn, we find a free symbol that's not bound to a proxy.
This PR tracks the symbols of storage_offset and associates them with a proxy using torch.ops.aten.storage_offset.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161199
Approved by: https://github.com/zou3519
ghstack dependencies: #161198
Before the change in this PR, we have an error for the following code
```python
import torch
torch._dynamo.config.capture_scalar_outputs = True
class M(torch.nn.Module):
    def forward(self, idx, x):
        u0 = idx.item()
        x0 = x.select(0, u0)
        def fn():
            return x0.sin()
        return torch.cond(x0.sum() > 0, fn, fn)
m = M()
out = torch.compile(m, fullgraph=True)(torch.tensor(0, dtype=torch.int64), torch.randn(3, 3))
```
The error is caused when speculating fn: we try to lift the symbol of x0.storage_offset() but find that the symbol doesn't have a source associated with it.
What really happens is that, when the input tensor is a scalar tensor of int type and resides on CPU, we have a shortcut that creates a normal symint when .item() is called, see https://github.com/pytorch/pytorch/pull/126245.
However, previously we only tracked the unbacked symint output of an operation, because we believed all backed symints must have a source associated with them and have already been lifted as inputs at the top level. Now this invariant no longer holds, so we end up with an error saying the symbol doesn't have a source (because only inputs and symbols derived from inputs have sources, and the result of .item() doesn't have a source).
In this PR, we start to also track the normal symint with the proxy that created it (i.e. in this case the proxy for .item()).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161198
Approved by: https://github.com/zou3519
A performance optimization. Using `torch.addmm`, which fuses `matrix multiply + scale + add` into one op.
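A generic illustration of the fusion (not the Muon update itself): `torch.addmm(M, A, B, beta=b, alpha=a)` computes `b*M + a*(A @ B)` in a single op.
```python
import torch

M = torch.randn(32, 32)
A = torch.randn(32, 64)
B = torch.randn(64, 32)
beta, alpha = 0.95, -1e-3   # arbitrary scale/add coefficients for the illustration

ref = beta * M + alpha * (A @ B)                      # matmul, scale, add as separate ops
fused = torch.addmm(M, A, B, beta=beta, alpha=alpha)  # single fused op
torch.testing.assert_close(ref, fused)
```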
**Benchmark**
In a QWEN-like 0.5B model training we observed average `optimizer.step()` latency speedup: matmul ~44.5 ms -> addmm ~27.4 ms: a **1.62×** speedup.
matmul
<img width="1403" height="600" alt="Screenshot 2025-08-24 at 3 15 37 PM" src="https://github.com/user-attachments/assets/a77a68d4-da3c-473a-97f0-e6ef0a3b46d9" />
addmm
<img width="1426" height="602" alt="Screenshot 2025-08-24 at 3 13 42 PM" src="https://github.com/user-attachments/assets/e493af36-44d3-4026-9f7c-fd0f9cdbc7e5" />
**Testing**
End-to-end training:
We used a training script that pre-trains a QWEN-like model on `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curves show consistency between normal matmul and addmm.
<img width="1035" height="434" alt="Screenshot 2025-08-24 at 2 56 21 PM" src="https://github.com/user-attachments/assets/b96b13e3-0a01-4908-853c-d917b41f3d75" />
Unit test:
```python
# dummy model and data
model0 = Linear(10, 10, bias=False)
model1 = copy.deepcopy(model0)
inputs = torch.randn(8, 10)
targets = torch.randn(8, 10)
loss = MSELoss()
lr = 1e-3
wd = 0.1
momentum = 0.95
opt_ref_muon = Muon(
    params=model0.parameters(),
    lr=lr,
    weight_decay=wd,
    momentum=momentum,
    nesterov=nesterov,
    adjust_lr_fn="original",
)
opt_exp_muon = Muon(
    params=model1.parameters(),
    lr=lr,
    weight_decay=wd,
    momentum=momentum,
    nesterov=nesterov,
    adjust_lr_fn="original",
    use_addmm=True,
)
out_ref = model0(inputs)
loss_ref = loss(out_ref, targets)
opt_ref_muon.zero_grad()
loss_ref.backward()
opt_ref_muon.step()
out_exp = model1(inputs)
loss_exp = loss(out_exp, targets)
opt_exp_muon.zero_grad()
loss_exp.backward()
opt_exp_muon.step()
for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
    torch.testing.assert_close(p_ref, p_exp)
```
shows numeric difference, but this is expected on bf16 precision:
```
Mismatched elements: 96 / 100 (96.0%)
Greatest absolute difference: 8.985400199890137e-05 at index (1, 9) (up to 1e-06 allowed)
Greatest relative difference: 0.007370449136942625 at index (0, 6) (up to 1e-05 allowed)
```
~~Introduced a flag that allows users to opt in, as there are numerical differences relative to the original implementation.~~
Update: since `addmm` fuses the math ops, there are fewer intermediate roundings, so it is more numerically accurate than the original form. Based on this, we opt to make `addmm` the default and only option.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161379
Approved by: https://github.com/janeyx99
Summary: Since Inductor skips JIT compilation for Triton kernels, we need to manually invoke `knobs.runtime.jit_post_compile_hook` if one exists. Here, we do this to enable Tritonparse to extract launch metadata from Inductor launched kernels. We can control whether or not Inductor will run the hook with a new `TORCHINDUCTOR_RUN_JIT_POST_COMPILE_HOOK=1 ` config variable.
Reviewed By: davidberard98
Differential Revision: D80624932
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161443
Approved by: https://github.com/FindHao
Summary:
We split the refactoring in two parts for forward compatibility concerns
First, we land the deserialization (loading part)
Then, we land the serialization (saving part)
Save weights and constants as individual files in PT2 archive. Each weight/constant will be saved as raw bytes, unless it is a custom object (TorchBind object) or a non-fake tensor subclass, for these two special cases we still save them using pickle.
The metadata of saved tensors along with the file name will be saved as `PayloadMeta`.
The mapping from FQN to `PayloadMeta` will be saved as `PayloadConfig` under `WEIGHTS_CONFIG_FORMAT` and `CONTANTS_CONFIG_FORMAT`
This changes the serialization on the Python side when calling `torch.export.save()`.
For deserialization in python `torch.export.load()`, we make it BC-safe by allowing loading legacy format weights/constants.
For deserialization in C++ `torch/nativert/ModelRunner.cpp`, we make this a BC breaking change as currently the OSS ModelRunner API is not being used.
The file structure
```
├── archive_format
├── archive_version
├── byteorder
├── .data
│ ├── serialization_id
│ └── version
├── data
│ ├── sample_inputs
│ │ └── model.pt
│ ├── constants
│ │ ├── tensor_0
│ │ ├── tensor_1
│ │ └── model_constants_config.json
│ └── weights
│ ├── weight_0
│ ├── weight_1
│ ├── weight_2
│ ├── weight_3
│ └── model_weights_config.json
└── models
└── model.json
```
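For reference, a minimal save/load round trip using the public API this affects (the model here is just a placeholder):
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

ep = torch.export.export(M(), (torch.randn(2, 4),))
torch.export.save(ep, "model.pt2")    # weights/constants are written as individual archive entries
ep2 = torch.export.load("model.pt2")  # loading stays BC-safe with the legacy format
print(ep2.module()(torch.randn(2, 4)).shape)
```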
Test Plan:
CI
Rollback Plan:
Differential Revision: D80035490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160394
Approved by: https://github.com/SherlockNoMad
Summary: In `tuned_scaled_mm()`, we unsqueeze any scalar scale from [] -> [1, 1]. Later, when we are determining how to set the `SCALING_ROWWISE` kernel attribute, we check whether the scale has 2 dimensions. However, since we previously unsqueezed any scalar scales, this will always evaluate to True.
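A tiny illustration of the logic bug (the names here are just for the sketch, not the Inductor internals):
```python
import torch

scale = torch.tensor(0.5)                 # tensorwise (scalar) scale
unsqueezed = scale.reshape(1, 1)          # the [] -> [1, 1] unsqueeze described above
scaling_rowwise = unsqueezed.dim() == 2   # always True, so it cannot distinguish scaling modes
print(scaling_rowwise)
# The dimensionality (or numel) has to be inspected before the reshape to tell
# tensorwise scaling apart from rowwise scaling.
```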
Test Plan:
Run the following tests in test/inductor/test_fp8.py:
test_tensorwise_scaling_tma_template
test_rowwise_scaling_tma_template
Rollback Plan:
Differential Revision: D80108117
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160450
Approved by: https://github.com/eellison
Summary:
This diff removes configs that require more shared memory than the hardware limit, which causes the following compilation error:
```
No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
```
Test Plan:
```
buck2 test mode/dev-nosan fbcode//caffe2/test/inductor:max_autotune -- test_max_autotune_prune_choices -v 1,stderr
```
Rollback Plan:
Differential Revision: D80594562
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161040
Approved by: https://github.com/eellison
This is far simpler than #155164 since we never destroy the cudaGraphExec_t.
The request comes from TRT-LLM specifically. The motivation is that some power users would like to mutate specific kernel parameters via APIs like `cudaGraphExec*SetParams` after a cuda graph has been instantiated. For example, a common request has been to be able to change the sequence length of attention kernels, after having captured a graph for the largest possible sequence length. It turns out that the host overhead you eliminate via cuda graphs in LLM inference ends up causing an increase in computation time when you size your kernels to the maximum possible sequence length (which I believe is done in both TRT-LLM and vLLM). Attention is the most problematic kernel because its computation time is quadratic in the sequence length, rather than linear.
This can work if your attention kernel can work for arbitrary shapes (this is not the case for all attention implementations! Many of them specialize with templates), and you have a persistent kernel that allocates only as many blocks as you have SM's (so you don't have to figure out how many blocks to allocate for a specific sequence length). Using a conditional SWITCH node is a better generic approach to this problem, but that requires more infrastructure work.
Note that this requires knowledge of the exact location of the value in your kernel's parameter buffer to mutate. It won't work with arbitrary stream capture code whose kernels you don't know beforehand. So I expect this code path to be rarely used.
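For background, a minimal sketch of the standard capture/replay pattern this builds on (assuming a CUDA device; the new API additionally lets power users mutate parameters of the instantiated graph, which is not shown here):
```python
import torch

if torch.cuda.is_available():
    static_in = torch.randn(8, device="cuda")
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = static_in * 2        # kernels and their arguments are baked in at capture time
    static_in.copy_(torch.arange(8.0, device="cuda"))
    g.replay()                            # reruns the captured kernels against the same buffers
    print(static_out)                     # reflects the new contents of static_in
```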
Testing:
```
pytest -s -k raw_graph_exec test/test_cuda.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161294
Approved by: https://github.com/ngimel, https://github.com/BoyuanFeng, https://github.com/eellison, https://github.com/eqy
Summary:
- Emit a structured trace per compiled graph execution to reconstruct execution order in TLParse.
- Adds debug.log_graph_execution(name) called from `CompiledFxGraph.__call__`, producing an artifact named inductor_graph_execution with payload {"graph": "graph_<id>"}.
Testing:
- Add inline test to verify structure and output
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160448
Approved by: https://github.com/xmfan
## Problem
Fixing parameter mismatch issue during torch.export with strict mode (see "How to reproduce the issue" section below):
When there are two attributes mapping to the same tensor, strict mode will
1. Build a standard param buffer table to standardize the names (the bug happens [here](f861dc1826/torch/export/_trace.py (L356))! When 2 parameters have the same id(param), the latter name overwrites the previous one)
2. [Update](f861dc1826/torch/export/_trace.py (L1481)) the exported signature with the updated standard FQN (problematic)
3. When getting exported_program.module(), call [_unlift_exported_program_lifted_states](f861dc1826/torch/export/exported_program.py (L1297)) to recover attributes from the exported signature, where the parameter names are defined and standardized
As a result, the named_parameters of this module will have the overwritten name instead of the original name.
## How to reproduce the issue?
To reproduce the issue shared by @taotaohuang001 (torch version: 2.8.0):
```python
import torch
from torch import nn

# ---- Toy model with embedding weight sharing (aliasing) ----
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding_layers = nn.ModuleDict()
        tbl = nn.Embedding(100, 8)
        self.embedding_layers["ActorId"] = tbl
        # Alias: reuse the SAME module instance for another feature
        self.embedding_layers["RootActorId"] = self.embedding_layers["ActorId"]
        self.proj = nn.Linear(16, 1)

    def forward(self, feats: dict[str, torch.Tensor]):
        e1 = self.embedding_layers["ActorId"](feats["ActorId"])
        e2 = self.embedding_layers["RootActorId"](feats["RootActorId"])
        return self.proj(torch.cat([e1, e2], dim=-1))

torch.manual_seed(0)
m = Toy().eval()

# Show pre-export parameter names (canonicalized; shared weight appears once)
print("PRE-EXPORT named_parameters:")
print([name for name, _ in m.named_parameters()])

# Sanity: the two feature names point to the same weight object
w1 = m.embedding_layers["ActorId"].weight
w2 = m.embedding_layers["RootActorId"].weight
print("PRE-EXPORT alias -> same object:", w1 is w2, "| same storage:", w1.data_ptr() == w2.data_ptr())

# Example inputs (dict structure will be captured by export)
ex_in = {
    "ActorId": torch.randint(0, 100, (4,)),
    "RootActorId": torch.randint(0, 100, (4,)),
}

# ---- Export (in memory) and materialize the runnable module ----
ep = torch.export.export(m, (ex_in,), strict=True)
gm = ep.module()  # GraphModule with new (canonical) parameter names

print("\nPOST-EXPORT named_parameters (GraphModule):")
post_names = [name for name, _ in gm.named_parameters()]
print(post_names)

# Prove alias persists after export: run fwd/bwd and check a single grad tensor exists
out = gm(ex_in).sum()
out.backward()

# Find the embedding weight in the exported module by shape (100, 8)
emb_names = [name for name, p in gm.named_parameters() if p.shape == torch.Size([100, 8])]
print("\nEmbedding param (post-export) canonical name:", emb_names[0] if emb_names else "<not found>")

# Show that only one grad exists for the shared table
for name, p in gm.named_parameters():
    if p.grad is not None and p.shape == torch.Size([100, 8]):
        print("Grad present on shared embedding weight:", name, "| grad shape:", tuple(p.grad.shape))
        break
```
And you will see that the parameter names differ before and after export:
```
PRE-EXPORT named_parameters:
['embedding_layers.ActorId.weight', 'proj.weight', 'proj.bias']
PRE-EXPORT alias -> same object: True | same storage: True
POST-EXPORT named_parameters (GraphModule):
['embedding_layers.RootActorId.weight', 'proj.weight', 'proj.bias']
Embedding param (post-export) canonical name: embedding_layers.RootActorId.weight
Grad present on shared embedding weight: embedding_layers.RootActorId.weight | grad shape: (100, 8)
```
## Solution
We fix this issue by making sure a later named parameter does not overwrite the `param_buffer_table` entry when the original model's named parameter already maps to a certain parameter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160600
Approved by: https://github.com/angelayi
The performance cost of `dict` lookups keyed by `OpSchema` is a
significant minority of DTensor overhead. With this change we shave a
net ~1% off the total running time of the benchmark from #160580, as
measured by using cProfile and comparing cumulative time spent in
propagate + OpSchema's `__post_init__`. (`__post_init__` grew from
2.5% to 6.4% (+3.9%) and propagate shrank from 12.5% to 7.8% (-4.7%)).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161234
Approved by: https://github.com/wconstab
ghstack dependencies: #161231
Updates the inductor-wrapper-fxir code to use the kernel.op_overload when generating extern kernel calls. This way we can keep the IR consistent with using ATen ops.
TODO: we're also inserting torch.empty_strided calls -- need to turn this into aten too
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161195
Approved by: https://github.com/blaine-rister
Summary: This change updates `getattr_recursive` to handle qualnames with ModuleList that contain digit indices, for example, `op_instances.1.value_model.feature_weights`
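A minimal sketch of the idea (a hypothetical helper, not the exact implementation being changed): treat purely numeric path components as indices into `ModuleList`-style containers rather than attribute names.
```python
from torch import nn

def getattr_recursive_sketch(obj, qualname: str):
    # Walk a dotted qualname, indexing when a component is a digit (e.g. "op_instances.1.weight").
    for part in qualname.split("."):
        obj = obj[int(part)] if part.isdigit() else getattr(obj, part)
    return obj

m = nn.Module()
m.op_instances = nn.ModuleList([nn.Linear(2, 2), nn.Linear(2, 2)])
print(getattr_recursive_sketch(m, "op_instances.1.weight").shape)
```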
Test Plan:
TBA
Rollback Plan:
Reviewed By: jiayisuse
Differential Revision: D80503985
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161204
Approved by: https://github.com/jiayisuse
In this PR we will port all distributed pipeline test files.
We enable Intel GPU with the following methods, trying our best to keep the original code style:
1. instantiate_device_type_tests()
2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend
3. use "requires_accelerator_dist_backend()" to replace requires_nccl()
4. use "get_default_backend_for_device()" to get the backend
5. enable XPU for some test paths
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159033
Approved by: https://github.com/guangyey, https://github.com/kwen2501
Previously, DTensor kept its own copy of the generator state after the
first time a random operator was called on a DTensor. This copy would
evolve independently from the generator outside of DTensor.
After adding support for users to pass a specific generator into
random operators (e.g. `uniform_(..., generator=)`), it was determined
(in discussion on #159991) to change the semantics so that any random
operations performed on DTensor would evolve the state of the publicly
visible generators (either the default one or user-passed one).
The upsides are (1) it is now possible to call torch.manual_seed() at
any point in the program and have a consistent effect on DTensor, (2)
DTensor ops have an observable effect on the generator. The downside is
that users are now responsible for seeding their generator before using
DTensor, ensuring all ranks use the same seed.
Fixes #159991
confirmed docs rendered OK
<img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160482
Approved by: https://github.com/wanchaol
A single-device version of Muon. The algorithm follows Keller Jordan's [Muon blogpost](https://kellerjordan.github.io/posts/muon/), and optionally incorporates [Moonshot's](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf) learning rate adjustment strategy.
This implementation maintains a minimalist API and is consistent with other optimizer conventions. PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the [PyTorch examples](https://github.com/pytorch/examples) directory.
**Usage**
```python
model = MyModelForCausalLM
# filter out your params manually
muon_params = [...]
adamw_params = [...]
muon = Muon(
    params=muon_params,
    lr=lr,
    weight_decay=wd,
)
adamw = AdamW(
    params=adamw_params,
    lr=lr,
    weight_decay=wd,
)
# in training loop
loss = model(input)
loss.backward()
muon.step()
adamw.step()
muon.zero_grad()
adamw.zero_grad()
```
~~**Additional usage**~~
~~Users are also able to pass in self-defined `msign` function for orthogonalization, and learning rate adjustment function. Interface defined below:~~
```python
~~AdjustLrFn: TypeAlias = Callable[[float, torch.Size], float]~~
~~MsignFn: TypeAlias = Callable[[Tensor, BaseMsignFnConfig], Tensor]~~
```
As discussed with team and in comment, we prefer to make the interface simpler and cleaner, thus we removed the callback interface, and canonicalize the original NS algorithm for Muon. The only configs available to users are `ns_steps`, `coefficients`, and `eps`, configurable through kwargs.
By default, we use 5-step Newton-Schulz, with coefficients proposed by [Keller](https://kellerjordan.github.io/posts/muon/). We use LR adjustment proposed by [Moonshot](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf), which grafts learning rate from AdamW.
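For reference, a standalone sketch of that 5-step Newton-Schulz orthogonalization with the coefficients from Keller's post (an illustrative reimplementation, not the code added in this PR):
```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Quintic Newton-Schulz iteration approximating the matrix sign / orthogonalization of G.
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients published in Keller Jordan's blogpost
    X = G.bfloat16()
    X = X / (X.norm() + eps)            # bound the spectral norm by the Frobenius norm
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

# Example: the result is roughly orthogonal (rows have near-unit norm for a wide matrix).
g = torch.randn(16, 32)
o = newton_schulz5(g)
print((o @ o.T).diag()[:4])
```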
**Testing**
~~1. Unit tests: the newly introduced Muon is covered in `test/test_optim.py`. We updated the test cases to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.~~
As discussed, in order not to complicate the codebase, we prefer not to include reference implementation into PyTorch. We also updated the interface so we don't need to test the FQN based filtering. Muon is covered by the existing `test_optim.py` unit test.
2. End-to-end test: we added a training script that pre-trains a QWEN-like model on `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curve is compared against the Moonshot implementation to confirm behavioral consistency.
<img width="1102" height="472" alt="Screenshot 2025-07-29 at 1 04 12 AM" src="https://github.com/user-attachments/assets/ceab0733-497d-4070-8032-02ae7995c64c" />
**Numerics**
We evaluate our implementation with existing implementation to confirm numerical consistency.
As discussed, our implementation closely follows the algorithm described in [Keller's post](https://kellerjordan.github.io/posts/muon/), while incorporating the learning rate adjustment from [Moonlight](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf). This captures a key insight that allows users to reuse hyper-parameters tuned for `adamW`, making Muon a drop-in swap.
As expected, the numerics difference mainly comes from `adjust_lr`, a max of ~5% relative diff in an example unit test setup below.
```python
# dummy model and data
model0 = Linear(10, 10, bias=False)
model1 = copy.deepcopy(model0)
inputs = torch.randn(8, 10)
targets = torch.randn(8, 10)
loss = MSELoss()
lr = 1e-3
wd = 0.1
momentum = 0.95
opt_ref_muon = KellySingleDeviceMuon(
params=model0.parameters(),
lr=lr,
weight_decay=wd,
momentum=momentum,
)
opt_exp_muon = Muon(
params=model1.parameters(),
lr=lr,
weight_decay=wd,
momentum=momentum,
)
out_ref = model0(inputs)
loss_ref = loss(out_ref, targets)
opt_ref_muon.zero_grad()
loss_ref.backward()
opt_ref_muon.step()
out_exp = model1(inputs)
loss_exp = loss(out_exp, targets)
opt_exp_muon.zero_grad()
loss_exp.backward()
opt_exp_muon.step()
for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
    torch.testing.assert_close(p_ref, p_exp)
```
As explained above, including this `adjust_lr` is preferable. This is validated by e2e training runs on a Qwen-2-like 0.5B model, where the curves show that training with `adjust_lr` converges more effectively than without.
<img width="1179" height="464" alt="Screenshot 2025-08-18 at 10 12 33 AM" src="https://github.com/user-attachments/assets/e797d3da-c2f0-4187-b99e-5d48b7437c3c" />
**Performance**
Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP:
- adamw_ddp finishes in 13.12 min
- pytorch_muon_ddp finishes in 13.45 min
Muon runs ~20s slower compared to AdamW. Assuming no other changes, Muon is *2.5%* slower than AdamW.
AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms
<img width="726" height="590" alt="Screenshot 2025-07-29 at 1 56 14 AM" src="https://github.com/user-attachments/assets/ebcd7e1c-d129-4b20-9396-39f568edf03d" />
Muon: Optimizer.step() takes ~54 ms, step time ~960 ms
<img width="751" height="597" alt="Screenshot 2025-07-29 at 2 02 20 AM" src="https://github.com/user-attachments/assets/72f5b904-ebd5-4502-a6ff-d3e9e5a6da81" />
**Note**
We restrict the implementation to accept only 2D parameters.
An alternative approach is to allow parameters with more than two dimensions and apply orthogonalization over the last two dimensions. We opt not to go with this approach as it can be error-prone. For example, with a kernel shaped `[in_channel, height, width, out_channel]`, applying orthogonalization to the last two dimensions is not meaningful.
Since Muon is designed to operate orthogonalization on 2D matrices, preserving this assumption keeps the implementation clean and sound.
**Next Steps**
1. Add `MuP`
2. Open-source optimized triton kernel for symmetric matmul. A preliminary benchmark found 1.23x - 1.48x speedup on small - large (n = 256 -> 16384) matrices.
3. Open-source unsharded Muon co-designed with FSDP2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160213
Approved by: https://github.com/janeyx99
This pull request adds the following ops for sparse matrices using Eigen library:
```python
add(a_csr, b_csr)
add(a_csc, b_csc)
addmm(c_csr, a_csr, b_csr)
addmm(c_csr, a_csr, b_csc)
addmm(c_csr, a_csc, b_csc)
addmm(c_csr, a_csc, b_csr)
addmm(c_csc, a_csr, b_csr)
addmm(c_csc, a_csr, b_csc)
addmm(c_csc, a_csc, b_csc)
addmm(c_csc, a_csc, b_csr)
```
Currently, the operations for sparse matrices on CPU are available through MKL only. Because MKL does not exist on `aarch64`, these ops are unavailable on any machines with ARM-based CPUs, including Apple Silicon, AWS Graviton and NVIDIA Grace. This PR addresses this issue by using Eigen as a backend for the above ops.
This is a refactored version of my previous PR #101814. The main difference from the old one is that this does not enable Eigen by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155357
Approved by: https://github.com/pearu, https://github.com/eqy
Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
# Context
In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize `numactl` CLI. See #160006 for a full description of the challenges.
Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.
# This PR
On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](https://github.com/pytorch/pytorch/issues/160006#issuecomment-3162018836)
Specifically, for each local rank `i`:
1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally.
There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that can work for *both* `str` and `Callable`, and it's pretty simple.
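A minimal sketch of that set-affinity / spawn / restore pattern (assuming Linux, where `os.sched_setaffinity` is available; the names are illustrative, not the actual torchelastic code):
```python
import contextlib
import multiprocessing as mp
import os

@contextlib.contextmanager
def temporary_affinity(cpus):
    # Pin the parent to `cpus` so the child it spawns inherits that mask,
    # then restore the parent's original affinity.
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)
    try:
        yield
    finally:
        os.sched_setaffinity(0, original)

def worker(local_rank):
    print(f"local rank {local_rank} bound to CPUs {sorted(os.sched_getaffinity(0))}")

if __name__ == "__main__":
    all_cpus = sorted(os.sched_getaffinity(0))
    for local_rank, cpu in enumerate(all_cpus[:2]):
        with temporary_affinity({cpu}):
            p = mp.Process(target=worker, args=(local_rank,))
            p.start()
        p.join()
```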
This required a bit of refactoring:
1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`.
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
4. Use the context manager for both `str` and `Callable` paths
# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`
## Manual
See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TLDR tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161183
Approved by: https://github.com/d4l3k
Summary: AMD-specific kwargs need to be removed from the guard, otherwise a KeyError will be raised when executing the kernel.
Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
can succeed after this change.
Rollback Plan:
Differential Revision: D80285441
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160671
Approved by: https://github.com/muchulee8
Summary:
[The diff was reverted due to CLA error, in the process of retrieving account]
Previous error message
```
RuntimeError: Expected input at *args.<unknown location>.shape[0] to be equal to 4096, but got 7680. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC)
```
New error message
```
RuntimeError: Expected input at *args.[0].supervision_input.weight.shape[0] to be equal to 4096, but got 7680. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC)
```
Test Plan:
```
buck test mode/opt apf/rec/ir/tests:ir_export_deserialize_test
```
https://www.internalfb.com/intern/testinfra/testrun/4785074906254375
```
buck run mode/opt caffe2/test:test_export -- -r unflatten
```
```
Ran 413 tests in 208.414s
OK (skipped=1, expected failures=13)
```
Rollback Plan:
Differential Revision: D80487367
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160919
Approved by: https://github.com/angelayi
Users encountered unexpected behaviour when using FlexAttention with learnable biases, including assertion errors (#157677)
We traced the root cause to the registration of subgraph buffers: this caused inconsistencies in the naming and ultimately incorrect retrieval later on. This problem only arose if the model was compiled as a whole (i.e. using @torch.compile), since only then would there be naming conflicts.
In this PR, we register the buffers with the base graph to solve this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161170
Approved by: https://github.com/drisspg
Fixes #160743
The MPS impl of `avg_pool2d` seems to only give incorrect results when `ceil_mode=True`. I wrote a performance measurement script (0ee6e58643/avg_pool_mps/perf_2d.py) which tests a bunch of different cases and also marks the cases where MPS and CPU results do not match.
I found that if I update `avg_pool2d` to use the new Metal kernel in all cases, that fixes all the mismatches, but it also decreases performance for some of the `ceil_mode=False` cases. So I opted to only run the new Metal kernel when `ceil_mode=True`, which does not significantly decrease performance in any of the cases tested.
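A small repro-style sketch of the `ceil_mode=True` comparison (assuming an MPS-capable machine; with the fix, the MPS result should match the CPU reference):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 7, 7)
ref = F.avg_pool2d(x, kernel_size=3, stride=2, ceil_mode=True)   # CPU reference
if torch.backends.mps.is_available():
    out = F.avg_pool2d(x.to("mps"), kernel_size=3, stride=2, ceil_mode=True)
    torch.testing.assert_close(ref, out.cpu())
```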
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161011
Approved by: https://github.com/malfet
# Problem
The FX converter previously supported graph outputs which were `StorageBox`, but not `TensorBox`. The latter seems to show up in certain cases when the output is a slice/view of the input.
# Fix
This PR generalizes the code to handle `MutableBox` instead of `StorageBox` specifically.
# Test
Added a CI test exposing the issue. The test case was found by intentionally breaking `TensorBox(ReinterpretView` support in https://github.com/pytorch/pytorch/pull/161258.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161245
Approved by: https://github.com/angelayi
Remove enable_fake_mode and exporter_legacy entirely. Even though this is bc breaking, `enable_fake_mode` is no longer compatible with the latest version of transformers, and so it is no longer useful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161222
Approved by: https://github.com/titaiwangms
Summary:
Adds Python garbage collection to Kineto traces and profiler FunctionEvents. We create a custom C++ callback in profiler_python.cpp, then define a Python function in C++ and register that callback for all Python garbage collection. We don't worry about thread safety in this case because we only do init/teardown for the main thread while holding the GIL.
Currently we hide this behind an experimental config because Python tracing tends to be unstable, especially when adding any new feature. If this is found not to add too much overhead, we can turn it on by default. NOTE: To enable this you need both with_stack=True and the experimental config on!
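A sketch of the kind of run this targets (the exact experimental-config knob is not named above, so it is omitted here; without it the GC events will simply not appear):
```python
import gc
import torch
from torch.profiler import ProfilerActivity, profile

# with_stack=True is required per the note above; the experimental config that turns
# the GC events on is assumed to be set elsewhere.
with profile(activities=[ProfilerActivity.CPU], with_stack=True) as prof:
    x = torch.randn(512, 512)
    y = x @ x
    gc.collect()   # force a collection inside the profiled region
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```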
Test Plan:
Ran trace with GC induced and saw it on trace
Also added a test
Rollback Plan:
Differential Revision: D80491146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161209
Approved by: https://github.com/ngimel
For comparing NativeRT and TorchScript, we add `torchscript-jit-trace` as an option in the benchmark. With this option, we can trace a model and run inference with the traced module using the TorchScript interpreter.
```
python ./benchmarks/dynamo/huggingface.py --performance --inference --torchscript-jit-trace
python ./benchmarks/dynamo/timm_models.py --performance --inference --torchscript-jit-trace
python ./benchmarks/dynamo/torchbench.py --performance --inference --torchscript-jit-trace
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161223
Approved by: https://github.com/huydhn
HIPAllocatorMasqueradingAsCUDA and HIPCachingAllocatorMasqueradingAsCUDA are now proper complete wrappers of HIPAllocator and HIPCachingAllocator, respectively. HIPAllocatorMasqueradingAsCUDA now subclasses HIPAllocator instead of Allocator. This fixes usability of hipify replacing c10::cuda::CUDACachingAllocator::get(), where callers expect a CUDAAllocator to be returned but were instead getting a very thin Allocator shim.
This also fixes using cudagraph trees with torch compile. The hip:0 device was not being replaced by the cuda:0 device in all methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161221
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
## Test Result
```bash
lintrunner --take MYPY test/test_numpy_interop.py
Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file.
ok No lint issues.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158556
Approved by: https://github.com/soulitzer
Summary:
We use tempfile.NamedTemporaryFile to create a temporary pt2 file in `test_nativert.py`
However, it is not recognized as an allowed file format and a warning will be thrown.
Test Plan:
CI
Rollback Plan:
Differential Revision: D80740916
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161203
Approved by: https://github.com/angelayi
Fixes #158076
Basically, the gemm template generates code like
```
cpp_CppMicroGemmRef_micro_gemm<static_cast<bool>(false), static_cast<bool>(false)>(
&(X[static_cast<int64_t>(k_start + 196LL*m_start + 38416LL*ks_b_index)]),
&(W[static_cast<int64_t>(200704000LL + n_start + 80LL*k_start + 15680LL*ks_b_index)]),
&(local_acc_buf[static_cast<int64_t>(Nr*nci + ((-1LL)*Nr*nc))]),
static_cast<int64_t>(m_end + ((-1LL)*m_start)),
static_cast<int64_t>(Nr),
static_cast<int64_t>(k_end + ((-1LL)*k_start)),
static_cast<int64_t>(196LL),
static_cast<int64_t>(80LL),
static_cast<int64_t>(Nc_blocks*Nr)
);
```
However, when the input tensor W has a storage offset, this results in a double offset issue. That is, the resulting pointer is `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access.
The storage offset of `W` is introduced by [this patch](https://github.com/pytorch/pytorch/pull/136421/files), but I think it's a reasonable fix. So `cpp_gemm_template.py` should handle input matrices with storage offsets properly.
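A small illustration of what a storage offset is and why adding it twice walks past the allocation (plain PyTorch, unrelated to the generated C++ above):
```python
import torch

base = torch.randn(4, 8)
w = base[2:]                              # a view that starts 2 rows into base's storage
print(w.storage_offset())                 # 16 elements
print((w.data_ptr() - base.data_ptr()) // base.element_size())   # also 16
# The generated kernel indexes from W's data_ptr(), which already includes the storage
# offset; emitting the offset again in the index expression doubles it and reads
# out of bounds.
```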
I think a good way to fix this issue is to create a new matrix that has no storage offset.
When `should_block_weights` is true, `block_weight()` creates a clean new matrix, so that branch is not affected by this issue.
BTW I've also examined the FX IRs generated by `torch.compile()`, as well as the generated python module, and they are correct.
The newly-added test in `test_cpu_select_algorithm.py` can reproduce the issue. With this patch, the crash is fixed. It also resolves the crash reported in #158076.
I ran CPU tests in `test_cpu_select_algorithm.py`, but many of them are skipped due to MKL and AMX. I'd appreciate it if someone could help verify the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159233
Approved by: https://github.com/leslie-fang-intel, https://github.com/swolchok
Note: Adding a unit test for this is tricky, as having errors in the specific unit test would cause test_utils.py to crash altogether.
Tested as follows:
1. Added x = 1/0 after guarded_code = compile_inner(code, one_graph, hooks, transform) in convert_frame.py
2. Printed exception_stack_trace and got: ['Traceback (most recent call last):\n File "/data/users/jovian/pytorch/torch/_dynamo/convert_frame.py", line 1207, in _compile\n x = 1/0\n ~^~\nZeroDivisionError: division by zero\n']
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161096
Approved by: https://github.com/c00w
Following up on https://github.com/pytorch/pytorch/pull/152951#discussion_r2267714825, this removes a few lines added in that pull request, fixing link errors like
```
[7019/7028] Linking CXX shared library bin\torch_hip.dll
FAILED: [code=4294967295] bin/torch_hip.dll lib/torch_hip.lib
C:\Windows\system32\cmd.exe /C "cd . && D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\cmake\data\bin\cmake.exe -E vs_link_dll --msvc-ver=1942 --intdir=caffe2\CMakeFiles\torch_hip.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100261~1.0\x64\rc.exe --mt=C:\PROGRA~2\MICROS~2\2022\BUILDT~1\VC\Tools\Llvm\x64\bin\llvm-mt.exe --manifests -- D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO && cd ."
LINK: command "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /MANIFEST:EMBED,ID=2" failed (exit code 1) with the following output:
lld-link: error: undefined symbol: __declspec(dllimport) class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::native::transform_bias_rescale_qkv_cuda(class at::Tensor const &, class at::Tensor const &, __int64)
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_CUDA___transform_bias_rescale_qkv(class 0xE9BF7323::Tensor const &, class 0xE9BF7323::Tensor const &, __int64))
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterNestedTensorCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_NestedTensorCUDA___transform_bias_rescale_qkv(class 0xEFEB5304::Tensor const &, class 0xEFEB5304::Tensor const &, __int64))
```
The `native_transformers_hip_hip` and `native_transformers_hip_cpp` sources are okay to define (and are required) even if accelerated versions of these operations are not available.
I've tested downstream builds of torch with ROCm on native Windows via https://github.com/ROCm/TheRock both with and without aotriton and these changes were needed for the build to succeed in both cases. I have _not_ tested Linux, WSL, or with the HIP SDK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160373
Approved by: https://github.com/alugorey, https://github.com/jeffdaily
Summary:
Removed `Model`; it's not being used anywhere, so it's safe to remove.
Removed the `tensor_paths` and `constant_paths` fields in `ExportedProgram`:
- BC: when the current deserializer loads a previously serialized EP (that comes with empty `tensor_paths` and `constant_paths`), it will just ignore those two fields
- FC: when the old deserializer loads a newly serialized EP (that doesn't come with `tensor_paths` and `constant_paths`), it will also ignore those two fields in `_dict_to_dataclass()`
Differential Revision: D80725094
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161185
Approved by: https://github.com/SherlockNoMad
By changing the dtype to float if the device is MPS.
Note: for some reason the test runs much longer on MPS than on CPU
```
% python ../test/test_indexing.py -v -k test_index_put_accumulate_duplicate_indices_mps
test_index_put_accumulate_duplicate_indices_mps (__main__.TestIndexingMPS.test_index_put_accumulate_duplicate_indices_mps) ... ok
----------------------------------------------------------------------
Ran 1 test in 9.139s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161201
Approved by: https://github.com/dcci
Set up the vLLM test logic:
1. install the wheels generated from the previous build stage
2. generate and install the vLLM test package list at run time, based on the torch wheels on the instance
3. run tests based on the pre-defined test plan
Note that the test-plan format is temporary, for some basic vLLM testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160361
Approved by: https://github.com/atalman, https://github.com/huydhn
Summary: In the consolidate_safetensors_files_on_every_rank method, where we use multiple ranks to combine sharded safetensors files, if there are more ranks in the world size than there are safetensors files to consolidate, then some ranks have no work to do. When I tested, this case wasn't caught, and there was an extra barrier call causing issues for the ranks that had no work to do. They should wait at the end, as the ranks with work do.
Test Plan:
tested this case on a job e2e
added a unit test
Rollback Plan:
Differential Revision: D80273616
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160660
Approved by: https://github.com/sibuachu
Summary:
The `reserve()` method is used to pre-allocate memory for the result vector before adding elements to it. This is an optimization that makes sense for several reasons:
1. Performance improvement: By pre-allocating memory for the exact number of elements needed, it avoids multiple reallocations and memory copies that would occur as the vector grows dynamically.
2. Memory efficiency: It ensures that the vector allocates exactly the amount of memory needed, no more and no less, which is efficient when we know the final size in advance.
3. Reduced overhead: Each reallocation typically involves:
   - Allocating a new, larger block of memory
   - Copying all existing elements to the new location
   - Destroying the old elements
   - Deallocating the old memory block
4. Consistent performance: Without reservation, vector growth typically follows a geometric progression (like 1, 2, 4, 8, 16...), which can lead to unpredictable performance spikes when reallocation occurs.
Test Plan:
OSS CI & tests
Rollback Plan:
Differential Revision: D80674453
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161128
Approved by: https://github.com/Skylion007
This adds a new function `bypass_package` and `CompilePackage.bypass_current_entry()`. This allows us to safely bypass when a model has unserializable or incompatible parts. When we encounter something incompatible, we'll raise a bypass and ignore that particular code in DynamoCodeEntry.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160902
Approved by: https://github.com/zhxchen17
Expose the pointer so that we can create the `ncclConfig_t` object from PyTorch and use it elsewhere. This is useful for controlling the NCCL communicator parameters when there are multiple NCCL communicators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161136
Approved by: https://github.com/kwen2501
**Summary**
When the output dtype is fp8, oneDNN does not ensure that intermediate results are in the range [-448, 448] before converting to fp8. So, we may get NaN in the output, which is a disaster for inference. This PR fixes the issue by clamping the intermediate results with oneDNN's post-op clip.
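A minimal sketch of the clamp-before-convert idea in plain PyTorch (assuming the e4m3 format; the actual fix uses oneDNN's clip post-op, not eager ops):
```python
import torch

fp8_max = torch.finfo(torch.float8_e4m3fn).max    # 448.0
intermediate = torch.randn(4, 4) * 1000           # intermediate results that can exceed the fp8 range
clamped = intermediate.clamp(-fp8_max, fp8_max)   # keep values representable
out = clamped.to(torch.float8_e4m3fn)             # convert only after clamping
print(out.float().abs().max())
```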
**Test plan**
```
pytest -sv test/quantization/core/test_quantized_op.py -k "q and fp8"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160957
Approved by: https://github.com/Valentine233, https://github.com/CaoE
The motivation for this change can be seen through the following example:
```
import torch

GPU_TYPE = "cuda"

@torch.compile
def no_override(x):
    return x.sum(dim=0)

@torch.compile
def override(x):
    return x.sum(dim=0)

x_small = torch.randn(4096, 512, device=GPU_TYPE)
no_override(x_small)
torch._dynamo.decorators.mark_dynamic(x_small, 0, hint_override=4096 * 1000)
override(x_small)
```
Previously, when reductions were split, codegen relied only on the first observed shape. With a small input, this resulted in a small split size:
```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, ks0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
xnumel = 16384
rnumel = r0_numel
```
With the new scheme, inductor honors hint_override during codegen, producing larger and more appropriate split sizes:
```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, ks0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
xnumel = 1024000
rnumel = r0_numel
```
This addresses a broader problem with dynamism: performance and numerics previously depended on whichever shape was seen first. For example:
```
f(s0) -> f(s2)
f(s1) -> f(s2)
```
could generate different kernels. With the new approach, an explicit override pins the chosen configuration:
```
f(s0, hint_override=s0) -> f(s2)
f(s1, hint_override=s0) -> f(s2)
```
ensuring consistent kernel generation regardless of input order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161007
Approved by: https://github.com/jansel
This is not needed on drivers >= 525, and in DriverAPI::get() we are initializing the context anyway, so setting the environment variable after that is beside the point.
As a result of calling DriverAPI::get on systems that don't have GPUs available (e.g. due to CUDA_VISIBLE_DEVICES=""), people were getting confusing errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161103
Approved by: https://github.com/eqy, https://github.com/malfet
The existing logic here to work around SFINAE issues under Microsoft platforms also applies to libc++ platforms. It appears that nvcc reports ambiguity in overload resolution for `pow_`. This seems like an nvcc limitation.
```
fbcode/caffe2/aten/src/ATen/native/cuda/Pow.cuh(42): error: more than one instance of overloaded function "pow" matches the argument list:
function template "std::__2::enable_if<<expression>, std::__2::__promote<_A1, _A2, void>>::type::type pow(_A1, _A2) noexcept" (declared at line 848 of fbcode/third-party-buck/platform010-libcxx/build/libcxx/include/c++/v1/math.h)
function template "std::__2::enable_if<<expression>, std::__2::__promote<_Tp, _Up, void>>::type pow(_Tp, _Up) noexcept" (declared at line 11308 of fbcode/third-party-buck/platform010/build/cuda/12.4/bin/..//include/crt/math_functions.h)
argument types are: (double, float)
return ::pow(base, exp);
^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161101
Approved by: https://github.com/malfet
Summary: use `self.inductor_meta_common()` to call the static method, since custom subclasses may override the method as an instance method
Test Plan:
```
caffe2/test/inductor:select_algorithm -- test_finalized_subclass_hooks
```
Rollback Plan:
Differential Revision: D80375351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160895
Approved by: https://github.com/eellison, https://github.com/blaine-rister
Fixes #152952
Replace `_device_t` with `torch.types.Device` in `torch/cpu/__init__.py`. Did basic smoke test by running tests that `import torch.cpu` including `test/distributed/test_c10d_functional_native.py` and `test/test_decomp.py`.
Based this PR off of #152935 which is referenced in the main issue.
(also, this is my first contribution but I followed the contributing guide closely)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161031
Approved by: https://github.com/janeyx99
Summary: Similar to #ifdef checks added in addmm_impl_cpu_ to conditionally enable ACL, we add the same checks in bmm_out_or_baddbmm_. This essentially disables ACL for bmm_out_or_baddbmm_ and enables ArmPL, which seems to be performing better.
Test Plan: AR SL
Rollback Plan:
Reviewed By: Nicoshev
Differential Revision: D80494623
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161065
Approved by: https://github.com/q10
Per title: previously we started throwing noisy warnings, but given how popular this pattern was in our test suite, we decided to leave it as a warning rather than a silent behavior change for one release.
Now `treatSequenceAsTuple` would return `true` only in the case where the sequence was indeed a tuple, so there is no need for a special function anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160794
Approved by: https://github.com/albanD
Differential Revision: D79694055
Couple of fixes:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and error out using the FQN of the lifted constant.
2. The previous attribute-mutation detection logic in non-strict didn't account for nested module structure. This fixes the silent incorrectness issue of exporting esm and qwen in non-strict.
3. We modify yolov3 to fix the previous silently incorrect behaviour.
When upgrading the torchbench pin, opacus_cifar10 no longer seems to run on eager. I verified this by pushing a temporary PR on master with the new pin, so I added it to the expect_fail list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159923
Approved by: https://github.com/avikchaudhuri
Use std::unique_ptr to decrease the size from 24 bytes to 8.
Since std::unique_ptr is not copyable, this required defining the copy / copy-assignment constructors, which made me realize we shouldn't be copying `tokens_` in those.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160764
Approved by: https://github.com/albanD
Removing tb-nightly because we found issues when importing tensorboard: having both tb-nightly and tensorboard installed causes pip to report 2.18.0 (the pinned tensorboard) while importing in a Python shell reports 2.13.XXX. This mismatch causes issues when running tests in a numpy 2.x environment, e.g.
```
/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler
/opt/venv/lib/python3.12/site-packages/redis/connection.py:77: UserWarning: redis-py works best with hiredis. Please consider installing
warnings.warn(msg)
/opt/venv/lib/python3.12/site-packages/google/protobuf/internal/well_known_types.py:91: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
_EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0)
E
======================================================================
ERROR: test_event_handler (__main__.TestMonitorTensorboard.test_event_handler)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/var/lib/jenkins/pytorch/test/test_monitor.py", line 116, in setUp
from tensorboard.backend.event_processing import (
File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 25, in <module>
from tensorboard.backend.event_processing import (
File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/plugin_event_accumulator.py", line 25, in <module>
from tensorboard.backend.event_processing import event_file_loader
File "/opt/venv/lib/python3.12/site-packages/tensorboard/backend/event_processing/event_file_loader.py", line 21, in <module>
from tensorboard import dataclass_compat
File "/opt/venv/lib/python3.12/site-packages/tensorboard/dataclass_compat.py", line 33, in <module>
from tensorboard.plugins.hparams import metadata as hparams_metadata
File "/opt/venv/lib/python3.12/site-packages/tensorboard/plugins/hparams/metadata.py", line 32, in <module>
NULL_TENSOR = tensor_util.make_tensor_proto(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/tensorboard/util/tensor_util.py", line 405, in make_tensor_proto
numpy_dtype = dtypes.as_dtype(nparray.dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py", line 677, in as_dtype
if type_value.type == np.string_ or type_value.type == np.unicode_:
^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/numpy/__init__.py", line 400, in __getattr__
raise AttributeError(
AttributeError: `np.string_` was removed in the NumPy 2.0 release. Use `np.bytes_` instead.
----------------------------------------------------------------------
Ran 1 test in 0.355s
FAILED (errors=1)
```
After removing tb-nightly and ensuring that tensorboard 2.18.0 is the only tensorboard in the env:
```
root@rocm-framework-47:/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler
.
----------------------------------------------------------------------
Ran 1 test in 0.409s
OK
```
```
>>> import tensorboard
>>> print(tensorboard.__version__)
2.13.0a20230426
```
```:/# pip show tensorboard
Name: tensorboard
Version: 2.18.0
Summary: TensorBoard lets you watch Tensors Flow
Home-page: https://github.com/tensorflow/tensorboard
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /opt/venv/lib/python3.12/site-packages
Requires: absl-py, grpcio, markdown, numpy, packaging, protobuf, setuptools, six, tensorboard-data-server, werkzeug
Required-by:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160996
Approved by: https://github.com/huydhn
Summary: `includeBool` is already a small value type (i.e., `bool`, 1 byte) that's passed by value to the function. Capturing it by reference (4 or 8 bytes depending on the system) is unnecessary and could potentially lead to dangling-reference issues if the lambda outlives the original variable. Capturing by value is more efficient for small types and safer.
Test Plan:
OSS CI & tests
Rollback Plan:
Differential Revision: D80595698
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161042
Approved by: https://github.com/Skylion007
- hf_Reformer: this one started failing due to increased graph breaks from the transformers pin bump (#159291). We can likely just bump the expected graph-break count.
- dla102: this one started timing out on Wed 8/13 between commits 6e8865f and ee1b041. But based on the PT2 dashboard, this model doesn't actually have a compile-time or runtime regression. Will try bumping up the timeout and see if it works.
- hf_BigBird: this one's accuracy status has improved as of today. Will update the hf_BigBird accuracy status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160932
Approved by: https://github.com/zou3519, https://github.com/huydhn, https://github.com/malfet
Currently, the `std::min` -> `::min` mapping does not work as expected on ROCm when input values are >= 2147483648.
Replace `std::min` with a ternary statement.
Alternatively, `std::min` can be replaced by the explicit typing `std::min<int64_t>`.
fixes on ROCm:
test_sort_and_select.py::TestSortAndSelectCUDA::test_sort_large_cuda_float16
error:
RuntimeError: Cannot sort dimension of length 8192
Similar PR to fix large tensors on ROCm https://github.com/pytorch/pytorch/pull/130994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161054
Approved by: https://github.com/jeffdaily
Summary: The ONNX team and the recent transformers upgrade ran into this error, and we also ran into it during our export benchmarking. This diff makes it possible to trace through the vmap implementation in pre-dispatch IR. Note that we don't support serializing functorch ops in pre-dispatch IR, and in the future we should desugar them to post-grad ops.
The implementation strategy is:
1. We add python wrappers around vmap APIs so that we attach custom torch function handler that is only on during non-strict export. The reason is we don't want to add this to default torch_function handler because it will break BC.
2. Some dynamo changes to make sure it picks up the new Python wrapper APIs. The reason is that when we do strict export, we need to re-materialize these APIs in pre-dispatch IR from torch IR. We could avoid this by special-casing in dynamo for export to proxy different API calls, but I feel that is too much chaos because you need to be able to proxy 2 different variants of the same vmap API.
Test Plan: CI
Differential Revision: D75623875
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154650
Approved by: https://github.com/ezyang, https://github.com/zou3519
# why
- head is broken
# what
- the template for experimental API is broken
- the test assumes not experimental API
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_regular_mm_persistent_tma_strided_a_transposed_True_b_transposed_False_dynamic_True -v
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161025
Approved by: https://github.com/PaulZhang12
Today convert_frame is implemented like the following:
```
def _compile():
    tracer_output = None

    def transform():
        nonlocal tracer_output
        ...

    def _compile_inner():
        transform(...)
        compile_inner(...)
```
The code uses an unconventional nonlocal variable as the return value. This is not ideal for 2 reasons:
1. Reasoning about the code, especially together with error handling code becomes harder.
2. more importantly, this makes it harder to extract out common code pieces into a shared library because everything must depend on a central global state.
In this diff we remove the usage of nonlocal return and just use the conventional function return to output the compilation data.
Differential Revision: [D80461258](https://our.internmc.facebook.com/intern/diff/D80461258/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160899
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #160814, #160815, #160855
We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export).
This PR adds a new helper function compile_frame() which takes a bytecode and a transform function and return compiled bytecode + output graph as DynamoOutput type.
Differential Revision: [D80430802](https://our.internmc.facebook.com/intern/diff/D80430802/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160855
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #160814, #160815
This fixes the case when an input / output contains both zero strides and singleton dimensions. In this case the broadcasting dimensions generated for the descriptor need to ignore dimensions that have zero strides with size 1, otherwise the determination of which dimensions to broadcast will fail.
As an example, consider the following store instruction:
```
name=buf1
index=x2 + 192*y0 + 64*y1
value=TritonCSEVariable('tmp7')
params = BlockParameters(
shape=[3, 4, 1, 1, 64],
block_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), 1, 1, XBLOCK],
strides=[64, 192, 0, 0, 1],
offsets=[(yoffset//4), ModularIndexing(yoffset, 1, 4), 0, 0, xoffset]
)
broadcasting_dims=[False, False, True, True, False]
broadcast_shape=[((YBLOCK + 3)//4), Min(4, YBLOCK), XBLOCK]
```
Because `len(self.broadcasting_dims) != len(self.broadcast_shape)`, dim 3 is incorrectly
marked as a broadcast dimension when the pre-broadcast shape is computed in `codegen_broadcast_and_reshape`.
```
9 pre_broadcast_shape = [
280 sympy.S.One if is_broadcasting else dim
281 for dim, is_broadcasting in zip(
282 -> self.broadcast_shape, self.broadcasting_dims
283 )
284 ]
```
The pre_broadcast_shape is now wrong: `[((YBLOCK + 3)//4), Min(4, YBLOCK), 1]`
Triton throws the following error: `reshape() cannot change total number of elements in tensor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160310
Approved by: https://github.com/blaine-rister
The motivation for this change can be seen through the following example:
```
import torch

GPU_TYPE = "cuda"

@torch.compile
def no_override(x):
    return x.sum(dim=0)

@torch.compile
def override(x):
    return x.sum(dim=0)

x_small = torch.randn(4096, 512, device=GPU_TYPE)
no_override(x_small)
torch._dynamo.decorators.mark_dynamic(x_small, 0, hint_override=4096 * 1000)
override(x_small)
```
Previously, when reductions were split, codegen relied only on the first observed shape. With a small input, this resulted in a small split size:
```
def triton_per_fused_sum_1(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr):
xnumel = 512
r0_numel = 32
```
With the new scheme, inductor honors hint_override during codegen, producing larger and more appropriate split sizes:
```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
xnumel = 16384
r0_numel = 128
```
This addresses a broader problem with dynamism: performance and numerics previously depended on whichever shape was seen first. For example:
```
f(s0) -> f(s2)
f(s1) -> f(s2)
```
could generate different kernels. With the new approach, an explicit override pins the chosen configuration:
```
f(s0, hint_override=s0) -> f(s2)
f(s1, hint_override=s0) -> f(s2)
```
ensuring consistent kernel generation regardless of input order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161007
Approved by: https://github.com/jansel
Summary:
We use this constructor in HigherOrderKernel. Problems arise in the loop condition, where it's possible for an output of the previous iteration to be an input to the next: the Output(N) of a kernel may be the Input(M) of a kernel in the next iteration. Thus, if that output value is reset (via `fastresizetozero`) or overwritten by a previous kernel before it is used, we have major issues.
We need to enforce that outputs are moved, not copied, to ensure this doesn't happen.
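A tiny Python analogue of the hazard (illustrative only; the actual issue is in the C++ runtime's handling of output slots, not eager Python code):
```python
import torch

# A runtime-managed output slot that gets reused across loop iterations.
slots = {"out0": torch.arange(3.0)}

# Copy-style handoff: the loop carry and the slot reference the same tensor,
# so resetting the slot (the analogue of fastresizetozero) clobbers the carry.
carry = slots["out0"]
slots["out0"].resize_(0)
print(carry.numel())  # 0 -- the carried input was silently destroyed

# Move-style handoff: the slot relinquishes the tensor before being reused,
# so later resets of the slot cannot touch the carried value.
slots["out0"] = torch.arange(3.0)
carry = slots["out0"]
slots["out0"] = torch.empty(0)  # slot now holds a fresh tensor
print(carry)  # tensor([0., 1., 2.]) -- intact
```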
Test Plan:
buck2 test //caffe2/test:test_export --local-only -- test_while_loop_tensor_constant_idx_cpp_runtime_nonstrict
Rollback Plan:
Differential Revision: D80565374
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161005
Approved by: https://github.com/SherlockNoMad
Summary: This commit standardizes the parameter order across PyTorch's experimental distributed checkpoint (DCP) API, changing all checkpoint operations from (state_dict, path) to (path, state_dict) for consistency with standard file I/O patterns.
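For illustration only (the function name below is hypothetical; the real experimental DCP entry points follow the same ordering), the call-site change looks like:
```python
import torch

model = torch.nn.Linear(4, 4)
path = "/tmp/ckpt/step_100"

# Before: checkpoint ops took (state_dict, path)
# save_checkpoint(model.state_dict(), path)

# After: checkpoint ops take (path, state_dict), mirroring open(path, ...)-style I/O
# save_checkpoint(path, model.state_dict())
```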
Test Plan:
sandcastle tests
Rollback Plan:
Differential Revision: D80549014
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160986
Approved by: https://github.com/pradeepfn
We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export).
This PR follows the previous one, which separated out the part that runs the instruction translator on a given frame and returns a DynamoTracerOutput.
The end result is a free function that runs the instruction translator independently. A follow-up diff will wrap the low-level function.
Differential Revision: [D80388694](https://our.internmc.facebook.com/intern/diff/D80388694/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160815
Approved by: https://github.com/anijain2305
ghstack dependencies: #160814
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. To this end, I have added three test cases, one to test input device movement and the other two to test parameter registration during the forward and backward pass of a model.
**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_root_move_forward_input_to_device
2. pytest test/distributed/_composable/test_replicate_training.py -k TestReplicateRegisteredParams
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160147
Approved by: https://github.com/weifengpy
ghstack dependencies: #160135, #160136
`tensor.view` shares the same `data_ptr()` as the original tensor, so it cannot serve as the key into rendezvous' map (we want a 1:1 match between handle and tensor, thus need a unique key).
@ezyang suggests using the raw `TensorImpl*` of a tensor, for which `tensor.view` would have a different value than the original tensor.
But the same raw `TensorImpl*` can be stumbled on again when a previous tensor gets deallocated and a new one allocated. For that reason, we also need a `weak_intrusive_ptr` to distinguish the two tensors, i.e. for the deallocated tensor, `weak_intrusive_ptr::expired()` would return true.
Added `test_rendezvous_view` and `test_rendezvous_same`.
Note: view support has been added to the NVSHMEM and NCCL backends; for the CUDA backend, I have yet to investigate.
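A conceptual Python analogue of the keying scheme (the real code uses the raw `TensorImpl*` plus a `c10::weak_intrusive_ptr` on the C++ side; `id()` and `weakref` stand in for them here):
```python
import weakref
import torch

_rendezvous = {}  # key -> (weak ref to tensor, handle)

def rendezvous(t: torch.Tensor) -> str:
    key = id(t)  # stand-in for the raw TensorImpl*
    entry = _rendezvous.get(key)
    if entry is not None:
        ref, handle = entry
        if ref() is t:
            return handle  # same, still-alive tensor: reuse its handle
        # key was recycled after a deallocation; the weak ref is stale
    handle = f"handle-{len(_rendezvous)}"
    _rendezvous[key] = (weakref.ref(t), handle)
    return handle

x = torch.randn(4)
v = x.view(2, 2)                       # same data_ptr(), distinct impl
assert rendezvous(x) != rendezvous(v)  # views get their own handle
assert rendezvous(x) == rendezvous(x)  # the same tensor maps to one handle
```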
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160925
Approved by: https://github.com/ngimel
ghstack dependencies: #160825
Summary:
Joint graph passes run several FX passes which can modify the graph before it hits Inductor.
There are three usages of joint graph passes:
- **for inference & not freezing** (we add structured loggings only for this)
- for inference & freezing
- for fw/bw split
Rollback Plan:
Reviewed By: yushangdi
Differential Revision: D80130321
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160589
Approved by: https://github.com/yushangdi
TL;DR: Moving to ScalarType in user extensions and removing deprecated dtypes.
This change _modifies_ the from/to behavior between ScalarType and StableValue! Whereas before, user extensions could only pass around obfuscated dtypes appearing as int32_ts, now users can confidently use torch::headeronly::ScalarType in their extensions for the major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the ScalarType enum values change in the future, user extensions need not fear.
Then we add a Tensor scalar_type API which reuses the from/to logic to return to the user a nice ScalarType (vs an abstracted int32_t).
I then changed the test to test the scalar_type API.
This code change required some refactoring because of circular dependencies.
## BC Breaking note
This commit is (narrowly) BC-breaking for unpopular dtypes: `quint*`s, `qint*`s, `Bits*`, `dummy_uint*`s, `dummy_int*`s, `Float8_e8m0fnu`, and `Float4_e2m1fn_x2` in the narrow use case where an extension retrieves a Tensor dtype of the above and passes it into `aoti_torch_call_dispatcher`. As of now, I believe there are 0 users of this use case, so the benefits of this change significantly justify BC-breaking this API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160557
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
Summary:
Previously we would pass all serialized data to dataclass ctors.
Now we just loop over the existing fields of the dataclass and fetch only the fields we need to run the ctor.
This should help with the case where we are deserializing a buffer that has a new field.
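A minimal sketch of the idea, assuming plain dataclasses (not the actual serde code):
```python
import dataclasses

@dataclasses.dataclass
class Node:
    name: str
    target: str

def from_serialized(cls, data):
    # Keep only keys the current dataclass declares, so a payload written by
    # a newer schema (with extra fields) still constructs cleanly.
    known = {f.name for f in dataclasses.fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in known})

payload = {"name": "add", "target": "aten.add.Tensor", "new_field": 123}
node = from_serialized(Node, payload)  # cls(**payload) would raise TypeError
print(node)
```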
Test Plan:
CI
Rollback Plan:
Differential Revision: D80487716
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160918
Approved by: https://github.com/angelayi
Summary:
When using the Inductor pattern matcher to replace graphs, the graph generated by the replacement function can be missing `original_aten` metadata for the replaced nodes. This further results in Inductor failing to generate a sensible kernel name, e.g. `tri_poi_fused_0`, missing the aten op name.
This diff attempts to fix that by allowing the graph in the replacement function to be traced with `preserve_node_meta`. This is included as an option to turn on in the `pattern_matcher.fwd_only` function.
Can confirm that with the fix, MTIA's pattern matcher replaces the original graph with a node that has `original_aten` meta, and the Inductor-generated kernel name includes the op name.
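A hedged sketch of the mechanism (the exact `fwd_only` option is in the PR; this only illustrates that tracing under `torch.fx.traceback.preserve_node_meta()` is what lets node metadata such as `original_aten` survive):
```python
import torch
import torch.fx.traceback as fx_traceback
from torch.fx.experimental.proxy_tensor import make_fx

def replacement(x):
    return torch.nn.functional.silu(x)

# Tracing the replacement graph inside preserve_node_meta() allows
# proxy-tensor tracing to attach metadata (e.g. node.meta["original_aten"])
# that Inductor can later use when building descriptive kernel names.
with fx_traceback.preserve_node_meta():
    gm = make_fx(replacement, tracing_mode="fake")(torch.randn(8))

for node in gm.graph.nodes:
    if node.op == "call_function":
        print(node.target, node.meta.get("original_aten"))
```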
Test Plan:
added kernel_name check to afg_inductor_test silu test
Rollback Plan:
Differential Revision: D80183670
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160542
Approved by: https://github.com/eellison, https://github.com/bdhirsh
When we search for a NVSHMEM allocation backing a tensor, don't limit it to an exact match between `tensor.data_ptr()` and `allocation.base_ptr`. Instead, test whether the former is within an allocation range, i.e. [base_ptr, base_ptr + size).
This PR also squashes in the original base PR #160795:
Since (i) `handle = rendezvous(tensor)`, and (ii) we pass `handle->buffer_ptrs` to kernels, `handle` should carry the `data_ptr()` of the tensor instead of the base address of a memory allocation (as was previously the case).
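A small illustration of the new lookup rule (plain Python, not the C++ code):
```python
def find_allocation(allocations, data_ptr):
    # allocations: list of (base_ptr, size) pairs registered with NVSHMEM
    for base_ptr, size in allocations:
        if base_ptr <= data_ptr < base_ptr + size:  # half-open range check
            return base_ptr, size
    return None

allocs = [(0x1000, 4096), (0x9000, 8192)]
print(find_allocation(allocs, 0x1000))        # exact base match: (0x1000, 4096)
print(find_allocation(allocs, 0x9000 + 256))  # pointer inside an allocation
print(find_allocation(allocs, 0x5000))        # not NVSHMEM-backed: None
```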
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160825
Approved by: https://github.com/Skylion007, https://github.com/ngimel
After https://github.com/pytorch/pytorch/pull/160635, I can see dependabot creating the PR to bump the `transformers` version at https://github.com/pytorch/pytorch/pull/160807. This is a good start, but there are several tweaks we need:
1. Run inductor tests on the PR, including one round of the perf benchmark, which is always needed. So, we need the `ciflow/inductor` label and a `pull_request` trigger for the benchmark.
2. Per @anijain2305's feedback, we don't need to update the patch version, so I add a rule to ignore it. Again, we will need to test this out after it lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160935
Approved by: https://github.com/anijain2305
Summary:
LLVM has a warning `-Wunused-local-typedef` which we are enabling to remove unused code. This has the side effect of making it easier to do refactors such as removing unnecessary includes.
For questions/comments, contact r-barnes.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan:
Sandcastle
Rollback Plan:
Differential Revision: D80511128
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160944
Approved by: https://github.com/cyyever, https://github.com/Skylion007
This diff makes it so that the portion saving guards that can throw is completely separated from GuardBuilder, and instead in `serialize_guards`. This lets me add a try catch around it for caching precompile later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160662
Approved by: https://github.com/zhxchen17
Summary: Triton 3.4 is the most commonly used variant of Triton with Inductor, but if someone is working with an alternative version of Triton, the version check may not match. This replaces the check for Triton 3.4 with a check for any variant that supports the TMA APIs.
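A hedged sketch of a feature probe instead of a version compare (the attribute names are illustrative, not necessarily the exact check the PR adds):
```python
def has_tma_support() -> bool:
    try:
        import triton.language as tl
    except ImportError:
        return False
    # Probe for the tensor-descriptor (TMA) API surface rather than
    # comparing triton.__version__ against "3.4".
    return hasattr(tl, "make_tensor_descriptor") or hasattr(
        tl, "_experimental_make_tensor_descriptor"
    )
```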
Test Plan:
Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda`
Rollback Plan:
Differential Revision: D80348643
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160747
Approved by: https://github.com/NikhilAPatel
Summary: The out variant has to be strided like self. Since the memory format isn't provided, this should be equivalent.
Test Plan:
Previously, when we enabled static dispatch, this test would have numeric issues:
```
buck2 test //caffe2/test:test_export -- test__scaled_dot_product_flash_attention_cpp_runtime_nonstrict --print-passing-details
```
Rollback Plan:
Reviewed By: SherlockNoMad
Differential Revision: D80191085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160560
Approved by: https://github.com/SherlockNoMad
# Context
Another fix to enable broad rollout of #149334.
The implementation assumes that the trainer process with local rank `n` only uses device `cuda:n`. However, there are sometimes jobs with more than one GPU per process, in which case our assumption could be incorrect and actually lead to worse memory locality.
# This PR
As titled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160848
Approved by: https://github.com/kiukchung
Common benchmark suites like TritonBench use `triton.testing.do_bench` for kernel timing measurement, which is not always fair to all backends. E.g. it includes torch.compile's Dynamo invocation overhead and hence doesn't reflect the real-world model use case, where Dynamo overhead is usually hidden.
I also opened a PR to use this timing measurement function on the TritonBench side: https://github.com/meta-pytorch/tritonbench/pull/333. But regardless of whether that PR can land, I think we should enhance Inductor's benchmark_gpu to match do_bench's features, to make it easier for people to migrate.
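For context, a small example of the kind of measurement `do_bench` performs on a compiled callable (illustrative; Inductor's `benchmark_gpu` is the in-tree alternative being enhanced here):
```python
import torch
import triton.testing

@torch.compile
def f(x):
    return torch.relu(x) + 1

x = torch.randn(4096, 4096, device="cuda")
f(x)  # warm up and compile once

# Every call below goes through torch.compile's dispatch/guard checks, so the
# reported time includes Dynamo invocation overhead that steady-state model
# runs typically hide.
ms = triton.testing.do_bench(lambda: f(x))
print(f"do_bench over the compiled callable: {ms:.4f} ms")
```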
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160921
Approved by: https://github.com/BoyuanFeng
[fx_graph_cse](https://github.com/pytorch/pytorch/blob/main/torch/_functorch/compile_utils.py#L46) is executed in the min-cut partitioner and accidentally creates aliasing for empty buffers; we can see the following graph node in the joint graph with cmd: "pytest test/functorch/test_control_flow.py -k test_scan_multiple_layers_gradient_layers_2_device_cpu"
```python
while_loop = torch.ops.higher_order.while_loop(while_loop_cond_graph_0_0, while_loop_body_graph_0_0, (full_default_4, empty_strided_default, full_default_2, full_default_3, full_default_2, full_default_3, full_default, full_default, rev, rev_1, rev_2, rev_3), (primals_4, primals_5, primals_6, primals_7));
```
Notice the operands sequence **"full_default_2, full_default_3, full_default_2, full_default_3, full_default, full_default"**, which indicates that the gradients of different layers now share the same buffer, creating silent incorrectness.
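A standalone sketch of why CSE must not deduplicate buffer allocations (plain eager code, not the partitioner itself):
```python
import torch

def grads_buggy_cse():
    buf = torch.zeros(4)   # one allocation standing in for two "identical" nodes
    return buf, buf

def grads_correct():
    return torch.zeros(4), torch.zeros(4)  # distinct writable buffers

g1, g2 = grads_buggy_cse()
g1.add_(1.0)
print(torch.equal(g1, g2))  # True -- the second layer's gradient is corrupted

g1, g2 = grads_correct()
g1.add_(1.0)
print(torch.equal(g1, g2))  # False -- the buffers stay independent
```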
Fixes https://github.com/pytorch/pytorch/pull/158168.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160668
Approved by: https://github.com/zou3519
ghstack dependencies: #160548, #160374
**Summary:** In its current state, FSDP collectives use CUDA synchronizations and communication ops regardless of the world size. However, now that replicate will use FSDP, there will be instances where the group size is 1 and these synchronizations and ops are needless. I have updated fsdp_collectives to skip reduce_scatter in the foreach_reduce API when world_size = 1. I have edited a test that uses CommDebugMode to verify that the reduce_scatter has been removed, and I edited an affected test that used 1-way FSDP by verifying and updating its assert statements for CommDebugMode. I have also added a test command; a sketch of the control flow follows.
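A hedged sketch of the control flow (illustrative names, not the actual FSDP internals):
```python
import torch
import torch.distributed as dist

def maybe_reduce_scatter(unsharded_grad, group):
    world_size = dist.get_world_size(group) if group is not None else 1
    if world_size == 1:
        # Group size 1 (replicate-as-HSDP): the unsharded gradient already
        # equals the reduced, sharded gradient, so skip the collective and
        # its associated synchronization entirely.
        return unsharded_grad
    out = torch.empty(
        unsharded_grad.numel() // world_size,
        dtype=unsharded_grad.dtype,
        device=unsharded_grad.device,
    )
    dist.reduce_scatter_tensor(out, unsharded_grad, group=group)
    return out
```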
**Test Cases**
1. pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_single_worldsize1
2. pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_tp_with_fsdp_offloading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160136
Approved by: https://github.com/weifengpy
ghstack dependencies: #160135
Summary: Currently the implementation of [fbgemm_linear_fp16_weight](https://www.internalfb.com/code/fbsource/[ffe8ba561cb6af33fde5b32c27411d6d3f4f2c70]/fbcode/caffe2/aten/src/ATen/native/QuantizedLinear.cpp?lines=477) does not allow None for `bias`, but it's actually a valid case, and internally `fbgemm_linear_fp16_weight_fp32_activation` accepts a None bias as well. For BC reasons, we can't directly change the function signature, so we wrap an empty tensor when bias is None to work around it in Sigmoid.
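A minimal sketch of the workaround at the call site (an illustrative wrapper; the actual change lives in Sigmoid):
```python
import torch

def call_with_optional_bias(linear_op, x, packed_weight, bias=None):
    if bias is None:
        # The public op requires a Tensor bias, so substitute an empty one.
        bias = torch.empty(0, dtype=torch.float32, device=x.device)
    return linear_op(x, packed_weight, bias)

# e.g. call_with_optional_bias(torch.fbgemm_linear_fp16_weight, x, pw, None)
```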
Test Plan:
P1906210273
```
MODEL_TYPE=dpa_product_first_ctr_model
MODEL_ENTITY_ID=778442870
SNAPSHOT_ID=6
MODULE=user
SUFFIX=.predictor.precompute.remote_request_only
buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice="" --benchmarkDontRebatchSamples=true --doNotRandomizeSampleInputs=true --benchmarkNumIterations=10000 &> ~/logs/${MODEL_TYPE}/load_net_predictor_${MODEL_ENTITY_ID}_${SNAPSHOT_ID}_${MODULE}
```
Rollback Plan:
Reviewed By: henryoier, hl475
Differential Revision: D80382652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160802
Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier
We are refactoring dynamo code for convert frame so that we can have modularized pieces sharable between different compiler frontends (e.g. torch.compile, precompile and torch.export).
One incremental step we can take is to refactor out InstructionTranslator as a functional piece providing bytecode tracing.
To separate out this part, we note that the tracer object is currently passed around through the entire convert frame compile function. This is not ideal because we want to build a boundary between tracing and the downstream compiler stack. Ideally, we should extract all the relevant information out of the tracer object and return a new data structure that is free of the internal state of InstructionTranslator.
Luckily, not much data is used from the tracer after tracing is finished. The major piece is OutputGraph; other than that, we only need to record two boolean flags for error handling purposes.
The new type we're adding is called DynamoTracerOutput, which contains all the information needed by torch.compile internals after symbolic convert is finished. To simplify the current PR, we leave out the part that reduces OutputGraph to a minimal set, since this can be done in a separate PR.
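A hedged sketch of the shape of the new boundary type (field names are illustrative except where the description above states them):
```python
from dataclasses import dataclass

@dataclass
class DynamoTracerOutput:
    # The traced OutputGraph; a follow-up will slim this to a minimal set.
    output_graph: object
    # Two boolean flags recorded purely for error handling downstream
    # (illustrative names -- the description only says "two boolean flags").
    error_flag_a: bool
    error_flag_b: bool
```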
Differential Revision: [D80388693](https://our.internmc.facebook.com/intern/diff/D80388693/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160814
Approved by: https://github.com/tugsbayasgalan