Summary:
**Changes**
* Deleted function import from build definition utilities
* Removed `load("//tools/build_defs:fbsource_utils.bzl", "is_arvr_mode")`
* Replaced is_arvr_mode() function calls with direct references to configuration flags
* Changed from `is_arvr_mode()` to `"ovr_config//build_mode:arvr_mode"`
* Changed conditional expressions to Buck `select()` statements
Test Plan:
Check if CI passes
Rollback Plan:
Differential Revision: D78520947
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158682
Approved by: https://github.com/malfet
Fixes#147140
## Changes
- Add `to` implementation in `MaskedTensor` to support move `mask` to target device
## Test Result
```python
In [1]: import torch
...: from torch.masked import as_masked_tensor
...: data = torch.tensor([1,2,3])
...: mask = torch.tensor([True,False,True])
...: mt = as_masked_tensor(data, mask).to('cuda')
...: mt.get_data().device, mt.get_mask().device
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
return MaskedTensor(data, mask)
/home/zong/code/pytorch/torch/masked/maskedtensor/_ops_refs.py:354: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
return MaskedTensor(new_data, _maybe_get_mask(args[0]))
Out[1]: (device(type='cuda', index=0), device(type='cuda', index=0))
In [2]: mt.sum(dim=0)
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
return MaskedTensor(data, mask)
Out[2]: MaskedTensor(4, True)
```
```bash
pytest test/test_maskedtensor.py -vv
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151205
Approved by: https://github.com/ezyang
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/releases">requests's releases</a>.</em></p>
<blockquote>
<h2>v2.32.4</h2>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file. (<a href="https://redirect.github.com/psf/requests/issues/6965">#6965</a>)</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
<li>Dropped support for pypy 3.9 following its end of support. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
</ul>
<h2>v2.32.3</h2>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/blob/main/HISTORY.md">requests's changelog</a>.</em></p>
<blockquote>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS.</li>
<li>Dropped support for pypy 3.9 following its end of support.</li>
</ul>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="021dc729f0"><code>021dc72</code></a> Polish up release tooling for last manual release</li>
<li><a href="821770e822"><code>821770e</code></a> Bump version and add release notes for v2.32.4</li>
<li><a href="59f8aa2adf"><code>59f8aa2</code></a> Add netrc file search information to authentication documentation (<a href="https://redirect.github.com/psf/requests/issues/6876">#6876</a>)</li>
<li><a href="5b4b64c346"><code>5b4b64c</code></a> Add more tests to prevent regression of CVE 2024 47081</li>
<li><a href="7bc45877a8"><code>7bc4587</code></a> Add new test to check netrc auth leak (<a href="https://redirect.github.com/psf/requests/issues/6962">#6962</a>)</li>
<li><a href="96ba401c12"><code>96ba401</code></a> Only use hostname to do netrc lookup instead of netloc</li>
<li><a href="7341690e84"><code>7341690</code></a> Merge pull request <a href="https://redirect.github.com/psf/requests/issues/6951">#6951</a> from tswast/patch-1</li>
<li><a href="6716d7c9f2"><code>6716d7c</code></a> remove links</li>
<li><a href="a7e1c745dc"><code>a7e1c74</code></a> Update docs/conf.py</li>
<li><a href="c799b8167a"><code>c799b81</code></a> docs: fix dead links to kenreitz.org</li>
<li>Additional commits viewable in <a href="https://github.com/psf/requests/compare/v2.32.2...v2.32.4">compare view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158006
Approved by: https://github.com/Skylion007
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
Collects some scattershot improvements made while attempting to enable training for AOTInductor. Non-typing changes are:
1. Swapping a few custom searches for the output node in an FX graph for calling `graph.output_node()`.
2. Removing two unused parameters from `torch.export._unlift._unlift`.
3. Switching handles to constants in `cpp_wrapper_cpu` to use C++ references for memory efficiency.
4. Cleaning out unused, unexported imports from `torch/export/__init__.py`, and adding one missing export to `__all__`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158075
Approved by: https://github.com/Skylion007
Implements collective alltoall operation for NVSHMEM Triton kernels. Enables data exchange where each PE sends unique data to every other PE in the team.
Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_alltoall`
<details>
<summary>Quick debug print for sanity check</summary>
```markdown
============================================================
[Rank 0] Starting alltoall test with world_size=2
============================================================
[Rank 0] Configuration:
- nelems_per_pe: 2
- dtype: torch.int64, element_size: 8 bytes
- nelems_bytes: 16
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:1653: NULL value get_device_list failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:1653: NULL value get_device_list failed
[Rank 0] Preparing source data:
[Rank 1] Preparing source data:
- Data for PE 0: [0, 0] (indices 0-1)
- Data for PE 1: [1, 1] (indices 2-3)
[Rank 0] Complete source buffer: [0, 0, 1, 1]
- Data for PE 0: [100, 100] (indices 0-1)
- Data for PE 1: [101, 101] (indices 2-3)
[Rank 1] Complete source buffer: [100, 100, 101, 101]
[Rank 1] Initial destination buffer: [-1, -1, -1, -1]
[Rank 0] Initial destination buffer: [-1, -1, -1, -1]
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[rank0]:[W716 15:30:06.215666766 ProcessGroupNCCL.cpp:5064] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
[rank1]:[W716 15:30:06.215752786 ProcessGroupNCCL.cpp:5064] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
NCCL version 2.27.5+cuda12.4
[Rank 1] Executing alltoall operation...
[Rank 0] Executing alltoall operation...
[Rank 1] alltoall operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[Rank 0] alltoall operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[Rank 0] Results after alltoall:
[Rank 1] Results after alltoall:[Rank 0] Destination buffer: [0, 0, 100, 100]
[Rank 0] Verifying results:
- From PE 0 (indices 0-1):
Expected: [0, 0]
Actual: [0, 0]
[Rank 1] Destination buffer: [1, 1, 101, 101]
[Rank 1] Verifying results:
- From PE 0 (indices 0-1):
Expected: [1, 1]
Actual: [1, 1]
Match: ✓
Match: ✓
- From PE 1 (indices 2-3):
Expected: [100, 100]
- From PE 1 (indices 2-3):
Expected: [101, 101]
Actual: [100, 100]
Actual: [101, 101]
Match: ✓
Match: ✓
[Rank 0] ============================================================
[Rank 0] Summary: ALL TESTS PASSED ✓
[Rank 0] Data flow explanation:
- Each rank sends 2 elements to every other rank
[Rank 1] ============================================================
[Rank 1] Summary: ALL TESTS PASSED ✓
- Rank 0 sent: [0, 0, 1, 1]
[Rank 1] Data flow explanation:
- Each rank sends 2 elements to every other rank
- Rank 0 received: [0, 0, 100, 100]
- My data for PE 0 (0) went to PE 0's buffer
- I received PE 0's data for me (0)
- My data for PE 1 (1) went to PE 1's buffer
- Rank 1 sent: [100, 100, 101, 101]
- I received PE 1's data for me (100)
[Rank 0] ============================================================
- Rank 1 received: [1, 1, 101, 101]
- My data for PE 0 (100) went to PE 0's buffer
- I received PE 0's data for me (1)
- My data for PE 1 (101) went to PE 1's buffer
- I received PE 1's data for me (101)
[Rank 1] ============================================================
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158513
Approved by: https://github.com/fduwjj, https://github.com/mandroid6
ghstack dependencies: #158511, #158512
This PR attempts to cache:
* codegen for cutlass backend for the same kernel. Even if runtime params are different.
From some profiling, most of the time spent is on render. So we only target to cache that part for now.
The output of render is `code`, and we are able to cache that easily. Also, I have to cache size_args, since it depends on `kernel.get_dynamic_shape_args()`, which depends on the state of self when we call render.
make_key is doing most of the work here: We are hashing on input node layouts, output node layout and op.configuration_name() (this is what hash(op) would do anyway).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156781
Approved by: https://github.com/ColinPeppler
Summary: This test often fails internally -- looks like it's because autotuning sometimes chooses not to do the epilog tuning. Turning off `benchmark_epilogue_fusion` seems to fix.
Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -- --exact 'caffe2/test/inductor:max_autotune - test_cat_max_autotune_triton (caffe2.test.inductor.test_max_autotune.TestMaxAutotune)' --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158589
Approved by: https://github.com/eellison
**Background**:
```Shell
torch 2.5.1+cpu
torchvision 0.20.1
```
```Python
import torch
import torchvision
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils # usort:skip
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
def meta_nms(dets, scores, iou_threshold):
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
handle = entry.fake_impl.register(func_to_register, source)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```
**Cause**:
```
torchvision's .so file lacks some symbol definitions, because these symbols come from CUDA, but the current environment does not have CUDA and GPU. The above error message is very confusing.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
When running BundledAOTAutogradCache with precompile, we still need to run triton bundling so that the precompiled CompiledFxGraph has triton cuda kernels. We also pre save the autotune results in the precompile artifact.
It would be even better to pre trim the cuda kernels on save and apply them, which we can work on later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158048
Approved by: https://github.com/zhxchen17
Fixes https://github.com/pytorch/pytorch/issues/158382
```
renamed: torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py -> torch/_functorch/_aot_autograd/graph_capture.py
renamed: torch/_functorch/_aot_autograd/traced_function_transforms.py -> torch/_functorch/_aot_autograd/graph_capture_wrappers.py
renamed: torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py -> torch/_functorch/_aot_autograd/graph_compile.py
```
Everything else is ONLY import changes. I did not rename any functions
even if we probably should have.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158449
Approved by: https://github.com/jamesjwu
Change the default value of min_chunk_size from 4096 to 512 to allow more for loops to be parallelized.
I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
Adds `sync_all()` function for local store visibility synchronization in NVSHMEM Triton kernels. Provides memory ordering for local operations without remote completion guarantees.
Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_sync`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158512
Approved by: https://github.com/fduwjj
ghstack dependencies: #158511
This one
```
Compiling /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal to Pooling_30.air
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:172:1: warning: non-void function does not return a value in all control paths [-Wreturn-type]
}
^
1 warning generated.
```
Although functionally one is not supposed to hit this codepath ever, it's not not to throw warning
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158729
Approved by: https://github.com/Skylion007
Adds device-side barrier synchronization and PE identification functions for NVSHMEM Triton integration. Includes `barrier_all()` for collective synchronization and `my_pe()`/`n_pes()` for PE identification within kernels.
We are launching with cooperative grid launch (for all the PRs in this stack) because the `nvshmemx_collective_launch` function must be used to launch kernels on the GPU when the kernels use NVSHMEM synchronization or collective APIs, and `nvshmemx_collective_launch` essentially boils down to a CUDA cooperative group launch.
Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_barrier`
Also tested that if you remove the barrier, you get an assertion error/race conditions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158511
Approved by: https://github.com/fduwjj
This refactor ensures all registered template hooks have been finalised before accessing the code object of the template. In `simd.SimdScheduling.codegen_template` the template hooks are finalised manually with `template.finalize_hook(hook_name)` calls, so it is the responsibility of the caller to finalise all the template hooks. This PR adds:
- `RenderPartial.finalize_remaining` a function that can be called at the end to finalise the remaining active hooks after a selection of hooks have been finalised manually.
- A test with a custom template implementation that registers custom hooks that the scheduler needs to finalise. This test should fail if the scheduler does not finalise the registered custom hook.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157270
Approved by: https://github.com/eellison
Summary: export_for_training exist because we couldn't migrate internal usages of export to the final IR. Now that we have completed the migration, we should deprecate and delete this API.
Test Plan:
CI
Rollback Plan:
Differential Revision: D78240836
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158203
Approved by: https://github.com/JacobSzwejbka
Related to https://github.com/pytorch/pytorch/issues/157517
Detect when users are executing torch build with cuda 12.8/12.9 and running on Maxwell or Pascal architectures.
We would like to include reference to the issue: https://github.com/pytorch/pytorch/issues/157517 as well as ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.
Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
Found <GPU Name> which is of cuda capability 5.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is 7.0.
warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
Please see https://github.com/pytorch/pytorch/issues/157517
Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```
Please note I reverted original PR https://github.com/pytorch/pytorch/pull/158301 because it broke internal users. This is a reland, added added check for non empty torch.cuda.get_arch_list()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158700
Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/eqy
This PR introduces the rest of the keyword-arguments added in DLPack
version 2023.12: `dl_device` and `copy`.
In summary, we handle these arguments in the C++ implementation of
`to_dlpack(...)` at _torch/csrc/Module.cpp_, by calling the
`maybeCopyTensor` function at _aten/src/ATen/DLConvertor.cpp_. It also
introduces the following changes:
- Add a new Python API `torchDeviceToDLDevice()`, which is simply a
refactoring of the `getDLDevice()` function at
_aten/src/ATen/DLConvertor.cpp_.
- Add both keyword-arguments to the `from_dlpack()` function at
_torch/utils/dlpack.py_ and to the `Tensor.__dlpack__()` dunder
method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150218
Approved by: https://github.com/albanD
ghstack dependencies: #150216, #150217
https://github.com/pytorch/pytorch/pull/149946 modified some checks that verify whether async-TP is "applicable" to a given collective operation in a graph. Before, the pattern-mathcing+replacement would just be skipped, but now these are asserts that fail and raise.
This is causing concrete issues in some graphs where 2-dimensional device meshes are being used (e.g., TP + CP) but only one dimension has symm-mem enabled. See #158569.
This PR is turning these asserts back into harmless early-exits. Note that this only needed to be done for reduce-scatters, as it was already the case for all-gathers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158572
Approved by: https://github.com/danielvegamyhre, https://github.com/atalman
By setting the name of the domain libraries to build via `BUILD_ADDITIONAL_PACKAGES` environment variable, the build job will build them and make them available as artifacts in the same way as the PyTorch CI wheel. To ensure that this doesn't break CI, the test job will still build them as usual if the wheels are not there. Building dependencies like FBGEMM on the test job is bad, especially for GPU jobs, because it leave the GPU resource idle
Fixes https://github.com/pytorch/pytorch/issues/152024
Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158600
Approved by: https://github.com/yangw-dev
ghstack dependencies: #158598, #158599
The previous strategy directly used 'self' input strategy for 'src'
input. The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.
E.g. for this program, broadcasting will occur on dims 0,1,3 of self.
```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```
These are the correct sharding combinations:
| self | src |
|-------|------|
| Shard(0) | Replicate() |
| Shard(1) | Replicate() |
| Shard(2) | Shard(0) |
| Shard(3) | Shard(1) |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158490
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
by mixing input_specs that were frozen from
select_strategy.strategies[0] with output_specs that varied across
select_strategy.strategies[0..N] (e.g. we could create a nonsense
strategy like input:Shard(0) output(Replicate) for an op like clone
- fixes the redistribute costs: they should not actually be 0, they
should be the cost of redistributing our single input from another
strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
input strategy instead of creating this new object
- Currently, using default_strategy is incorrect becuase it maps 'self'
tensor's strategies directly onto 'src' tensor without accounting for
the fact that copy_ supports broadcasting a smaller rank tensor into a
larger one.
Separates out copy_ op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR
Renames to `propagate_single_input_strategy` since that's more
descriptive
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
Summary:
Fix the error that occurs in the devarm environment when compiling with Clang:
```
caffe2/torch/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu:97:20: error: 'get_buffer' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
97 | virtual at::Tensor get_buffer(int
| ^
caffe2/torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp:56:20: note: overridden virtual function is here
56 | virtual at::Tensor get_buffer(int rank, c10::IntArrayRef sizes, c10::ScalarType dtype, int64_t storage_offset) = 0;
| ^
1 error generated.
```
Test Plan:
See D78520305
Rollback Plan:
Differential Revision: D78517953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158597
Approved by: https://github.com/janeyx99
I feel uneasy about touching `__warningregistry__` since it is undocumented and private surface. The only public API hook that doesn't increment warnings version seems to be https://docs.python.org/3/library/warnings.html#warnings.showwarning.
So we could wack a mole all the warnings muters in compile to just not display warnings, and we wouldn't invalidate warnings cache. This PR adds it for torch/_dynamo, and I didn't find any warnings versioning mutation from torch/_inductor.
There is a behavior change if someone calls a compiled graph with simplefilter("error"):
```python
# e.g. test/dynamo_expected_failures/TestAutogradFallback.test_no_autograd_kernel_inplace_mode_nothing
with warnings.catch_warnings():
warnings.simplefilter("error") # turns all warnings into errors
compiled_fn() # will throw if any of the muted warnings fire
```
FIXES https://github.com/pytorch/pytorch/issues/128427
A note for the future: The warnings module doesn't offer a thread safe way of using it. Even regular filters have this problem, directly editing `__warningregistry__` would be very bad, and this PR would mute all threads. Someone will need to build a thread safe warnings interface.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158520
Approved by: https://github.com/anijain2305, https://github.com/zou3519
Summary: This allows us to start alerting on cache failures, based on scuba data
Test Plan:
Added new tests explicitly for the Remote Cache API.
Note that we have existing tests for memcache, but not for manifold AFAICT.
There are two potential wrinkles. One we're adding a new field (and everything uses ScubaData AFAICT, so this should just work).
The other one is the implicit api contract that if the sample is None, then it will be ignored (and not crash). I believe the second one is implemented correctly (and tested). The first one is a little more nebulous, but I think won't cause any breakages.
Also manually ran a compile and made sure it didn't break - P1851504490 as well as forcing it to break and checking we didn't screw up the exception handling - P1851504243
Rollback Plan:
Differential Revision: D77054339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156874
Approved by: https://github.com/oulgen, https://github.com/masnesral
Summary: Small change to make the `times` and `repeat` variables controllable as command line args.
Test Plan:
Execute:
```
buck2 run <run params> <path>:inductor_benchmark -- --times=1 --repeat=1
```
Only runs once, and without passing the args it runs with default values of 10.
Rollback Plan:
Reviewed By: malfet
Differential Revision: D78458680
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158590
Approved by: https://github.com/FindHao, https://github.com/malfet
Adds a pre-commit hook (technically a pre-push hook) to the PyTorch repo.
**This is currently an opt-in feature**, which one can opt into by running `python scripts/setup_hooks.py` locally.
### Features
- **Run Lintrunner Before Push**: Before every `git push`, automatically runs lintrunner on your changes.
- Really need to skip the checks? Run `git push --no-verify`
- **Consistent, Isolated, Lintrunner Environment**: During pre-push, Lintrunner runs in it's own virtual en environment that contain all lintrunner dependencies in a consistent, isolated environment. No more lintrunner failures because you created a new .venv. (Did you know you needed to run `lintrunner init` every time you make a new .venv?)
- **Dependencies Automatically Updated**: If .lintrunner.toml is updated, this will automatically re-run `lintrunner init` to ensure you install the latest dependencies specified
### Installation
- Run `python scripts/setup_hooks.py`. Now every `git push` will first run lintrunner.
### Additional details
- The lintrunner used by the pre-push hook runs in a special per-repo virtual environment managed by the commit-hook tool located under `$USER/.cache/pre-commit`
- Does not affect your regularly used lintrunner
- Manual invocations of lintrunner will continue to depend on your local environment instead of the special pre-push one. If there's enough interest, we could explore consolidating them.
- Does not run `lintrunner -a` for you.
- You still need to manually run that (can be changed later though!)
- Have staged/unstaged changes? No worries
- This runs `git stash` before running the pre-commit hooks and pops back your changes afterwards, so only the changes actaully being pushed will be tested
### Downsides
- No streaming UI updates
- While you still get the same output from lintrunner that you're used to, the commit-hook framework doesn't show any output while lintrunner is actually running. Instead, it shows the entire output after linter has completed execution, which could be a few minutes (especially if it has to run `lintrunner init` first)
- `uv` installation is required to run the setup script. The setup script will ask users to install uv if it's not available.
- This is required to be able to install the pre-commit package in a safe way that's available no matter what .venv you are running in.
### Opting out
- Disable hook for a single push: Run `git push --no-verify`
- Disable hook permanently: If something goes wrong and you need to wipe your setup:
- Delete the `$USER/.cache/pre-commit` folder and the `.git/hooks/pre-push` file in your local repo.
- You can now rerun `python scripts/setup_hooks.py` to setup your git push hook again if you want.
### Potential Future Changes
Things that could be done to make this even better if folks like these ideas:
- Automatic setup
- Our `CONTRIBUTING.md` file tells devs to run `make setup-env`. That could be a good entry point to hook the installation into
- Fix the console output streaming
- Make every lintrunner invocation (including manual ones) use the same repo-specific venv that the commit-hook uses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158389
Approved by: https://github.com/seemethere
Fixes#156587
This sets lower bounds for fsspec and networkx in both setup.py and requirements,txt.
- fsspec>= 0.8.5 (released December 15, 2020)
- netowrkx>= 2.5.1 (released April 3, 2021)
These are the first stable versions released after Python 3.9 came out on October 5, 2020. Since Python 3.8 is no longer maintained, setting these minimums helps ensure PyTorch won't be installed alongside unexpectedly old versions of these packages.
Tested with these versions locally to make sure they don't break anything. Adding CI for lower-bound testing could be a follow up later if need.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158565
Approved by: https://github.com/janeyx99
Summary: These tests failing internally because the number of underlying calls to the rng differ by virtue of various library initializations that get sucked in with an internal build.
Test Plan:
```
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_random_object' --run-disabled
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_random_values_with_graph_break' --run-disabled
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_feed_random_values_into_graph_only' --run-disabled
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158485
Approved by: https://github.com/williamwen42
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo
This PR adds strict typing support to an important set of utilities in dynamo, `repro/` and the base `debug_utils.py`
Running
```
mypy torch/_dynamo/repro/ torch/_dynamo/debug_utils.py --linecount-report /tmp/coverage_log
```
| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 905 | 3268 | 27.69% | 22 | 81 | 27.16% |
| This PR | 3368 | 3368 | 100.00% | 81 | 81 | 100.00% |
| Delta | +2463 | +100 | +72.31% | +59 | 0 | +72.84% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158504
Approved by: https://github.com/mlazos
Summary: For case like caching_precompile, we almost always want to drop ID_MATCH-type guards since they will block serialization. This diff add this behavior when this global flag is toggled on so that ID_MATCH guards are excluded from compilation and serialization.
Test Plan:
test_dynamo -- -k test_id_match_with_config
Rollback Plan:
Differential Revision: D78363609
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158368
Approved by: https://github.com/jamesjwu
The architecture version checks are unnecessary fine-grained in PyTorch. Considering the fact that PyTorch's Flash Attention works on all `sm_80+` machines, it makes more sense to just check for lower bound.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158558
Approved by: https://github.com/eqy
The previous strategy directly used 'self' input strategy for 'src'
input. The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.
E.g. for this program, broadcasting will occur on dims 0,1,3 of self.
```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```
These are the correct sharding combinations:
| self | src |
|-------|------|
| Shard(0) | Replicate() |
| Shard(1) | Replicate() |
| Shard(2) | Shard(0) |
| Shard(3) | Shard(1) |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158495, #158490
We want to do it for two reasons:
1. It's tedious for users to manually turn on capture_scalar_outputs=True when compiling map and scan with inductor, where we decomposing them into while_loop and use the idx tensor.item() to select a slice of output buffer and write into it. This pr turns on the flag by default.
2. a graph break caused by capture_scalar_outputs=False would cause the hop to fail, and we should turn it on by default so that the error message is more meaningful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158480
Approved by: https://github.com/zou3519
Address second part of #158366, where torch.tensor(0), is treated as a constant tensor and its .item() gets specailized to 0 which causes a silent specialization. The fix is to unspecialize the constant carries and make them non-constant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158381
Approved by: https://github.com/zou3519
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter since there only is one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.
For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is as follows which just removes `getPyInterpreter()` from the `check_pyobj` call.
```
mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
- // fast case: tensor is live in python
- std::optional<PyObject*> mb_obj =
- t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
- if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
- return *mb_obj;
- }
- return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+ // fast case: tensor is live in python
+ std::optional<PyObject*> mb_obj =
+ t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+ /*ignore_hermetic_tls=*/false);
+ if (mb_obj.has_value() &&
+ !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+ return *mb_obj;
+ }
+ return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
by mixing input_specs that were frozen from
select_strategy.strategies[0] with output_specs that varied across
select_strategy.strategies[0..N] (e.g. we could create a nonsense
strategy like input:Shard(0) output(Replicate) for an op like clone
- fixes the redistribute costs: they should not actually be 0, they
should be the cost of redistributing our single input from another
strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
input strategy instead of creating this new object
- Currently, using default_strategy is incorrect becuase it maps 'self'
tensor's strategies directly onto 'src' tensor without accounting for
the fact that copy_ supports broadcasting a smaller rank tensor into a
larger one.
Separates out copy_ op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR
Renames to `propagate_single_input_strategy` since that's more
descriptive
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #158495
Summary:
- Split `create_mapping` to `create_mapping_pre_post_grad_nodes` and ` create_node_mapping_kernel_to_post_grad`
- Store a mapping from pre_grad graph node names to stack traces in `_inductor_pre_grad_node_stack_trace`
- Add `stack_traces` member to ir.Node and add it to the string representation of ir.Node
- When we create an IR node, if `inductor.config.trace.provenance_tracing=True`, we populate `stack_traces` from `origins`. The nodes in `origins` are post_grad graph nodes. If a node has `node.stack_trace`, we store the stack_trace directly. This is particularly important for backward graph nodes because they don't have a mapping to pre-grad graph nodes. If a node doesn't have `.stack_trace ` (such as `linear`-> `addmm` nodes), we use the stack trace of the pre_grad graph nodes that it maps to.
- A post grad graph node might not have stack trace if it correspond to multiple pre grad graph nodes, e.g. [GroupLinearFusion](a00442421a/torch/_inductor/fx_passes/group_batch_fusion.py (L299))
Example:
```
scheduling ExternKernelOut(
python_kernel_name='extern_kernels.mm',
name=buf0,
layout=FixedLayout('cuda:0', torch.float32, size=[8, 16], stride=[16, 1]),
inputs=[InputBuffer(name='arg2_1', layout=FixedLayout('cuda:0', torch.float32, size=[8, 10], stride=[10, 1])), ReinterpretView(
StorageBox(
ConstantBuffer(name='fc1_weight', layout=FixedLayout('cuda:0', torch.float32, size=[16, 10], stride=[10, 1]))
),
FixedLayout('cuda:0', torch.float32, size=[10, 16], stride=[1, 10]),
origins=OrderedSet([mm_default_1]),
stack_traces = {,
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
x = self.fc1(x),
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
return F.linear(input, self.weight, self.bias),
}
)],
constant_args=(),
kwargs={},
output_view=None,
python_kernel_name=extern_kernels.mm,
cpp_kernel_name=at::mm_out,
ordered_kwargs_for_cpp_kernel=(),
op_overload=None,
arg_properties=[{}, {}],
allarg_properties={},
kwarg_properties=None,
unbacked_bindings={},
mutation_outputs=[],
origin_node=mm_default_1,
origins=OrderedSet([mm_default_1]),
stack_traces = {,
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
x = self.fc1(x),
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
return F.linear(input, self.weight, self.bias),
}
)
```
Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```
Rollback Plan:
Differential Revision: D78365534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158576
Approved by: https://github.com/angelayi
Include both the error stacktrace and the graphmodule in a new
structured trace artifact. Log the shortened version to the console,
and also log a hint to look at the tlparse for more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158469
Approved by: https://github.com/ezyang
If you reinstall numpy after having installed pandas, it will error out sometimes if the versions are different enough (see below snippet). This change forces pandas to be reinstalled when installing numpy. It doesn't work in a separate pip call, because then pip takes the version of numpy requested by pandas as the one to install, undoing the command in the first place.
```
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip list
Package Version
------------------ -----------
attrs 25.3.0
build 1.2.2.post1
certifi 2025.7.14
charset-normalizer 3.4.2
cmake 4.0.3
exceptiongroup 1.3.0
expecttest 0.3.0
filelock 3.18.0
fsspec 2025.5.1
hypothesis 6.135.32
idna 3.10
importlib_metadata 8.7.0
Jinja2 3.1.6
lintrunner 0.12.7
MarkupSafe 2.1.5
mpmath 1.3.0
networkx 3.2.1
ninja [1.11.1.4](https://www.internalfb.com/phabricator/paste/view/1.11.1.4)
opt-einsum 3.3.0
optree 0.16.0
packaging 25.0
pip 25.1
psutil 7.0.0
pyproject_hooks 1.2.0
python-dateutil 2.9.0.post0
pytz 2025.2
PyYAML 6.0.2
requests 2.32.4
setuptools 78.1.1
six 1.17.0
sortedcontainers 2.4.0
sympy 1.14.0
tomli 2.2.1
typing_extensions 4.14.0
tzdata 2025.2
urllib3 2.5.0
uv 0.7.21
wheel 0.45.1
zipp 3.23.0
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install numpy==1.22.4
Collecting numpy==1.22.4
Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)
Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Installing collected packages: numpy
Successfully installed numpy-1.22.4
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install pandas==2.0.3
Collecting pandas==2.0.3
Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: tzdata>=2022.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: numpy>=1.20.3 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (1.22.4)
Requirement already satisfied: six>=1.5 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas==2.0.3) (1.17.0)
Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
Installing collected packages: pandas
Successfully installed pandas-2.0.3
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install --pre numpy==2.0.2
Collecting numpy==2.0.2
Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.22.4
Uninstalling numpy-1.22.4:
Successfully uninstalled numpy-1.22.4
Successfully installed numpy-2.0.2
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ python
Python 3.9.23 (main, Jun 5 2025, 13:40:20)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/__init__.py", line 22, in <module>
from pandas.compat import is_numpy_dev as _is_numpy_dev # pyright: ignore # noqa:F401
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/__init__.py", line 25, in <module>
from pandas.compat.numpy import (
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/numpy/__init__.py", line 4, in <module>
from pandas.util.version import Version
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/__init__.py", line 2, in <module>
from pandas.util._decorators import ( # noqa:F401
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/_decorators.py", line 14, in <module>
from pandas._libs.properties import cache_readonly
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/_libs/__init__.py", line 13, in <module>
from pandas._libs.interval import Interval
File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158584
Approved by: https://github.com/huydhn
- Skip `test_index_put_accumulate_large_tensor_mps` as it crashes with
```
/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
```
while running `torch.ones([2**31+5], dtype=torch.int8, device='mps')`
- Adjust types for `test_index_put_src_datatype` as index_put on MPS is not implemented for complex (yet)
- Adjust `test_index` to avoid using DoubleTensors for MPS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158582
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/manuelcandales
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo
This PR adds strict typing support to an important file in dynamo, `decorators.py`
NOTE: Untyped fns are because there is a conflict with `__init__.py` in compiler so we can't type these at this time
Running
```
mypy torch/_dynamo/decorators.py --linecount-report /tmp/coverage_log
```
| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 209 | 908 | 23.02% | 9 | 39 | 23.08% |
| This PR | 870 | 943 | 100.00% | 36 | 39 | 100.00% |
| Delta | +661 | +35 | +76.98% | +27 | 0 | +76.92% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158509
Approved by: https://github.com/williamwen42
Summary:
The shims for aten ops are now generated by torchgen. But there are some still old APIs in `aoti_torch/c/shim.h`
This diff moves the old to-be-deprecated APIs for aten ops to a separate header file `shim_deprecated.h`
The to-be-deprecated APIs are determined by comparing APIs in `shim.h` and ops in `fallback_ops.py`
Test Plan:
CI
Rollback Plan:
Differential Revision: D78378373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158400
Approved by: https://github.com/jingsh, https://github.com/desertfire
Test modules that depend on the original definition of `wrapper_set_seed` will inadvertently be affected if they import from test_torchinductor_opinfo.py. Additionally, using pytest `test_torchinductor_opinfo.py test_other_module.py` when run in the same process may affect the test behaviour of `test_other_module.py` if the tests depend on `wrapper_set_seed`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158548
Approved by: https://github.com/janeyx99
This modifies the lint workflow to use the new get-changed-files
workflow to optimize lint execution by only running on files
that have actually changed in pull requests.
This more closely mirrors the type of behavior that users
expect when running lint locally on their PRs.
This also leaves the default behavior as a fallback for when
you're not running on a pull request.
Since lint runs on the pull_request event I'm not really worried about
any type of ciflow shenanigans in this.
This also splits mypy into its own job since mypy needs to run on all-files all the time.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158518
Approved by: https://github.com/huydhn
ghstack dependencies: #158517
Summary: This test is failing internally because the number of underlying calls to the rng differ by virtue of various library initializations that get sucked in with an internal build.
Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_longtensor_list' --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158458
Approved by: https://github.com/jansel
Summary: I tried to add some logic that could optimize for the non-row wise sharded case and do it more efficiently, but this has some bugs, so removing it for now and will find a better algorithm for the non-row wise sharded case to find the maximum number of bytes that we can write at a time.
Test Plan:
ensure tests pass
Rollback Plan:
Differential Revision: D78366701
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158380
Approved by: https://github.com/Saiteja64
Update triton commit hash to `11ec6354315768a85da41032535e3b7b99c5f706`, which is the new release/3.4.x branch in triton-lang/triton.
Also, update HAS_WARP_SPEC handling: In triton 3.4, warp spec will have a different interface: num_consumer_groups will be determined automatically by the compiler. This breaks the current Inductor integration, so for now, update HAS_WARP_SPEC to check whether triton.Config takes num_consumer_groups and num_buffers_warp_spec as parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158459
Approved by: https://github.com/atalman
### Description
This PR is to enable TF32 as fp32 internal precision for matmul/linear/conv in `mkldnn backend`. Since we have refined fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend the API to support TF32 for `mkldnn backend`.
```
torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
```
Related kernel update and UTs update are done. And the wrapper `bf32_on_and _off` is updated to `reduced_f32_on_and_off`, and it can run tests 3 times, one is reduced_f32 OFF, the other two are reduced_f32 ON (including `bf32 ON` and `tf32 ON`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520
Approved by: https://github.com/mingfeima, https://github.com/jansel
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.
The scaling format is still detected from the sizes of the scale tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
Summary:
In pytorch, tensor.to("cuda") behaves differently from tensor.to("cuda:0).
tensor.to("cuda") will read from thread local DeviceGuard, aka cuda::current_device(), to infer the device index.
TBEPermute is relying on this behavior to route output tensor to a device specified by current thread.
For this reason, we remove the normalizeDevice(), and disallow index-less cuda device in Placement.
Device-to-device mapping must be done between concrete device!
Test Plan:
CI
Rollback Plan:
Differential Revision: D78443109
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158489
Approved by: https://github.com/henryoier
Summary: AOTI already has weights embedded in .so file. So for the initial load, no need to load the weights again. This allows lowered modules can have different set of weights on different hardwares.
Test Plan:
```
MODEL_TYPE=ads_mtml_offsite_cvr_oba_optout_dedicated_model
MODEL_ENTITY_ID=895279202
SNAPSHOT_ID=0
MODULE=merge
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true fbcode//caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE} --moduleName ${MODULE} --predictor-hardware-type 1 --submodToDevice "" --benchmarkDontRebatchSamples=true --benchmarkNumIterations 1000
```
Rollback Plan:
Differential Revision: D78383881
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158416
Approved by: https://github.com/henryoier, https://github.com/SherlockNoMad
This is related to: https://www.anaconda.com/legal/terms/terms-of-service
Trying to fix outage with docker builds.
https://github.com/pytorch/pytorch/actions/runs/16298993712/job/46033590799
Rocm and XPU builds since they use Miniforge are not affected
```
#22 ERROR: process "/bin/sh -c bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt" did not complete successfully: exit code: 1
------
> [base 14/42] RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt:
11.93 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
11.93 • https://repo.anaconda.com/pkgs/main
11.93 • https://repo.anaconda.com/pkgs/r
11.93
11.93 To accept a channel's Terms of Service, run the following and replace `CHANNEL` with the channel name/URL:
11.93 ‣ conda tos accept --override-channels --channel CHANNEL
```
Hence solution is:
1. using `` conda tos accept --override-channels --channel defaults``
2. use Miniforge instead of Miniconda.
Using solution 2.
Solution Tried that don't work:
1. Using ``CONDA_ALWAYS_YES = true ``
4. Using older version of miniconda
```
[Miniconda3-py310_25.5.1-0-Linux-x86_64.sh](https://repo.anaconda.com/miniconda/Miniconda3-py310_25.5.1-0-Linux-x86_64.sh)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158370
Approved by: https://github.com/seemethere
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).
Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always
Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
When select has data dependent input, we cant tell if the actual index shall be index+size or index.
to avoid throwing dde, we allocate a new unbacked symbol to represent the storage offset of the
output view and we compute its value dynamically at runtime when inductor is lowered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
When I modify the code located in test/cpp_extensions/open_registration_extension/torch_openreg/torch_openreg,
some unrelated format error occurred.
```Python
Lint for torch/_inductor/fx_passes/fuse_attention.py:
Error (CODESPELL) spelling error
Failed due to ValueError:
/pytorch/pytorch/torch/_inductor/fx_passes/fuse_attention.py:587: differnt
==> different
Please either fix the error or add the word(s) to the dictionary file.
HINT: all-lowercase words in the dictionary can cover all case variations.
Lint for torch/fx/traceback.py:
Error (MYPY) [assignment]
Incompatible types in assignment (expression has type "str", variable has
type "None")
101 |
102 | def _get_action_string(self):
103 | if self._action_string is None:
104 | self._action_string = "+".join([a.name.lower() for a in self.action])
105 | return self._action_string
106 |
107 | def print_readable(self, indent=0):
Error (MYPY) [assignment]
Incompatible types in assignment (expression has type "dict[str, Any]",
variable has type "None")
121 | if self._dict is None:
122 | # Convert the object to a dictionary
123 | action_string = self._get_action_string()
124 | self._dict = {
125 | "name": self.name,
126 | "target": self.target,
127 | "graph_id": self.graph_id,
Error (MYPY) [return-value]
Incompatible return value type (got "None", expected "dict[Any, Any]")
130 | "from_node": [node.to_dict() for node in self.from_node],
131 | }
132 |
133 | return self._dict
134 |
135 | def __eq__(self, other: object):
136 | if not isinstance(other, NodeSource):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158450
Approved by: https://github.com/Skylion007
This pull request refactors the `parse_type` function in `c10/core/Device.cpp` to improve the handling of the `PrivateUse1` device type. The main change involves reordering the logic to check for the `PrivateUse1` device type earlier in the function for better clarity and efficiency.
This help to migrate existed backend to PrivateUse1 smoothly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157609
Approved by: https://github.com/jgong5, https://github.com/albanD
If there is only one node passed to aten::cat, the argument is a single node,
rather than a list of nodes with a valid length.
Example stack
```
File "/dev/shm/uid-99/be3468a8-seed-nspid4026546656_cgpid14993614-ns-4026546628/torch/_inductor/pattern_matcher.py", line 1115, in apply
self.handler(match, *match.args, **match.kwargs)
File "/dev/shm/uid-99/be3468a8-seed-nspid4026546656_cgpid14993614-ns-4026546628/torch/_inductor/fx_passes/split_cat.py", line 1786, in merge_split_cat_aten
if len(cat_inputs) < threshold_to_cat:
torch._inductor.exc.InductorError: TypeError: object of type 'Node' has no len()
```
This has failed about 7 internal jobs in the last week, running pytorch trunk code from 06/15
I've attached a test which reproduces this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157155
Approved by: https://github.com/jansel
Summary:
As inductor provenance tracking is getting more use cases, we want to separate the inductor provenance tracking guarding flag from the general `trace.enabled`, so we can enable provenance tracking without all the overhead of `trace.enabled`
- change the guard flag from `trace.enabled` to `trace.provenance_tracking`. It is turned on by either `TORCH_COMPILE_DEBUG=1` or `INDUCTOR_PROVENANCE=1`.
- Move the provenance tracking logic and variables out of DebugContext, because DebugContext is only enabled with `trace.enabled`. Since the variables are now global variables, added `reset_provenance_globals()` context manager to reset them for each `compile_fx()` call.
- Move `set_kernel_post_grad_provenance_tracing` from `util.py` to `debug.py` so now all provenance related logic is in `debug.py`.
In the future, if we want to enable it further, we can change the provenance tracking flag to be enabled when `TORCH_TRACE` is set. I think we should do that in a separate PR, so it's easier to revert if this flag change creates any problem.
See more motivation in internal Diff
Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer
buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r graph_provenance
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```
Differential Revision: D78287976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158399
Approved by: https://github.com/angelayi
These change add an `eval()` API to PP schedules
## Context
Currently, you can run "Forward only" for a schedule in two ways:
1. Use a custom schedule `_ScheduleForwardOnly`
2. Do not pass in `loss_fn` in schedule constructor, and no backward computations will be executed.
However, this is still limiting because we may want to run forward through the pipeline / calculate the loss, but without backward, e.g. during validation. These changes allow for this.
```python
if self.rank == 0:
schedule.eval(x)
elif self.rank == self.world_size - 1:
losses = []
schedule.eval(target=target, losses=losses)
else:
schedule.eval()
```
TODO:
- in later PRs, we will deprecate the `_ScheduleForwardOnly`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157795
Approved by: https://github.com/wconstab
Related to https://github.com/pytorch/pytorch/issues/157517
Detect when users are executing torch build with cuda 12.8/12.9 and running on Maxwell or Pascal architectures.
We would like to include reference to the issue: https://github.com/pytorch/pytorch/issues/157517 as well as ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.
Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
Found <GPU Name> which is of cuda capability 5.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is 7.0.
warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
Please see https://github.com/pytorch/pytorch/issues/157517
Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158301
Approved by: https://github.com/nWEIdia, https://github.com/albanD
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/157750/export.html
Changes:
* Rename draft_export.md -> export.draft_export.md for consistency.
* Removed non-strict section in export, instead pointed to programming model doc.
* Extended "Expressing Dynamism" section to include Dim hints, ShapeCollection, and AdditionalInputs.
* Removed Specialization section in favor of programming model doc
* Added pt2 archive doc
* Cleaned up sidebar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157750
Approved by: https://github.com/pianpwk
It causes errors under C++20
```
/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:330:40:
error: call to consteval function 'fmt::fstring<>::fstring<std::string, 0>' is not a constant expression
```
Indeed the printed value is treated as format string and it may contain special chars in some cases. While this is not true in our case, it can't be determined in compile time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158436
Approved by: https://github.com/Skylion007
Fixes#71673
This fixes a bug in PyTorch indexing, that shows up when mixing multi-dimensional boolean masks with other forms of indexing. Examples:
```python
>>> import torch
>>> x = torch.ones([2, 2, 3])
>>> m = torch.tensor(((True, False), (False, False))) # (2x2 boolean mask)
>>> x[m].shape # this works fine (the boolean mask acts on the 2x2 subspace selecting one row)
torch.Size([1, 3])
>>> x[m, 0] # this should produce a tensor of shape (1,)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 3] at index 1
>>> x[m, ::2] # this should produce a tensor of shape (1, 2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 1, 3] at index 1
>>> x[m, None] # this should produce a tensor of shape (1, 1, 3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 1, 2, 3] at index 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158369
Approved by: https://github.com/ngimel
Summary: Test is failing internally because of the import from functorch.einops. _Maybe_ there's a way to get this dependence in the TARGETS file, but the obvious things didn't work. I'm wondering if this test is that important to have running in OSS and internally anyway?
Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:cuda_repro -- --exact 'caffe2/test/inductor:cuda_repro - test_repeated_masked_load (caffe2.test.inductor.test_cuda_repro.CudaReproTests)' --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158355
Approved by: https://github.com/eellison
Fixes following issue of building PyTorch with ROCm7.0:
```
-- verifying file...
file='/var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz'
-- SHA256 hash of
/var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz
does not match expected value
expected: '7e29c325d5bd33ba896ddb106f5d4fc7d715274dca7fe937f724fffa82017838'
actual: '1e9b3dddf0c7fc07131c6f0f5266129e83ce2331f459fa2be8c63f4ae91b0f5b'
-- Hash mismatch, removing...
CMake Error at aotriton_external-prefix/src/aotriton_external-stamp/download-aotriton_external.cmake:163 (message):
Each download failed!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158420
Approved by: https://github.com/jeffdaily
`libnvshmem_extension.so` creates an illusion that it is a shared library from NVSHMEM. But indeed it is built from torch source code, for symmetric tensor infrastructure and operations, though leveraging NVSHMEM APIs. Thus this PR renames `libnvshmem_extension.so` to `libtorch_nvshmem.so`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158234
Approved by: https://github.com/albanD
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.
The scaling format is still detected from the sizes of the scale tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
As popularly requested in user groups.
Test plan:
```
import torch
a = torch.randn(10000)
device = torch.device('cuda:1')
a = a.to(device)
```
Before:
```
Traceback (most recent call last):
File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
a = a.to(device)
^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
After:
```
Traceback (most recent call last):
File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
a = a.to(device)
^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
GPU device may be out of range, do you have enough GPUs?
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158395
Approved by: https://github.com/aorenste
Co-authored-by: Aaron Orenstein <aorenste@fb.com>
This is related to: https://www.anaconda.com/legal/terms/terms-of-service
Trying to fix outage with docker builds.
https://github.com/pytorch/pytorch/actions/runs/16298993712/job/46033590799
Rocm and XPU builds since they use Miniforge are not affected
```
#22 ERROR: process "/bin/sh -c bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt" did not complete successfully: exit code: 1
------
> [base 14/42] RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt:
11.93 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
11.93 • https://repo.anaconda.com/pkgs/main
11.93 • https://repo.anaconda.com/pkgs/r
11.93
11.93 To accept a channel's Terms of Service, run the following and replace `CHANNEL` with the channel name/URL:
11.93 ‣ conda tos accept --override-channels --channel CHANNEL
```
Hence solution is:
1. using `` conda tos accept --override-channels --channel defaults``
2. use Miniforge instead of Miniconda.
Using solution 2.
Solution Tried that don't work:
1. Using ``CONDA_ALWAYS_YES = true ``
4. Using older version of miniconda
```
[Miniconda3-py310_25.5.1-0-Linux-x86_64.sh](https://repo.anaconda.com/miniconda/Miniconda3-py310_25.5.1-0-Linux-x86_64.sh)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158370
Approved by: https://github.com/seemethere
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
Fixes#141563
In NumPy, an ellipsis always acts as a separator between advanced indices, even when the ellipsis doesn't actually match any dimensions. In PyTorch an empty ellipsis doesn't cause a separation. This leads to differing behavior between Numpy and PyTorch in this edge case.
This difference in behavior leads to a bug when using torch.compile:
```python
>>> import numpy as np
>>> f = lambda x: x[:,(0,1),...,(0,1)].shape
>>> a = np.ones((3, 4, 5))
>>> f(a)
(2, 3)
>>> torch.compile(f)(a)
(3, 2)
```
Similarly to #157676, this PR doesn't change PyTorch's behavior, but it fixes the translation layer, ensuring torch._numpy compatibility with NumPy. I am marking this PR as fixing #141563, even though PyTorch behavior isn't modified.
Notice that there are still some other bugs in PyTorch's advanced indexing, that need to be fixed (mainly regarding proper accounting of dimensions when multidimensional boolean masks are present). But those need to be fixed at the ATen operator level. Examples:
- #71673
- #107699
- #158125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158297
Approved by: https://github.com/soumith
**Background**:
```Shell
torch 2.5.1+cpu
torchvision 0.20.1
```
```Python
import torch
import torchvision
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils # usort:skip
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
def meta_nms(dets, scores, iou_threshold):
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
handle = entry.fake_impl.register(func_to_register, source)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```
**Cause**:
```
torchvision's .so file lacks some symbol definitions, because these symbols come from CUDA, but the current environment does not have CUDA and GPU. The above error message is very confusing.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
This is an improvement over https://github.com/pytorch/pytorch/pull/132595 . That PR improves the case where `device` is not given. This PR tries to improve the case where `device` is given but the first step of auto-infer device from `cudaPointerGetAttributes` can be wrong (undesired). See https://github.com/pytorch/pytorch/issues/158316 for more details on when this can happen.
I think this is a reasonable improvement, as people expect `torch.as_tensor` + cupy should be zero-copy as much as possible. However, it does change some behaviors, because previously it might incur a device-to-device copy.
I will leave it to pytorch developers to see if the improvement is worthwhile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158320
Approved by: https://github.com/ezyang
This PR adds ROCm 7.0 alpha docker builds to start testing latest ROCm in PyTorch CI and enable new MI350x hardware.
Highlights:
* Stop building `pytorch-linux-jammy-rocm-n-1-py3` docker images, as they're not currently used in any CI workflows
* Add `pytorch-linux-noble-rocm-alpha-py3` docker images that will use ROCm alpha (newer than latest official release) builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158390
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Hi team,
Please help review this trivial fix.
Without this change:
``` python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None
capture_overload_names (bool) : whether to include ATen overload names in the profile
```
With this change:
```python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None
An experimental config for Kineto features. Please note thatbackward compatibility is not guaranteed.
profiler_metrics : a list of CUPTI profiler metrics used
to measure GPU performance events.
If this list contains values Kineto runs in CUPTI profiler mode
profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
or for the entire measurement duration.
verbose (bool) : whether the trace file has `Call stack` field or not.
performance_events : a list of profiler events to be used for measurement.
enable_cuda_sync_events : for CUDA profiling mode, enable adding CUDA synchronization events
that expose CUDA device, stream and event synchronization activities. This feature is new
and currently disabled by default.
adjust_profiler_step (bool) : whether to adjust the profiler step to
match the parent python event duration. This feature is new and currently disabled by default.
disable_external_correlation (bool) : whether to disable external correlation
profile_all_threads (bool) : whether to profile all threads
capture_overload_names (bool) : whether to include ATen overload names in the profile
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156586
Approved by: https://github.com/sraikund16, https://github.com/cyyever
The starting point for this refactor is that I need access to the fully
general joint graph representation in an export-like interface, but I
then subsequently need a way to feed this joint graph into the rest of
the compilation pipeline so I can get an actual callable that I can run
once I've finished modifying it. Previously, people had added export
capabilities to AOTAutograd by having an export flag that toggled what
exactly the functions return and triggering aot_dispatch to go to a
different "export" implementation, but I've found this difficult to
understand and has lead to a bit of duplicate code for the export path.
So the idea here is to reorganize the structure of the function calls in AOTAutograd. Here, it is helpful to first describe how things used to work:
* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These call:
* create_aot_dispatcher_function. This does a bunch of stuff (forward metadata collection) and adds many context managers. This calls:
* One of aot_dispatch_base, aot_dispatch_export or aot_dispatch_autograd, which:
* Call aot_dispatch_autograd_graph or aot_dispatch_base_graph to actually do the graph capture
* Do some base/export/autograd specific post-processing on the graph
Notice the pattern of nested function invocations means that there is no way to easily get the graph capture result from the autograd case; furthermore, the export path is "bolted" on to force the entire chain of functions to have a different return result than normal, and no way to *resume* the rest of the post-processing to actually get a callable.
Here is the new structure:
* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These now orchestrate this top level flow:
* Start a context manager (stack); this stateful context block takes care of all of the nested context managers which originally necessitated the nested call structure
* Call create_aot_state to do initial setup and setup all the context managers on stack. These context managers do NOT exit upon return of this.
* Call aot_stage1_graph_capture to do the graph capture
* Call aot_stage2_compile or aot_stage2_export depending on what postprocessing you want
With this new structure, it's now possible (although not done in this PR) to return the graph after aot_stage1_graph_capture and do something with it, before running aot_stage2_compile to finish the job.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158213
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173, #158176
Two main things of note:
- Review this diff without whitespace changes
- To ensure that context managers correctly propagate to later pipeline
stages, I am using the ExitStack trick: there is an ExitStack which is
in scope for the entire pipeline, and inside of the individual
pipeline stages we push context managers onto this stack when we want
them to survive into the next pipeline stage. This is not obviously
what the best final form of the code is, but
create_aot_dispatcher_function is called from multiple locations so I
can't just inline the context managers into the call site.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158173
Approved by: https://github.com/jamesjwu, https://github.com/wconstab
ghstack dependencies: #158149, #158150
Summary: NodeSouce should not be updated after created, so that it would be better if we cache its dict and string representation for better perf.
Test Plan:
ci
Rollback Plan:
Reviewed By: yushangdi
Differential Revision: D78298501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158372
Approved by: https://github.com/yushangdi
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
x = torch.rand(N, N)
total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 = torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.
[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details
**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.
**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
Before the PR, for code like this:
```
class Example2(torch.nn.Module):
def forward(self, x, trigger, target):
return torch.cond(
trigger == 1,
lambda: x + target,
lambda: x * target,
(),
)
m = Example2()
x = torch.randn(2)
trigger = 0
target = 2
args = (x, trigger, target)
ep = torch.export.export(
m, args, dynamic_shapes=(None, Dim.DYNAMIC, Dim.DYNAMIC)
)
```
dynamo will wrap "target" (i.e. a symInt) twice, once when we speculate the first lambda and find target is a symint and decides to wrap it up, creating a new SymNodeVariable and a placeholder input to the top-level graph.
The second time happens when we speculate the second lambda. Tensors are de-duplicated by checking tracked side effects to make sure object with the same id (though different sources) is mapped to the same TensorVaraible. For symints, two things are missing:
1. it's not in the _can_lift_attrs_to_input list (the change in builder.py)
2. it's not in the tracked by runahead_side_effects, so when speculate_subgraph finishes, they're discarded (the change in side_effects.py)
Note: the auto lifting mechanism for HOPs happens at proxy level when we trace the subgraph, which is after SymNodeVariable are created (they're created when realizing the args and bind them to subgraph). At that time, builder has created two unique SymNodeVariable for the same symint so the auto lifting in hops cannot de-dup them.
Differential Revision: [D78298163](https://our.internmc.facebook.com/intern/diff/D78298163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158273
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
Summary:
Adds a unit test to verify that when 'user_managed=True' is passed to 'update_constant_buffer', the compiled AOTI model properly shares parameter storage with the eager model.
The test specifically covers the following:
1. Passes model weights to the AOTI model with 'user_managed=True''.
2. Updates the eager model weights using 'load_state_dict()', which performs in-place
3. Asserts that the compiled AOTI model reflects the updated weights, confirming shared memory behavior.
Fixes: #157474
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157496
Approved by: https://github.com/desertfire
Use ```brew install --cask miniconda``` as specified by https://formulae.brew.sh/cask/miniconda
Forward fix After: https://github.com/pytorch/pytorch/pull/156898#issuecomment-3074207175
Seeing in CI:
```
Run if [[ -n "$REINSTALL_BREW_MINICONDA" ]]; then
==> Caveats
Please run the following to setup your shell:
conda init "$(basename "${SHELL}")"
Alternatively, manually add the following to your shell init:
eval "$(conda "shell.$(basename "${SHELL}")" hook)"
==> Downloading https://repo.anaconda.com/miniconda/Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh
Already downloaded: /Users/ec2-user/Library/Caches/Homebrew/downloads/2e356e8b147647692e4da77ce4c0c14eefee65ec86f29cc7e8c21a26ac9397ca--Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh
==> Installing Cask miniconda
==> Running installer script 'Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh'
PREFIX=/opt/homebrew/Caskroom/miniconda/base
Unpacking payload ...
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
Installing base environment...
Preparing transaction: ...working... done
Executing transaction: ...working...
done
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
installation finished.
==> Linking Binary 'conda' to '/opt/homebrew/bin/conda'
🍺 miniconda was successfully installed!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158347
Approved by: https://github.com/seemethere
Summary: We have internal test failure for several aot_inductor_package tests. It looks like we're translating args like:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor/script.ld
```
To:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor//tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```
This PR changes to strings like:
```
-Wl,--script=/tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```
Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:aot_inductor_package --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158270
Approved by: https://github.com/desertfire
Summary: As above, also changes a bunch of the build files to be better
Test Plan:
internal and external CI
did run buck2 build fbcode//caffe2:torch and it succeeded
Rollback Plan:
Reviewed By: swolchok
Differential Revision: D78016591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
The general context for the upcoming stack of commits is I am attempting
to "pipeline" AOTAutograd. Instead of having function f call function g
which is the next "stage" of compilation, instead f should return with
its outputs, which are then piped to g for the next stage. This will
make it easier to implement early exit / resume pipeline without forcing
callback structure, which is good for export-style use cases. It also
reduces the size of our stack traces, which makes tools like Perfetto
happy.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158149
Approved by: https://github.com/jamesjwu
This PR disables `strict-aliasing` GCC C++ optimization flag on all AArch64 cpus for GCC versions 12 and above.
Pull Request #152825 upgraded gcc version from 11 to 13 in manywheel which caused several segmentation faults in unit tests ( not visible in CI workflows because the jammy gcc version has not been updated yet ).
We Identified the problem also exists in GCC12 hence the ` __GNUC__ >= 12`
Fixes#157626
fixes these tests failures when pytorch is built in GCC12 and above
```
test_ops.py::TestCommonCPU::test_noncontiguous_samples_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_grid_sampler_2d_cpu Fatal Python error: Segmentation fault
test_ops.py::TestMathBitsCPU::test_neg_view_nn_functional_grid_sample_cpu_float64 free(): invalid next size (fast)
test_ops.py::TestCompositeComplianceCPU::test_backward_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_nn_functional_grid_sample_cpu Fatal Python error: Segmentation fault
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158117
Approved by: https://github.com/malfet
Fixes#124435
This updates the torch.histogramdd documentation to correctly state that bins are inclusive of their left edges, not exclusive as currently written. There was a previous PR addressing this but it was closed due to inactivity. This picks that up and applies the fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158275
Approved by: https://github.com/albanD
Summary: Add flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL to force inline the kernel function when TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL=1. It's disabled by default because force inlining may increase the build time.
Differential Revision: D77915987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157949
Approved by: https://github.com/desertfire
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We would deprecate those option that overleap with `AcceleratorAllocatorConfig` in the following PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
----
# Refactor and Improve the OpenReg Module
## Background
Since PrivateUse1 has become the main path for integrating new devices with PyTorch, there have been some feature requests related to PrivateUse1 regarding interfaces, documentation, reference examples, etc., such as the following:
- https://github.com/pytorch/pytorch/issues/155864
- https://github.com/pytorch/pytorch/issues/144955
- https://github.com/pytorch/pytorch/issues/144845
Taking these requests into consideration and combining them with the position of OpenReg, which is currently used as the test backend for PrivateUse1, I'm planning to make the following optimizations:
- Optimize the implementation of OpenReg to make it align with the standard specifications for real backend (C++) access, serving as a reference for new device integration code.
- Add comprehensive documentation to the [developer notes](https://docs.pytorch.org/docs/main/notes.html) to guide new accelerator integration, functioning as a reference manual.
## Design Principles:
- Minimization Principle: Keep the code small and clear; only implement the minimum set of code required for verification and as an integration reference.
- Authenticity Principle: Integrate OpenReg in the same way that real accelerators access PyTorch.
## More Infos:
Pleaes refer to [this](6b8020f1ab/test/cpp_extensions/open_registration_extension/torch_openreg/README.md) for more information about `OpenReg`.
## Current Progress:
- Refer to the implementation of [torch_xla](https://github.com/pytorch/xla) to refactor all of OpenReg's code, making it easier to understand.
- Ensure all tests in [test/test_openreg.py](https://github.com/FFFrog/pytorch/blob/openreg/test/test_openreg.py) pass after refactoring.
## Next Steps:
- Add more features to cover all integration points.
- Gradually add user guides and documentation to the [developer notes](https://docs.pytorch.org/docs/main/notes.html).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158090
Approved by: https://github.com/seemethere, https://github.com/albanD
The `test_triton_wait_until` test was hanging due to an NCCL synchronization issue stemming from mismatched NVSHMEM operations. Specifically, the flag variable was updated using `nvshmemx_signal_op` (a signaling operation), but waited on with `nvshmem_wait_until` (intended for put/get updates). Per NVSHMEM documentation (see documentation reference section below), signal-updated variables require `nvshmem_signal_wait_until` for proper completion guarantees, so the mismatch caused a deadlock and NCCL hang.
**Fix:**
- A simple fix was to replace the flag update with a regular `nvshmem_putmem_block` (via `put_kernel`) to match `nvshmem_wait_until`. I also added a fence (`nvshmem_fence`) between data and flag puts on the sender (Rank 1) for ordered delivery.
- In a follow-up PR I will add a kernel/test to demonstrate usage of `nvshmemx_signal_op`
**Testing:**
- I ran `python test/distributed/test_nvshmem_triton.py` and `python test/distributed/test_nvshmem_triton.py -k test_triton_wait_until`
- I also verified with debug prints (Sender completes puts/fence before receiver's wait returns, and assertions confirm correct state). Multiple runs show no hangs or failures.
**Documentation Referenced:**
- [NVSHMEM Point-To-Point Synchronization](https://docs.nvidia.com/nvshmem/api/gen/api/sync.html) explicitly states: *"the sig_addr object at the calling PE is expected only to be updated as a signal, through the signaling operations available in Section NVSHMEM_PUT_SIGNAL and Section NVSHMEM_PUT_SIGNAL_NBI"*
- [NVIDIA's Official Ring Broadcast Example](https://docs.nvidia.com/nvshmem/api/examples.html) demonstrates the correct pairing: `nvshmemx_signal_op` with `nvshmem_signal_wait_until` (not `nvshmem_wait_until`)
- [NVSHMEM Signaling Operations](https://docs.nvidia.com/nvshmem/api/gen/api/signal.html) documents that signal operations work on special "signal data objects" with specific atomicity guarantees distinct from regular RMA operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158167
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
Beginning of process for 3.14 bringup.
State of things from this PR:
- Nothing too scary looking from the Dynamo CPython side, nothing we heavily rely on seems to be missing @williamwen42
- The existing check that makes torch.compile() nicely fail is working as expected. So all these empty functions shouldn't cause any weirdness.
- The `__module__` update changes look suspicious, we should investigate what is the reason and impact of that, in particular for our public API checking @jbschlosser
- Leaving the weakref.py thread safety change as a follow up to keep this a bit simpler. I vendored the whole struct in the meantime FYI @ezyang
EDIT: The `__module__` change is even more cursed than I though due to changes to Union and Optional type where the `__module__` field cannot be changed anymore. See https://github.com/python/cpython/issues/132139 for details.
For now, I'm just skipping the `__module__` setting for 3.14 which will trip the public API checks. Will revisit once I have a final answer on the cpython issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158184
Approved by: https://github.com/msaroufim
**Summary**
`split_strategy` used `TupleStrategy` as return type because DTensor sharding
propagation's `OpStrategy` support on multi-returns only applies to `Tuple`.
However, `TupleStrategy`'s not a good fit for `split` op. `TupleStrategy` was
initially introduced to handle the sharding strategy of `foreach_*` ops where
the input args can be split into independent subsets regarding sharding decisions,
so are the outputs.
To address the misuse, this PR adds `OpStrategy` propagation for `List[Tensor]`
(note that this support is INCOMPLETE because it only checks the return type
to be `torch.ListType`). Nevertheless, the logic for `Tuple` returns also made
similar assumption so I think it's fine to unblock in such a way.
Besides adding `OpStrategy` support to ops having `List[Tensor]` return type,
this PR also changes `split_strategy`'s return from `TupleStrategy` to `OpStrategy`.
**Test**
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158051
Approved by: https://github.com/wconstab, https://github.com/zpcore
local_tensor input to grouped_mm has a stride requirement.
(see `_meta_grouped_mm_common` in meta_registrations.py or
`check_valid_strides_and_return_transposed` in native/cuda/Blas.cpp)
Don't allow sharding a tensor if its shape would result in an
incompatible local_tensor stride.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158245
Approved by: https://github.com/zpcore, https://github.com/XilunWu
This PR allows for symints in `gen_slice_strategy` which is the strategy for `aten.slice.Tensor`. Previously, using dynamic shapes with slicing would result in
```
File ".../pytorch/torch/distributed/tensor/_ops/_tensor_ops.py", line 348, in gen_slice_strategy
assert isinstance(end, int)
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function getitem>(*(DTensor(local_tensor=FakeTensor(..., device='cuda:0', size=(s3, 2)), device_mesh=DeviceMesh('cuda', [0, 1]), placements=(Shard(dim=0),)), slice(None, (s77//2), None)), **{}): got AssertionError()
```
Questions before merge:
1. `dim` is still asserted to be int. Is this fine, or is this potentially dynamic as well?
2. I'm using argtype ignore for `normalize_dim`. Should I instead change types for `normalize_dim` and further dependency to be `IntLike` as well?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157953
Approved by: https://github.com/wconstab
When loading a package and calling package.install(backends), we create a new frame and compile id for each package load, so that tlparse and chromium events still show compile times on warm start.
There is an argument for not doing this in AOT precompile, as no "compile" occurs. So for now, we put it in `package.install`, which hopefully won't be a thing for AOT precompile.
## Recompiles
Recompiles get saved to the same frame and code entry, so on warm start, each recompile will get collapsed into the same entry. Therefore, dynamo compiles that have recompiles on cold start (0/0, 0/1, 0/2, etc) will all get collapsed into a single compile id (0/0), as warm start will load all of the entries properly.
## Graph breaks
Graph breaks get their own compile id, and therefore their own code entry. These are replicated on warm start, so if cold start you had 4 different graphs (and therefore 4 compile ids), you'll have 4 compile ids on warm start as well.
## Test plan
Added a frame counter check to existing unit tests for automatic dynamic, showing that old and new frame counter between old and new load is the same.
This is the chromium event for test_automatic_dynamo_graph_breaks_device_cuda:
```
python test/dynamo/test_package.py -k test_automatic_dynamo_graph_breaks_device_cuda
```
<img width="2216" height="508" alt="image" src="https://github.com/user-attachments/assets/f604ed33-5c31-464b-9320-d67b2e6f57a1" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158028
Approved by: https://github.com/oulgen
This is intended to make it easier to have backend specific "hints" that can be provided by the user to hint about certain options.
```py
import torch.distributed._dist2 as dist2
pg = dist2.new_group(backend="my_custom_backend", device=..., timeout=..., foo=1234, bar="1234")
pg.allreduce(...)
```
Test plan:
```
pytest test/distributed/test_dist2.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158147
Approved by: https://github.com/fduwjj
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
x = torch.rand(N, N)
total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 = torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.
[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details
**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.
**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
During enabling inductor CI in Windows, `test_torchinductor_opinfo.py` cost too many time (about 12 hours). This UT was seriously exceeding the time limit of CI. The compiler building was slower 4x in Windows than Linux after analyzing.
Thus, we decide to skip the UT temporary and @xuhancn will keep searching the solution of compiler building in Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158225
Approved by: https://github.com/jansel
Co-authored-by: Xu Han <xu.han@outlook.com>
Fixes#156707
Detect if all values along the softmax axis are infs and overwrite the outputs for those computations with zeros before the final matmul. The behavior should be aligned with the CPU implementation.
These types of cases where all values along the dimension in the attention mask are false leading to the undefined outputs in softmax occur with left padded batches for generation in HF transformers according to the original issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157727
Approved by: https://github.com/malfet
Summary:
In many investigations relating to invalid feature values, the three-argument form of `repeat_interleave` currently prints the following message if there is an inconsistency between `sum(repeats)` and `output_size`:
```
Assertion `result_size == cumsum_ptr[size - 1]` failed.
```
This is a bit hard for model authors to understand so I made the error slightly more comprehensible. After the fix the stdout contains the actual values of these parameters: https://fburl.com/mlhub/cfyyhh3q
```
Invalid input! In `repeat_interleave`, the `output_size` argument (949487) must be the same as the sum of the elements in the `repeats` tensor (949687).
```
In many cases, this is potentially useful information since we know for example that the difference between the two values above (949687-949487=200) happens to be the lengths of one of the features.
## What are my concerns with this change?
1. Outputs from `__assert_fail` go to `stderr` whereas `printf` writes to `stdout`. This is not the usual debugging flow where all logs can be found in `stderr`. I could not find a way to redirect `printf` to stderr or `__assert_fail` to stdout
2. Two checks happen instead of one in the error path. I wanted to preserve the semantics of what happens inside `__assert_fail`.
3. I have not seen this pattern in other PyTorch kernels but `repeat_interleave` with three arguments seems special in other ways too.
Test Plan:
* Built an ephemeral package with my changes:
https://www.internalfb.com/intern/servicelab/build/736441058/
* Verified that a job with these changes indeed prints out the expected message to stdout: https://fburl.com/mlhub/jgbqk8eg
* I will export to GH and run CI/CD tests.
Rollback Plan:
steps:
- manual.note:
content: >-
Just reverting this diff should be sufficient. Since this change is in
CUDA kernels, I do not believe there is a way to change the error
message via a JK.
Reviewed By: mradmila
Differential Revision: D77904753
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157996
Approved by: https://github.com/ngimel, https://github.com/eqy
This PR adds a pass to sanitize_gm_for_cache which normalizes all placeholder names across input dynamo graphs to AOTAutogradCache. This is safe because nothing underneath AOTAutograd uses the node names on the
original dynamo graph: AOTAutograd re-traces with its own nodes, and guards are
in terms of original sources rather than placeholder names.
Note that the dynamo output graphs traced by tlparse will not show this change because it's done before this sanitization step. The aot autograd outputs also will not change because AOTAutograd's own traced graphs don't use the original placeholders of the dynamo graph. Thus, this change is essentially a no-op from everyone's perspective except for cache key checks.
Fixes#157792
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157916
Approved by: https://github.com/zou3519
Before, if NVSHMEM is installed at *BOTH* system location (e.g. `/usr/local`) and conda location (e.g. `/path/to/conda/lib/python3.10/site-packages/nvidia/nvshmem`, there can be a mismatch in where host lib and device lib are found:
```
-- NVSHMEM_HOME set to: ''
-- NVSHMEM wheel installed at: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB: '/usr/local/lib/libnvshmem_host.so'
-- NVSHMEM_DEVICE_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```
The reason is that CMake prioritize name search over dir search. In the script below, CMake will search all locations for `libnvshmem_host.so` first, before it searches for `.so.3`.
```
find_library(NVSHMEM_HOST_LIB
# In pip install case, the lib suffix is `.so.3` instead of `.so`
NAMES nvshmem_host nvshmem_host.so.3
HINTS $ENV{NVSHMEM_HOME} ${NVSHMEM_PY_DIR}
PATH_SUFFIXES lib lib64 cuda/lib cuda/lib64 lib/x64)
```
This PR adds the `NAMES_PER_DIR` flag, according to CMake's doc:
> The NAMES_PER_DIR option tells this command to consider one directory at a time and search for all names in it.
After this PR:
```
-- NVSHMEM_HOME set to: ''
-- NVSHMEM wheel installed at: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem'
-- NVSHMEM_HOST_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so.3'
-- NVSHMEM_DEVICE_LIB: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/lib/libnvshmem_device.a'
-- NVSHMEM_INCLUDE_DIR: '.conda/envs/pytorch-3.10/lib/python3.10/site-packages/nvidia/nvshmem/include'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157836
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #157513, #157695
Re-raising of #129959 as that was closed.
Warning message before:
```
/home/admin/.local/share/hatch/env/virtual/toms-project-1/Qv9k_r_5/dev/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:120: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
```
Warning message after:
```
/path/to/my/code:91: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
```
Helps the user find where the issue stems from in their code. What do you think?
(Looks like "skip_file_prefixes" is not available until Python 3.12 minimum...)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155112
Approved by: https://github.com/Skylion007, https://github.com/cyyever
That fixes `index_put(..., accumulate=True)` for all dtypes
int64 operation is not really atomic, but eventually consistent from the `index_put_accumulate` kernel point of view: i.e. by the end of the operation results in the global memory are indeed accumulation of the operands at given indices
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158179
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #158064, #158178
Fixes#156012
This is a temporary solution that makes context parallelism working before logsumexp behavior changes landed in AOTriton.
After discussion we are not going to release AOTriton 0.10.1 to fix this due to
* Even if the interface is not changed, changing the behavior of returned logsumexp tensor should still be considered as an ABI break. Such changes do not fall into the "ABI compatible" category and should be postponed to next release.
* AOTriton 0.11 is scheduled to be released before end of July, which is less than five weeks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156903
Approved by: https://github.com/jeffdaily, https://github.com/XilunWu
Fixes#157720
### What's in this PR?
This PR improves the error handling in `torch.compile` for `ndarray.astype('O')` (or `object`). It now explicitly raises a `torch._dynamo.exc.Unsupported` exception with a clear explanation, instead of failing with a less intuitive error during fake tensor propagation.
This is achieved by adding a check within `NumpyNdarrayVariable.call_method` for this specific `astype` pattern.
A new test, `test_ndarray_astype_object_graph_break`, is also added to `test/test_numpy_interop.py` to verify this new behavior.
### Background
Previously, attempting to `torch.compile` a function containing `ndarray.astype('O')` would result in a `TorchRuntimeError` wrapping a `TypeError: data type 'O' not understood`. This error message, originating deep within the tensor mechanism, was not very user-friendly and didn't clearly state *why* it was unsupported.
This change makes the failure more explicit and provides a better user experience by giving a direct, actionable error message.
**Old Behavior (Error Traceback):**
```
torch.dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: ... got TypeError("data type 'O' not understood")
```
**New Behavior (Error Message):**
```
torch.dynamo.exc.Unsupported: ndarray.astype(object)
Explanation: ndarray.astype('O') or ndarray.astype(object) is not supported by torch.compile, as there is no equivalent to object type in torch.
```
### Testing
A new test has been added to `test_numpy_interop.py` which decorates a function containing `ndarray.astype("O")` with `torch.compile`. The test asserts that a `torch._dynamo.exc.Unsupported` exception is raised, confirming the new error handling works as expected.
The test can be run with:
`pytest test/test_numpy_interop.py -k test_ndarray_astype_object_graph_break`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157810
Approved by: https://github.com/jansel
Change the default value of min_chunk_size from 4096 to 512 to allow more for loops to be parallelized.
I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
Summary: This diff makes changes to the USDT added by RihamSelim in D44636587. The "operator_start" USDT passes in the memory addresses of operator arguments and the argument types. This is so we can record argument values and types in the Strobelight GPUEvent Profiler. The previous diff records the ATEN operator, and this diff lays the groundwork to record ATEN op arguments.
Test Plan: I ensured this code builds by running the example in this diff, and testing profiler changes in this diff.
Reviewed By: RihamSelim
Differential Revision: D75606556
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155185
Approved by: https://github.com/malfet
From the perivous PR: https://github.com/pytorch/pytorch/pull/157608 , I added `format_consts_to_cpp` to build consts bytes.
But it still raise clang ASAN `stack alloction`, when build large size consts.
This PR:
1. add `test_aot_inductor_consts_cpp_build` to stack allocation skip list.
2. add ATTRIBUTE_NO_SANITIZE_ADDRESS to skip ASAN check, because consts array is locate in global area.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158175
Approved by: https://github.com/jansel
Now instead of erroring out on `empty_cache` call during graph capture or under mempool context, we will just silently do nothing. This used to be the behavior for mempools, cudagraphs used to error out, but it's fine to just ignore the call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158152
Approved by: https://github.com/zou3519, https://github.com/eqy
Written with Claude Code.
Fixes https://github.com/pytorch/pytorch/issues/157569
Fixes https://github.com/pytorch/pytorch/issues/158134
NumPy and PyTorch handle advanced indexing differently when advanced indices are separated by slices (e.g., arr[:, [0], :, 0]). PyTorch uses "outer" indexing placing result dimensions in original positions, while NumPy uses "vectorized"
indexing moving advanced index dimensions to the front.
This adds _numpy_style_advanced_indexing() to detect separated advanced indices and transpose results to match NumPy's dimension ordering, ensuring torch._numpy maintains compatibility with NumPy's indexing behavior.
Fixes cases like:
- arr[:, [0], :, 0] now returns shape (1, 5, 7) instead of (5, 1, 7)
- arr[:, [0, 1], :, 0] now returns shape (2, 5, 7) instead of (5, 2, 7)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157676
Approved by: https://github.com/manuelcandales
Co-authored-by: Claude <noreply@anthropic.com>
Fixes#157973
`THPUtils_unpackNumberAsBool` now recognises `numpy.bool_ scalars` explicitly (using `torch::utils::is_numpy_bool`).
If the object is a NumPy boolean, we retrieve its truth value via `PyObject_IsTrue` and return it, avoiding the previous failing path that attempted to treat it as an integer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158036
Approved by: https://github.com/jansel
Summary: Previously was saving sharded tensors to same directory as full tensors. But am realizing this doesn't make sense because on load(), you would be loading for a directory which contains both, with no way to distinguish them, so they should be in separate folders.
Test Plan:
ensure existing tests pass
Rollback Plan:
Differential Revision: D78108144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158069
Approved by: https://github.com/teja-rao
# Context
In D75803582, we migrated relu/relu_ from out-of-tree to pytorch in-tree. With that, we also changed it to use the ATen op-layer logic:
https://www.internalfb.com/code/fbsource/[04ec3fcd0b09b601ae26a785e595ab960a6ba684]/fbcode/caffe2/aten/src/ATen/native/Activation.cpp?lines=512-520
To summarize:
**The behavior before D75803582:**
The Relu operator calls this code(https://fburl.com/code/pezspv40) and launches Relu kernel.
**The behavior after D75803582:**
The Relu operator uses the ATen logic, which delegates to the clamp_min operator, and no longer launch Relu kernel.
-----------------
But according to my discussion with @vvk, we should keep using the Relu kernel, instead of adopting ATen logic that delegates to clamp_min, because MTIA's Relu kernel has special optimization for MTIA device.
# This diff
Change relu / relu_ to launch relu kernel, which is same as the original behavior before D75803582.
Note: this doesn't mean to revert D75803582, because we still want to move relu/relu_ to in-tree.
Differential Revision: [D78109262](https://our.internmc.facebook.com/intern/diff/D78109262/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158101
Approved by: https://github.com/albanD
Most added ops are backwards ops, which have not been well-tested previously (thus why they were missed). Necessary ops were identified by manual examination of torch/_meta_registrations.py return values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158073
Approved by: https://github.com/desertfire
The idea of this PR is that, sometimes we are filtering ops based not based on the node specific information. For example, we always filter out simt ops. So I want to group them together into a global filtering function.
This can help shrink the config space as well. 20s -> 6s for instantiation 3332.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157866
Approved by: https://github.com/ColinPeppler
This updates ProcessGroupGloo to support per operation timeouts. Previously the timeouts were ignored even if they were set.
* This checks if the timeout is `kUnsetTimeout` and conditionally uses the provided timeout or the default timeout from the context.
* This exposes `set_timeout` as a standard method on ProcessGroup/Backend so we can test the global timeout.
Test plan:
```
pytest test/distributed/test_c10d_gloo.py -v -k allreduce_timeout
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158128
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
Move `MetalShaderLibrary::bind_tensors` private method to OperatorUtils.h and extract `iter_tensor_offset` method, that returns an offset from the start of the storage associated with given tensor inside the iterator
Migrated `index`, `index_put[_accumulate][_serial]` to the new paradigm that does not require additional tensor for indices nor special handling for 32 vs 64-bit offset, which resulted in almost 2x perf gain for 2000x2000 tensor, see results below before
```
[------------------------------------------------------------ -----------------------------------------------------------]
| 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
__getitem__ (torch.int8, torch.int64) | 383.5 | 379.8 | 470.9 | 1232.9 | 4410.3
__getitem__ (torch.float16, torch.int64) | 379.6 | 354.5 | 533.2 | 1290.3 | 4442.2
__getitem__ (torch.float32, torch.int64) | 360.8 | 338.6 | 478.6 | 1348.9 | 4870.4
Times are in microseconds (us).
```
and after
```
[------------------------------------------------------------ -----------------------------------------------------------]
| 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
__getitem__ (torch.int8, torch.int64) | 349.8 | 330.5 | 432.6 | 764.5 | 1961.2
__getitem__ (torch.float16, torch.int64) | 342.5 | 330.7 | 434.7 | 741.0 | 1969.4
__getitem__ (torch.float32, torch.int64) | 332.2 | 326.1 | 445.4 | 751.3 | 1972.6
Times are in microseconds (us).
```
While migrating also fixed index_put_accumulate for boolean types, by using compare_and_exchange trick over uint
Fixes https://github.com/pytorch/pytorch/issues/153560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158064
Approved by: https://github.com/dcci
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
x = torch.rand(N, N)
total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 = torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.
[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details
**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.
**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
This PR fixes #157354
It fixes the issue in 'cmake/public/cuda.cmake' where a diagnostic message incorrectly showed an empty CUDA version when 'FindCUDA' and header-reported versions differed.
The problem was caused by this line:
set(${cuda_version_from_findcuda} ${CUDA_VERSION_STRING})
This incorrectly used the value of cuda_version_from_findcuda as a variable name. As a result the version string wasn't assigned and the error message omitted the version. This has been corrected to:
set(cuda_version_from_findcuda ${CUDA_VERSION_STRING})
Now the diagnostic message properly displays the CUDA version reported by FindCUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157370
Approved by: https://github.com/soulitzer
Summary:
NumPy based tensor rebuilding from serialization has been deprecated by other backends (eg. [XLA](https://github.com/pytorch/pytorch/pull/137444)). The new flow has CPU storage being constructed with data from the file and then moved to the target backend device.
Furthermore, relying on numpy for serialization will fail loudly when torch.load flips weights_only.
Reviewed By: andyanwang
Differential Revision: D77843238
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157884
Approved by: https://github.com/albanD
Summary: Following implementation of the updated ATen Backend for mtia, and diffs enabling in tree view ops (D75266206, D75385411), we can remove custom logic from reducer to handle MTIA view operations.
Test Plan:
CI
Rollback Plan:
Reviewed By: egienvalue
Differential Revision: D77843212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157882
Approved by: https://github.com/albanD, https://github.com/andyanwang
**Summary**
To enable use case where the input DTensor to `split` op has `Partial()` placement,
this PR treats `Partial()` in the same way with `Replicate()`. That means, `split` op
only unshards the `Shard(dim=x)` if `x == split_dim` and keep other placement
untouched.
**Test**
Added a new test because `test_dtensor_ops` doesn't test `Partial()` placement.
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157991
Approved by: https://github.com/zpcore
Most of the work had already been done by @jeffdaily in #154680, but there was one remaining check that needed to be modified in order for `torch._scaled_mm` to use cuBLAS over CUTLASS when available.
I tested this change by rebuilding PyTorch locally with CUDA 12.9 and ran `torch._scaled_mm` under the profiler, and observed that the kernel being launched is called `nvjet_qqtst_128x128_128x6_1x1_h_bz_coopA_algo2_ovscale_TNT` (where `ovscale` stands for "outer vector scaling", I believe, which is how cuBLAS calls this scaling mode).
I then benchmarked the new kernels against the old CUTLASS ones on a standard 700W H100 GPU. I used the same approach as in #134781, and obtained these speed-ups:


We see that the two kernels perform very closely (I'm surprised, I would have expected cuBLAS to outperform CUTLASS across the board), with some thin/skewed shapes becoming worse but some very large shapes becoming better.
I guess the questions are whether we consider this a net-zero change (given that there's improvements _and_ degradations), and how large we consider the burden of maintaining our own CUTLASS kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157905
Approved by: https://github.com/eqy, https://github.com/Skylion007, https://github.com/drisspg
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We would deprecate those option that overleap with `AcceleratorAllocatorConfig` in the following PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
**Summary**
Enable fp8 qconv on CPU. It's part of the plan to enable fp8 static quantization on CPU. This PR only adds FP8 support of the existing int8 qconv op. It does not add a new op nor does it affect frontend or quantization flow. The schema of the qconv op is not changed either.
So, the FP8 qconv shares the same op as INT8 qconv and the difference is that src/wei dtype is fp8 instead of int8. The output dtype can be fp8/float32/bfloat16. The implementation uses the oneDNN library.
Note:
OneDNN does not support quantized fp8 convolution until v3.9 but the version used in PyTorch is v3.7.2. So, the op goes to the reference kernel for now. And we have also update the oneDNN path so that it's compatible with the fp8 dtype. Once oneDNN is upgraded to v3.9 or newer, minimum changes are needed to enable the oneDNN path. And we have ensured that the behavior of the reference kernel is the same as the new oneDNN's implementation.
- oneDNN version < 3.9 (now)
- Always go to the reference kernel
- oneDNN version >= 3.9 (future)
- Go to reference kernel on old platforms (without AMX)
- Use oneDNN on new platforms (with AMX)
**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k "qconv and fp8"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157076
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
Refactor to allow TMA descriptors to be used in general codegen. TMA descriptors can only be generated if the conditions listed in the triton documentation for [make_tensor_descriptor](https://triton-lang.org/main/python-api/generated/triton.language.make_tensor_descriptor.html) are met.
Some implementation details:
- The `TMACompatibilityChecker` class holds and checks the conditions required for a load / store operation to be represented by a tma descriptor load / store
- The current TMA API requires that the innermost block size loads atleast 16 bytes of data. e.g. if the block shape is [YBLOCK, XBLOCK] and the tensor dtype is float32, this requires that XBLOCK >= 4. It is therefore required that the triton heuristics are aware of the minimum block sizes for the IO operations in the kernel. The minimum block sizes are determined in the `TMACompatibilityChecker` class and are passed to the triton heuristics when the block sizes are not static. The heuristic config options are then filtered to ensure that the minimum block size restriction is met.
Testing:
- Refactored test_torchinductor_strided_blocks.py to also test the `use_tensor_descriptor` option.
This requires an upgrade to Triton version 3.4.0: https://github.com/pytorch/pytorch/issues/154206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157906
Approved by: https://github.com/jansel
Summary: In Pytorch 2.5 we added source code attribution to PT2 traces. Each Torch-Compiled Region will now have its frame id and frame compile id associated with it. Update the image in the doc and add a description of this in the doc itself
Test Plan:
{F1980179183}
Rollback Plan:
Differential Revision: D78118228
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158066
Approved by: https://github.com/aaronenyeshi
Users may want compile-related but customized logging info to dynamo_compile. One example is to logging the current training iteration index when recompilation happens. In general, current training iteration index is not available to compiler, since the same compiled function may be called multiple times in the same training iteration. The user could provide the training iteration index in a user hook where torch.compile logs it when recompilation happens.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157961
Approved by: https://github.com/masnesral
Fixes https://github.com/pytorch/pytorch/issues/154312
Fix logdet returning finite values for singular matrices on CUDA (https://github.com/pytorch/pytorch/issues/154312https://github.com/pytorch/pytorch/issues/154312)
PyTorch's logdet function returns mathematically incorrect finite values for
singular matrices on CUDA devices instead of the expected -inf. This occurs
because cuSOLVER and LAPACK produce tiny non-zero diagonal elements (~1e-16)
instead of exact zeros for singular matrices.
**Problem:**
Issue https://github.com/pytorch/pytorch/issues/154312 matrix returns finite values instead of -inf for singular matrices.
**Solution:**
Implemented NumPy-style two-tier singularity detection with GPU sync point removal:
1. **Primary detection**: Use LAPACK's built-in singularity detection via info parameter
2. **Backup detection**: Apply threshold-based detection for numerical edge cases
3. **Zero GPU sync points**: Eliminated all .item(), std::get<0>(), and scalar extractions
4. **Pure tensor operations**: All computations use tensor operations throughout
**Performance Impact:**
Based on comprehensive benchmarking across matrix sizes and data types:
- **Overall Impact**: 0.85× average speedup (+18.0% overhead)
- **CPU Performance**: 0.84× average speedup (+18.8% overhead)
- **CUDA Performance**: 0.85× average speedup (+17.3% overhead)
**Performance Trade-offs:**
- **Small matrices (16×16, 64×64)**: Higher overhead due to tensor operation setup costs
- **Large matrices (512×512, 2048×2048)**: Near-zero overhead, with some cases showing slight improvements
- **GPU sync elimination**: Removes expensive GPU→CPU synchronization bottlenecks
**Results:**
- ✅ All singular matrices now correctly return -inf on both CPU and CUDA
- ✅ Original issue https://github.com/pytorch/pytorch/issues/154312 matrix now works correctly
- ✅ Results match NumPy's slogdet behavior exactly
- ✅ Zero GPU synchronization points for improved performance
- ✅ Comprehensive edge case testing added
**Verification:**
Before: torch.linalg.slogdet(singular_matrix) → finite values (incorrect)
After: torch.linalg.slogdet(singular_matrix) → (sign=0, logabsdet=-inf) ✅
The implementation uses pure tensor operations to eliminate GPU sync points while
maintaining robust singularity detection through a two-tier approach.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157910
Approved by: https://github.com/lezcano, https://github.com/IvanYashchuk, https://github.com/albanD
Co-authored-by: Claude <noreply@anthropic.com>
Differential Revision: D78089705
Previously to support overriding autotune configs for post fusion kernels in Inductor with a lookup table, we only keyed on the source code. However, the same source code could have multiple optimal configs, due to the input sizes. With this, we have many collisions in our lookup table, leading to subpar configs. A way around this is to add the size_hints to the lookup key as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158026
Approved by: https://github.com/jansel
Summary:
Replaced ASSERT_FLOAT_EQ which defaults to fixed kMaxUlps ( = 4-ULP , See gtest-internal.h) with ASSERT_NEAR which lets us set epsilon to 1e-3, (approximately 3 ULPs). This allows for slightly stricter and tunable comparison.
Test Plan:
**Before Fix**
✗ Fail:
qnnpack:pytorch_qnnpack_testApple - FULLY_CONNECTED_SPARSE_OP_8x1/unit_batch_dynamic_prepacked (0.0s)
'Expected equality of these values:
output_dynamic[i * outputChannels() + c]
Which is: 9.9160004
accumulators_float[i * outputChannels() + c]
Which is: 9.9159956
at 0, 17: reference = 9.9159955978393555, optimized = 9.9160003662109375
------------------------------
**After Fix**
Everything passes
Rollback Plan:
Differential Revision: D77911682
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157861
Approved by: https://github.com/kimishpatel, https://github.com/lucylq, https://github.com/malfet
## Overview
This PR adds a kwarg to the `table()` method of the profiler allowing users to specify a time unit to be used for all results in the profiling table. The available options are: `s`, `ms` and `us`. If an invalid unit or no unit is provided, then a time unit is selected based on the size of the value (current default behaviour).
## Testing
A unit test has been added to verify this works correctly.
## Documentation
I couldn't find any documentation specific to the `table()` function beyond doc strings which have been updated.
## Example Output
```
import torch
from torch.profiler import profile
with profile() as prof:
res = torch.mm(torch.rand(1024, 1024), torch.rand(1024, 1024))
print(prof.key_averages().table(time_unit="s"))
print(prof.key_averages().table(time_unit="ms"))
print(prof.key_averages().table(time_unit="us"))
print(prof.key_averages().table())
```
```
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::rand 0.04% 0.000s 10.36% 0.014s 0.007s 2
aten::empty 0.04% 0.000s 0.04% 0.000s 0.000s 2
aten::uniform_ 10.27% 0.014s 10.27% 0.014s 0.007s 2
aten::mm 89.64% 0.119s 89.64% 0.119s 0.119s 1
aten::resolve_conj 0.00% 0.000s 0.00% 0.000s 0.000s 3
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 0.133s
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::rand 0.04% 0.055ms 10.36% 13.735ms 6.868ms 2
aten::empty 0.04% 0.054ms 0.04% 0.054ms 0.027ms 2
aten::uniform_ 10.27% 13.626ms 10.27% 13.626ms 6.813ms 2
aten::mm 89.64% 118.892ms 89.64% 118.896ms 118.896ms 1
aten::resolve_conj 0.00% 0.004ms 0.00% 0.004ms 0.001ms 3
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 132.631ms
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::rand 0.04% 55.495us 10.36% 13735.202us 6867.601us 2
aten::empty 0.04% 54.121us 0.04% 54.121us 27.061us 2
aten::uniform_ 10.27% 13625.586us 10.27% 13625.586us 6812.793us 2
aten::mm 89.64% 118892.284us 89.64% 118895.981us 118895.981us 1
aten::resolve_conj 0.00% 3.697us 0.00% 3.697us 1.232us 3
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 132631.183us
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::rand 0.04% 55.495us 10.36% 13.735ms 6.868ms 2
aten::empty 0.04% 54.121us 0.04% 54.121us 27.061us 2
aten::uniform_ 10.27% 13.626ms 10.27% 13.626ms 6.813ms 2
aten::mm 89.64% 118.892ms 89.64% 118.896ms 118.896ms 1
aten::resolve_conj 0.00% 3.697us 0.00% 3.697us 1.232us 3
---------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 132.631ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157913
Approved by: https://github.com/sraikund16
This adds new context manager based PG management to dist2. This allows for managing the active process group much in the same way as a stream
```py
with dist2.process_group(pg):
dist2.current_process_group().allreduce(...).wait()
```
matches
```py
with torch.cuda.stream(stream):
torch.cuda.current_stream().synchronize()
```
Test plan:
```
pytest test/distributed/test_dist2.py -k context
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157988
Approved by: https://github.com/fduwjj
DCE was incorrectly eliminating unused random operations like torch.rand() that have global RNG side effects, causing inconsistent results between eager and compiled execution modes.
**Root cause**: Python random functions (torch.rand, torch.randn, etc.) don't have the _nondeterministic_seeded attribute, so node.is_impure() returns False, allowing DCE to eliminate them despite advancing global RNG state.
**Solution**: Enhanced is_impure() in torch/fx/node.py to recognize Python random functions and mark them as impure when they use global RNG, regardless of the impure_random parameter setting. This ensures consistency between eager and compiled execution even when config.fallback_random=False.
**Key features**:
- Handles comprehensive list of random functions: rand, randn, randint, randperm, rand_like, randn_like, randint_like, normal, poisson, bernoulli, multinomial
- Generator optimization: Only marks as impure when using global RNG (no generator or generator=None). Operations with explicit generators don't affect global state and can be optimized.
- Works with both impure_random=True and impure_random=False cases
- Cleaner architecture: addresses root cause rather than working around it
**Tests**: Enhanced test_impure_random to verify both FX tracing and AOT compilation codepaths, ensuring random operations are preserved and eager/compiled execution consistency is maintained.
🤖 Generated with [Claude Code](https://claude.ai/code)
Fixes https://github.com/pytorch/pytorch/issues/151524
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157981
Approved by: https://github.com/mlazos
Co-authored-by: Claude <noreply@anthropic.com>
Summary:
We are testing enabling back autovectorization in some codepaths.
These resulted in crashes when compiling using clang17, we are now relying on clang19.
Test Plan:
buck2 build //caffe2/caffe2/fb/transforms:sigrid_interface
We are going to deploy it on ads workloads
Rollback Plan:
Differential Revision: D77448445
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157984
Approved by: https://github.com/Skylion007
Design doc: https://docs.google.com/document/d/1ncV7RpJ8xDwy8-_aCBfvZmpTTL824C-aoNPBLLVkOHM/edit?tab=t.0 (internal)
- Add codegen for static linkage
- refactor test code for test_compile_after_package tests
For now, the following options must be used together with `"aot_inductor.compile_standalone": True`.
"aot_inductor.package_cpp_only": True,
Will change `"aot_inductor.package_cpp_only"` to be automatically set to True in followup PR.
```
python test/inductor/test_aot_inductor_package.py -k test_compile_after_package
python test/inductor/test_aot_inductor_package.py -k test_run_static_linkage_model
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157129
Approved by: https://github.com/desertfire
We assumed that the output in an FX graph would always just be a
list[Tensor], even in the single tensor return case.
It is possible for the output to be a single Tensor. This can happen
by calling torch.fx.split_module on the module.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157803
Approved by: https://github.com/oulgen
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The another name `AllocatorConfig` is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.
# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the env variable config key-value pair
## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format: `key:value`. Use square brackets `[]` for list values Example: `key1:123, key2:[val1,val2]`
## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
Introduces support for a new `OVERRIDEABLE` backend in the SDPA module, improves backend selection logic, and adds corresponding tests. In addition, a fallback mechanism was added when a specific backend is unavailable, enhancing user configurability.
### Backend Support and Selection Enhancements:
* Added `at::SDPBackend::overrideable` to the list of available SDPA backends in the `Context` class (`aten/src/ATen/Context.h`).
* Updated the backend selection logic in `select_sdp_backend_xpu` to include the `OVERRIDEABLE` backend and added a fallback mechanism for unsupported `FLASH_ATTENTION` on XPU.
* Adjusted error messaging in `_fused_sdp_choice_xpu` to reflect the inclusion of the `OVERRIDEABLE` backend. (`aten/src/ATen/native/mkldnn/xpu/Attention.cpp`)
### Test Additions for Backend Fallback and Selection:
* Added new unit tests to validate fallback behavior for `FLASH_ATTENTION` to `OVERRIDEABLE` and to verify correct backend selection when `MATH` is enabled. (`test/test_transformers.py`,)
### Codebase Updates for Backend Integration:
* Introduced `OVERRIDEABLE` as a new member of the `_SDPBackend` enum. (`torch/_C/__init__.pyi.in`)
* Extended `_backend_names` and updated related methods to handle the `OVERRIDEABLE` backend. (`torch/nn/attention/__init__.py`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156669
Approved by: https://github.com/guangyey, https://github.com/drisspg
Summary:
`None` and `Ellipsis` in multi-dimensional indexing was previously not covered.
Moreover, we introduce a small optimization for `slice(None)` and a passthrough when symints do not appear in the indexing.
The remaining case is where indexing is by tensor, which is fairly complicated; we passthrough in that case.
Test Plan:
added tests
Rollback Plan:
Differential Revision: D77943929
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157821
Approved by: https://github.com/pianpwk
Summary:
This DIFF is to fix the following issue:
In python source code for CompiledFxGraph,the FX graph segment for the Triton kernel is broken. For example, the following function
def fn(a, b, c):
x = torch.nn.functional.linear(a, b)
x = x.sin()
x = x.t() + c
return x
Inductor compiled this FX graph into two nodes: the first one is mm, the second one is a triton kernel for sin + transpose + add. The FX graph segment for the triton kernel is like the following:
Graph fragment:
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%permute_1, %arg2_1), kwargs = {})
Basically only "add" node in the FX graph.
The root cause is function caffe2/torch/_inductor/utils.py:gather_origins does not detect the realized node correctly.
To fix this issue, the IRNode is checked if it is one of the following IRNode:
ir.ComputedBuffer,
ir.InputsKernel,
ir.InputBuffer,
ir.ReinterpretView,
ir.TemplateBuffer,
If it is one of them, it is realized, otherwise, it is not.
Test Plan:
buck2 run mode/opt caffe2/test/inductor:provenance_tracing -- caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact.test_triton_kernel_to_post_grad_tracing_cuda
Rollback Plan:
Differential Revision: D77748371
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157578
Approved by: https://github.com/mlazos
@animesh pointed out using whitelist for strides can result in confusing graphs as follows
```
s60: "Sym(s60)", L_hidden_states_: "bf16[1, 4096, 3072][s60, 3072, 1]cuda:0"
```
We probably want to capture the relationship between sizes and strides anyways so let's make it so the whitelist only makes the sizes dynamic. That same graph now looks lik ethis
```
L_hidden_states_: "bf16[1, 4096, 64][262144, 64, 1]cuda:0"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157960
Approved by: https://github.com/pianpwk
Summary:
- adding mmap-ing
- more efficient writing in larger chunks
latency from ~150s to ~6s for simple row-wise consolidation of a 7gb model sharded across 4 ranks
Test Plan:
ran consolidation with the following code:
```
from torch.distributed.checkpoint._consolidate_hf_safetensors import consolidate_safetensors_files
import time
start_time = time.time()
consolidate_safetensors_files(base_path, consolidated_path)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
```
With the old code this was taking a couple minutes and this is now down to ~6s.
Internal users can find the tensor shards in the manifold path: manifold://ankita_test_bucket/tree/safetensors
Rollback Plan:
Differential Revision: D77960054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157936
Approved by: https://github.com/teja-rao, https://github.com/pradeepfn
If the final output file is in remote storage, then create a local temp directory to write the files and upload the files to the remotes storage after they are written.
Add a new config to the storage writer, `enable_consolidation`, so we don't need to rely on the presence of the `consolidation_output_path` to decide if consolidation is enabled. If `enable_consolidation` is True and `consolidation_output_path` isn't provided, the consolidated safetensors will be added to the same path as the sharded ones.
Differential Revision: [D77554585](https://our.internmc.facebook.com/intern/diff/D77554585/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157371
Approved by: https://github.com/pradeepfn
Fixes#157195
### Summary:
Fixed Issue 157195 by adding a new error message for torch.binomial in **aten/src/ATen/native/Distributions.cpp**
### Explanation
According to the issue,
```
import torch
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]))
```
`RuntimeError: Found dtype Float but expected Long`
It looks like we are getting a Tensor error rather than a binomial function error. Since the error is coming from **pytorch/aten/src/ATen/TensorIterator.cpp**, it seems like it is trying to align the tensor data to the same datatype for smooth tensor computations instead of giving a binomial function error.
I tried using both arguments as longs and both as ints and got the right binomial function error
```
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]).long())
NotImplementedError: "binomial_cpu" not implemented for 'Long'
```
```
torch.binomial(torch.tensor([10.0]).int(), torch.tensor([0.5]).int())
NotImplementedError: "binomial_cpu" not implemented for 'Int'
```
But when I have both as different datatypes, the TensorIterator.cpp error comes back trying to align the datatypes.
`RuntimeError: Found dtype Float but expected Long`
I then tried finding where the NotImplementation Error was documented and found it in **pytorch/aten/src/ATen/Dispatch.h** in lines 193 - 211
```
#define AT_DISPATCH_SWITCH(TYPE, NAME, ...) \
[&] { \
const auto& the_type = TYPE; \
constexpr const char* at_dispatch_name = NAME; \
/* don't use TYPE again in case it is an expensive or side-effect op */ \
at::ScalarType _st = ::detail::scalar_type(the_type); \
RECORD_KERNEL_FUNCTION_DTYPE(at_dispatch_name, _st); \
switch (_st) { \
__VA_ARGS__ \
default: \
TORCH_CHECK_NOT_IMPLEMENTED( \
false, \
'"', \
at_dispatch_name, \
"\" not implemented for '", \
toString(_st), \
"'"); \
} \
}()
```
In the **AT_DISPATCH_SWITCH** function, it picks a tensor and its datatype and checks if the Tensor datatype matches the supported datatypes. If not we get the Not Implemented error. Unfortunately, I think the **AT_DISPATCH_SWITCH** function, uses the `common_dtype` from TensorIterator in order to run. So TensorIterator.cpp needs to happen before the AT_DISPATCH_SWITCH function.
### Summary: We are getting the wrong error message because **TensorIterator.cpp** gets called and errors out due to Tensor datatype mismatch before we can get the right error message in **Dispatch.h** for torch.binomial not supporting that datatype.
### Options for the Fix
**Option 1**: Make the error message in TensorIterator.cpp more general so it applies to torch.binomial. An error message along the lines
`RunTime Error : "Tensor Datatypes", op.target_dtype," and ", common_dtype_, "are different "`
**Option 2**: Add an error message for the binomial function datatype mismatch before the the TensorIterator.cpp error message gets called.
Although Option 1 seemed easier I think Option 2 might be better as it is more specific to the binomial function while Option1 would affect all Tensors with datatype mismatch.
**This PR applies the fix for Option 2**
After Fix :
```
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]))
RuntimeError: Binomial function arguments count and prob must have same datatype of type Float, got: count = Long, prob = Float
```
```
torch.binomial(torch.tensor([10]).long(), torch.tensor([0.5]).long())
NotImplementedError: "binomial_cpu" not implemented for 'Long'
```
@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157658
Approved by: https://github.com/soulitzer
Previous to this PR, torch.onnx.export(..., dynamo=True, veriy=True, report=True) does not support symbolic arguments. Such examples are like follwing:
```python
class M(torch.nn.Module):
def forward(self, a, x):
return a + torch.tensor(1) + x
op = torch.onnx.export(M(), (1, torch.ones(2)),
dynamic_shapes=(torch.export.Dim.DYNAMIC, {0: torch.export.Dim.DYNAMIC}),
dynamo=True, report=True)
```
symbolic arguments are like constant arguments that they don't have tensor_meta wither. Besides, torch.export.export supports model inputs having constants, which is different from the legacy issue: https://github.com/pytorch/pytorch/issues/99534 where we tried to get the FX directly from dynamo export. Thus, `_remove_non_tensor` is deleted from args processing.
NOTE: If the ConstantArugment shows up in exported_program, it was kept to align the length of inputs to nn.Module, but it's irrelevant to the model graph, hwich is why in ONNX model the input is omitted.
The test `test_constant_argument_user_input_is_omitted_in_onnx_graph` needs #157719
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157734
Approved by: https://github.com/justinchuby
Dynamo was aggressively specializing on lazy VTs over `set_name_hint` in
`STORE_FAST`, etc., and `isinstance` in `LOAD_FAST_CHECK`. This causes
regional `torch.compile` from optimizing ComfyUI GGUF + LoRA to either
(1). exceed the recompialtion limit of 8, which results in suboptimal
performance, and (2). even if recompilation limit is increased, the
compilation time gets unnecessarily high (180s v.s. 20s for Flux).
This patch fixes the recompilation issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156891
Approved by: https://github.com/williamwen42, https://github.com/mlazos
Summary:
Change AOTI_RUNTIME_DEVICE_CHECK to the following depending on device:
AOTI_RUNTIME_CUDA_CHECK
AOTI_RUNTIME_XPU_CHECK
AOTI_RUNTIME_CPU_CHECK
Currently in the codebase, only `AOTI_RUNTIME_CUDA_CHECK` is used.
This shouldn't change anything as of now, but we do this to prepare for simultaneouly loading multiple backends (e..g CPU and CUDA) in AOTI standalone.
We don't want people writing `AOTI_RUNTIME_DEVICE_CHECK` for both CPU and CUDA checks. This could cause compilation problems when we statically link both CPU and CUDA models.
Test Plan:
CI
Rollback Plan:
Reviewed By: muchulee8
Differential Revision: D77742977
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157818
Approved by: https://github.com/jingsh
## Issue being addressed
`is_sparse` presents itself as determining if a tensor is sparse. HOWEVER, it only does checks against the tensor for `sparse_coo`. This has lead to confusion from developers as when non-coo sparse tensors are provided it return false, despite those tensors being sparse.
## Considered Remedy
Fixing this is do-able however would result in complexity as existing systems may depend on this behavior remaining consistent, and even inside of pytorch is_sparse is used by `bform` which states that it supports only `sparse_csr and sparse_coo` meaning additional work/thought would have to go into solving for `sparse_csc` and `sparse_bsr`
## Remedy provided in this PR
In lieu of these complications the lowest risk highest gain action was to add clear warning messaging to the function for now to avoid confusion to developers utilizing the function. The rest of the function behavior remains identical
## Issue content
Addresses issue number: #101385
Original issue: https://github.com/pytorch/pytorch/issues/101385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157378
Approved by: https://github.com/soulitzer
Summary: We added the ability to make Annotating Global or Local based on an input flag in PyTorch but didn't add the args to the linter
Reviewed By: mzzchy
Differential Revision: D77959409
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157858
Approved by: https://github.com/mzzchy
Add env var AOT_INDUCTOR_ENABLE_LTO to enable clang's ThinLTO by setting AOT_INDUCTOR_ENABLE_LTO=1. The LTO is disabled by default because it may increase the build time.
Rollback Plan:
Differential Revision: D77899195
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157773
Approved by: https://github.com/desertfire
This is to unblock "dp2ep" Expert Parallel + TP integration in torchtitan https://github.com/pytorch/torchtitan/pull/1324.
It does two things:
1. Slightly modifies the glue code for FSDP/HSDP + TP to work with FSDP/HSDP + EP and FSDP/HSDP + EP + TP. I kept the name `FSDPParam._tp_spec` to make the change minimal. We can consider renaming it in the future if it confuses people, but I heard @wanchaol has a plan to rewrite DTensor strided sharding entirely.
2. Lifts the check of `_validate_tp_mesh_dim` for `torch.distributed.tensor.parallel.parallelize_module`, as in EP or EP+TP this check is too strict. In particular it assumes a DeviceMesh must have `mesh_dim_names` which is not always true. I'm also removing the file `torch/distributed/tensor/parallel/_utils.py` it belongs entirely, as the other check `_deprecate_warnings`, added two years ago, is not used any more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157216
Approved by: https://github.com/wanchaol, https://github.com/weifengpy
I was confused about why the distributed tests weren't showing up quickly on HUD, its because the call of run_tests.py for distributed didn't include upload artifacts while running flag, so set it to default to IS_CI so I don't need to put the flag everywhere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157868
Approved by: https://github.com/huydhn
Per title, as it fails with the following error if "+PTX" was used in `TORCH_CUDA_ARCH_LIST`:
```
File "/usr/local/lib/python3.12/dist-packages/torch/profiler/_pattern_matcher.py", line 313, in skip
has_tf32 = all(int(arch[3:]) >= 80 for arch in torch.cuda.get_arch_list())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/profiler/_pattern_matcher.py", line 313, in <genexpr>
has_tf32 = all(int(arch[3:]) >= 80 for arch in torch.cuda.get_arch_list())
^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'pute_120'
```
Because slicing `arch[3:]` will not end up on having only digits for `compute_120` element of `torch.cuda.get_arch_list()`:
```python
>>> torch.cuda.get_arch_list()
['sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157711
Approved by: https://github.com/Skylion007, https://github.com/sraikund16
**Background:**
```Shell
[1376/2332] Building CUDA object caffe2/CMakeFiles/torch_...h/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu.o
/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(450): warning #68-D: integer conversion resulted in a change of sign
size_t numelIn_ = -1;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(451): warning #68-D: integer conversion resulted in a change of sign
size_t numelOut_ = -1;
^
/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(450): warning #68-D: integer conversion resulted in a change of sign
size_t numelIn_ = -1;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/root/Git.d/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp(451): warning #68-D: integer conversion resulted in a change of sign
size_t numelOut_ = -1;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157889
Approved by: https://github.com/mlazos
The cpu inductor processes .to(torch.uint8) incorrectly, leading to numerical inconsistencies. The convert_float_to_int8 function may return incorrect results for negative inputs, such as -2.xx, when the data type is uint8_t, producing 0 instead of 255. This issue stems from the clamping logic; we should avoid converting min_val to uint8_t too early
Fixes https://github.com/pytorch/pytorch/issues/156788
@leslie-fang-intel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157267
Approved by: https://github.com/leslie-fang-intel
Summary: We found there's a special case in recent APS model where the input tensor has smaller size compared to the split size. It will be automatically truncated in split.Tensor thus we add extra condition check for split_with_sizes when do the normalization.
Test Plan:
### unit
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_aten_normalization
```
Buck UI: https://www.internalfb.com/buck2/2ecd1ef8-8efe-4245-b4c8-282c23645b3c
Test UI: https://www.internalfb.com/intern/testinfra/testrun/7599824648585787
Network: Up: 3.9GiB Down: 9.2GiB (reSessionID-1396c91e-0dd2-457b-a49b-a6ab1f2a7d8f)
Loading targets. Remaining 0/5344 99617 dirs read, 1074949 targets declared
Analyzing targets. Remaining 0/123279 4988547 actions, 5966764 artifacts declared
Executing actions. Remaining 0/728058 209:52:59.9s exec time total
Command: test. Finished 12466 local, 209448 remote, 1226 cache (1% hit) 42:10.5s exec time cached (0%)
Time elapsed: 26:07.6s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
### E2E
before fix:
aps-afoc_apop_pt2_v0-db2fe0449a
after fix:
aps-afoc_apop_pt2_v0-755ad0cdc6
Rollback Plan:
Differential Revision: D77961394
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157857
Approved by: https://github.com/anijain2305
Summary: When `compile_standalone` is True, we set `package_cpp_only` to True as well. We raise an error if `package_cpp_only` is explicitly set to False in config.
Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r TestAOTInductorConfig
```
Rollback Plan:
Differential Revision: D77889754
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157731
Approved by: https://github.com/desertfire
Fixes#157673
For the call trace:
```
......
File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\common.py", line 2569, in reduction
return self.kernel.reduction(dtype, src_dtype, reduction_type, value)
File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 2155, in reduction
self._gen_parallel_reduction_buffers(acc, acc_type, reduction_type, init_dtype)
File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 1942, in _gen_parallel_reduction_buffers
reduction_prefix_array(
File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\codegen\cpp.py", line 335, in reduction_prefix_array
if cpp_builder.is_msvc_cl()
File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\cpp_builder.py", line 317, in is_msvc_cl
return _is_msvc_cl(get_cpp_compiler())
File "D:\Programs\Python\virtualenvs\torch_code-afvE469o\lib\site-packages\torch\_inductor\cpp_builder.py", line 240, in _is_msvc_cl
subprocess.check_output([cpp_compiler, "/help"], stderr=subprocess.STDOUT)
torch._inductor.exc.InductorError: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
```
On non-English language pack msvc environment, compiler path has raised `utf-8` issue. I add the `normalize_path_separator` to normalize the compiler path and avoid the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157835
Approved by: https://github.com/jansel
Today, we always create and record an events in two places:
1) Upon seeing the first producer, we record an event on the producer, and we wait for this event in two places: (1) when the engine goes to run the consumer, the consumer stream waits for this event. (2) prior to doing accumulation, the accumulation stream waits for this event.
2) After doing accumulation, we record an event on the accumulation stream and wait for this event in a single place: when the engine goes to run the consumer.
We do not actually need to record the event in the cases where the 1st producer stream is the same as the consumer and as the accumulation stream, and where the accumulation stream is the same as the consumer stream.
Removing this unnecessary create + record event should save a few us for each instance avoided.
Fixes https://github.com/pytorch/pytorch/issues/157407
----
Manual test plan:
- [x] @eqy to confirm perf is restored
- [x] Running the repro originally reported before/after the patch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157503
Approved by: https://github.com/eqy
ghstack dependencies: #155715
Previeusly, if users want to let pytorch determine the cuda arch when jit loading cuda extensions, they should left environment variable `TORCH_CUDA_ARCH_LIST` empty, but which will raise an warning. This commit add an option to set `TORCH_CUDA_ARCH_LIST=native`, to tell pytorch users want to use native cuda arch intentionally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156923
Approved by: https://github.com/ezyang
This implements a new `wait_stream` API in Work that matches how `wait` works for ProcessGroupNCCL for CPU based backends such as Gloo.
The idea is to support Gloo communication overlap in FSDPv2/HSDP with minimal changes to FSDP.
There was a previous attempt to make FSDPv2 use Work.wait but given the extensive stream semantics used it doesn't play nicely. https://github.com/pytorch/pytorch/pull/148780
This uses a "Baton" CUDA kernel which spinlocks on a pinned CPU tensor waiting for it to be set.
Test plan:
```
pytest test/distributed/test_c10d_gloo.py -v -k wait_stream
pytest test/distributed/test_c10d_nccl.py -v -k wait_stream
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156883
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
# Feature
If a Triton kernel has a complicated indexing expression, Inductor may decide to precompute it on the host and pass it to the kernel as an argument. This happens in situations like broadcasts with dynamic shapes.
This PR adds support for this feature to Inductor's FX IR backend.
We generate FX IR for precomputed size args in 3 steps:
1. In `PythonWrapperCodegen`, this PR refactors the relevant code to use a `SymbolicCallArgLine` instead of raw Python strings. This stores a (symbol, expr) pair. (Prior to this PR, it was (str, expr), but changing this to a symbol makes it easier to do substitutions later on.)
2. In `WrapperFxCodegen`, keep a dict of {symbol: expr} arg defs which gets updated whenever we see a `SymbolicCallArgLine`.
3. When the FX backend sees a `KernelCallLine`, it uses this dict to replace symbolic call args with their definitions.
In the longer run, it might be desirable to emit FX nodes defining these symbolic call args. That way, we could reuse the size computation when the same kernel is called multiple times. However, I wasn't sure if there was an existing way to generate FX nodes from a sympy expression, and implementing that seemed like overkill for the present purposes.
# Test plan
Added a new CI test exercising this feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157758
Approved by: https://github.com/jansel
Via google search I got to `torch.autograd.profiler` and implemented my code with it. Only to be taken by surprise finding `torch.profile.profiler`, which has a note saying the autograd one is legacy.
This just adds such note to `autograd.profiler` to avoid this confusion and waste of time to future people in my situation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157459
Approved by: https://github.com/sraikund16
Summary: These changes in D76442012 got reverted after the PR landed due to aps_models/ads/launchers/pearl/tests/ne/e2e_deterministic_tests:pearl_e2e_ne_tests failing with `Config not loaded due to no timely response from configerator. Likely configerator_proxy or falcon_proxy are not healthy`, but that test failing is definitely transient and unrelated to my changes, so re-creating the diff
Test Plan:
ensure tests pass
Rollback Plan:
Differential Revision: D77871099
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157715
Approved by: https://github.com/meetv18
We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device.
This pull request preserves the original tolerance threshold for CUDA/CPU devices and introduces a new key, higher_bf16_xpu, which only affects the XPU device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156920
Approved by: https://github.com/soulitzer
This is to unblock "dp2ep" Expert Parallel + TP integration in torchtitan https://github.com/pytorch/torchtitan/pull/1324.
It does two things:
1. Slightly modifies the glue code for FSDP/HSDP + TP to work with FSDP/HSDP + EP and FSDP/HSDP + EP + TP. I kept the name `FSDPParam._tp_spec` to make the change minimal. We can consider renaming it in the future if it confuses people, but I heard @wanchaol has a plan to rewrite DTensor strided sharding entirely.
2. Lifts the check of `_validate_tp_mesh_dim` for `torch.distributed.tensor.parallel.parallelize_module`, as in EP or EP+TP this check is too strict. In particular it assumes a DeviceMesh must have `mesh_dim_names` which is not always true. I'm also removing the file `torch/distributed/tensor/parallel/_utils.py` it belongs entirely, as the other check `_deprecate_warnings`, added two years ago, is not used any more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157216
Approved by: https://github.com/wanchaol, https://github.com/weifengpy
Summary:
I'm fairly sure the use of a custom metaclass is a holdover from pre-3.7 where Generic used a custom metaclass so we had to use multiple inheritance to avoid import-time failures.
At this point, `type(Generic)` is just `type` so it isn't needed, and we will get the least metaclass from our base classes, which means the `type(torch._C.Future)` isn't needed either, it will happen automatically just by inheritance.
Test Plan:
I'm fairly confident from local testing that this should be a no-op.
But also, Pytorch CI should give us pretty strong signal that this change doesn't break anything in case there's some edge case I missed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157757
Approved by: https://github.com/ezyang, https://github.com/Skylion007
This is useful for vLLM, which runs AOTAutograd directly on graphs after
they have been split.
I created a new flag for this instead of reusing
`keep_original_node_name` (please let me know if you think I should reuse this).
The reasoning is:
- The names of the placeholder nodes is different from the targets of
the placehoder nodes. The targets are the actual input names.
- Backwards compatibility: this API has been out for ~4 years, it
looks public, and it has extensive public use. For example, this change
would actually be BC-breaking to vLLM (they rely on the subgraph input
names being different at the moment).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157733
Approved by: https://github.com/ezyang
Update s390x test marks
test_logs_out from test/dynamo/test_logging.py is updated
and no longer fails on s390x.
test_qengine from test/test_torch.py doesn't work on s390x:
no QEngine is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157541
Approved by: https://github.com/huydhn
In this PR, we are enabling `HPU` device-specific function calls for random operations. These calls will manage the setting and unsetting of the `context of Random Number Generator`.
While HPU devices typically utilize a `Mersenne-based RNG`, Dtensor-specific random operations employ an `offset-based (Philox) RNG tracker` which is specifically integrated with `CUDA` in scope.
To integrate a similar offset-based RNG tracker within the `HPU backend`, a backend-specific device handle function is necessary to identify the execution context of these random operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156581
Approved by: https://github.com/jeromean, https://github.com/wanchaol
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. Introduce the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The another name `AllocatorConfig` is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.
# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the env variable config key-value pair
## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format: `key:value`. Use square brackets `[]` for list values Example: `key1:123, key2:[val1,val2]`
## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
Target determination sorts the tests in a PR CI run based on heuristics about which tests are more relevant to the PR's changes. This can help provide faster CI signal as well as help alleviate capacity concerns as job durations should decrease due to catching failures earlier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156545
Approved by: https://github.com/jeffdaily, https://github.com/clee2000
# Motivation
https://github.com/pytorch/pytorch/pull/155451 decoupled `torch._C._storage_Use_Count` from CUDA and introduced a corresponding unit test:
815545f2dd/test/test_torch.py (L257-L262)
However, this test fails when PyTorch is built with debug assertions enabled. @clee2000 disabled this UT in https://github.com/pytorch/pytorch/pull/156731. The root cause is that `_cdata` is obtained from an `intrusive_ptr`, not a `weak_intrusive_ptr`. As a result, calling `c10::weak_intrusive_ptr::use_count` on it triggers the internal assertion:
815545f2dd/c10/util/intrusive_ptr.h (L912-L917)
For example:
```python
a = torch.randn(10, device=device) # refcount=1, weakcount=1
prev_cf = torch._C._storage_Use_Count(a.untyped_storage()._cdata) # violate the assertation
```
This violates the expected invariant inside `weak_intrusive_ptr::use_count`, which assumes the pointer was originally constructed from a valid `weak_intrusive_ptr`. Actually, `storage_impl` is obtained from an `intrusive_ptr`.
815545f2dd/torch/csrc/Module.cpp (L2105-L2109)
# Solution
Use `c10::intrusive_ptr::use_count` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157694
Approved by: https://github.com/albanD
Getting the build issue because of enablement of data type fp8 for onednn in qlinear and qlinear_prepack file after this commit c2185dc4a5626848df37cad214b73d5ae7dd4f17
Currrently cpuinfo is disable for power system because of that it is giving below error.
**Error:**
‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
Made a required changes and now build issue got fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157469
Approved by: https://github.com/malfet
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.
Tested against system install location: /usr/local/lib and /usr/local/include.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
This PR adds a new config option, `caching_precompile`, and a `DynamoCache`, which loads and saves Dynamo Cache entries automatically. It also hooks up DynamoCache to PrecompileContext, so that we can save multiple cache entries.
When this configuration is turned on, we:
- Automatically create and initialize a CompilePackage on every torch.compile
- Automatically use BundledAutogradcache
- Automatically save the CompilePackage entry to DynamoCache after every compile
You can also use PrecompileContext.serialize() to manually serialize a full object.
I've added unit tests to exhibit this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155913
Approved by: https://github.com/zhxchen17
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.
Tested against system install location: /usr/local/lib and /usr/local/include.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
The structure is
```
torch/
__init__.py
version.py
```
When we import torch, only `torch/__init__.py` is executed by default.
The submodules like `version.py` are not automatically imported or attached to the torch module.
So without anything in `__init__.py`, `torch.version` may not be found. So in this PR, we make the import explicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157584
Approved by: https://github.com/ezyang
Summary: We want to add versioning to DCP to the metadata so that whenever planner logic changes, we can use the version on save to determine how to load the data
Test Plan:
added a test
Rollback Plan:
Differential Revision: D76135887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155343
Approved by: https://github.com/teja-rao
Note on backward precision over fp16:
A float16 number has 10 bits of mantissa, 5 bits of exponent, and 1 bit for the sign. If the sign bit is positive, then with a mantissa $m$ and exponent $e$ represented in base 10, the number that the float16 format represents is $(1 + m / 1024) \exp2(e)$. ([source](https://en.wikipedia.org/wiki/Half-precision_floating-point_format))
Consider adding two numbers $a$ and $b$ which have arbitrary mantissas, and say their exponents are $e_a = 1$ (so $2 \le a \lt 4$) and $e_b=-3$ (so $0.175 \le b \lt 0.25$). Assume that the result has the same exponent as $a$. Since the exponents differ by 4, we'll effectively need to truncate the 4 rightmost bits of $b$'s mantissa, which would introduce a maximum error on the order of $(2^4 / 1024) \exp2(-3) \approx 0.002$.
The error is nearly the same if $e_b = -2$ (so $0.25 \le b \lt 0.5$), where the 3 rightmost bits are truncated, giving a maximum error on the order of $(2^3 / 1024) \exp2(-2) \approx 0.002$. Same for $e_b=-1$.
So if we're adding up nine different numbers that all have exponents -3, -2, or -1, and they sum to a number with exponent 1, then we would expect a maximum error of several times greater than 0.002. In my comments above, summing those particular nine numbers in different ways gave results that ranged between 3.1816 and 3.1758, a difference of $0.0058 \approx 2.9 * 0.002$.
That's within the acceptable bounds, and we can safely just increase the error tolerance used in test_output_grad_match for the case of max_pool3d_backward with float16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157498
Approved by: https://github.com/malfet
This PR fixes a minor typo in a comment in `torch/_dynamo/variables/torch.py`, changing 'paramter' to the correct spelling 'parameter'.
These small but meaningful changes help improve code readability and maintain the overall quality of the codebase.
Thanks for your time and review!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157651
Approved by: https://github.com/Skylion007
***SUMMARY***
The main function in this tests overrides that of the Gtest framework which contains it's `RUN_ALL_TESTS()` function. The main function in this test is called conditionally when conditions apply, in this case, when the C10_MOBILE directive is provided. This is wrong as we always want to call the `RUN_ALL_TEST()` function.
In this PR, we only make the test suite available for cases that apply, i.e if the C10_MOBILE directive exist which represents the caching allocator and is only exposed on mobile
***TEST PLAN***
This tests should run in modes where it applies which should be covered in the CI run.
Below shows a sample run in the dev-nosan mode which do not have the cache allocator
BEFORE
```
buck test fbcode//caffe2:cpu_caching_allocator_test
Discovered 0. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0
⚠ Listing failed: caffe2:cpu_caching_allocator_test
Listing tests failed with error:
Failed to read from /data/users/ysuleiman/fbsource/buck-out/v2/test/buck-out/v2/test_discovery/fbcode/6dcc55a61c1b90b3/default/tpx_execution_dir/gtest_output_file.json. Listing process stdout: , stderr:
```
AFTER
```
buck test '@fbcode//mode/dev-nosan' fbcode//caffe2:cpu_caching_allocator_test
Analyzing targets. Remaining 0/46242 1871690 actions, 2251668 artifacts declared
Executing actions. Remaining 0/257870 83:28:24.4s exec time total
Command: test. Finished 10 remote, 112314 cache (99% hit) 83:22:43.5s exec time cached (99%)
Time elapsed: 2:57.7s
Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0
NO TESTS RAN
```
Rollback Plan:
steps:
- manual.note:
content: Revert this diff
Reviewed By: patskovn
Differential Revision: D77229077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156816
Approved by: https://github.com/kimishpatel
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
#test that import: test_cuda.py
setuptools-git-versioning==2.1.0
scikit-build==0.18.1
pyre-extensions==0.0.32
tabulate==0.9.0
#Description: These package are needed to build FBGEMM and torchrec on PyTorch CI
The process involves compiling thousands of files, and would take a long time. Fortunately, the compiled objects can be useful for your next build. When you modify some files, you only need to compile the changed files the next time.
@ -88,14 +88,13 @@ source venv/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
* If you want to have no-op incremental rebuilds (which are fast), see [Make no-op build fast](#make-no-op-build-fast) below.
* When installing with `python setup.py develop` (in contrast to `python setup.py install`) Python runtime will use
* When installing with `python -m pip install -e .` (in contrast to `python -m pip install .`) Python runtime will use
the current local source-tree when importing `torch` package. (This is done by creating [`.egg-link`](https://wiki.python.org/moin/PythonPackagingTerminology#egg-link) file in `site-packages` folder)
This way you do not need to repeatedly install after modifying Python files (`.py`).
However, you would need to reinstall if you modify Python interface (`.pyi`, `.pyi.in`) or
@ -785,7 +793,7 @@ If you are editing a single file and rebuilding in a tight loop, the time spent
Starting with CMake 3.29, you can specify the linker type using the [`CMAKE_LINKER_TYPE`](https://cmake.org/cmake/help/latest/variable/CMAKE_LINKER_TYPE.html) variable. For example, with `mold` installed:
Note on OpenMP: The desired OpenMP implementation is Intel OpenMP (iomp). In order to link against iomp, you'll need to manually download the library and set up the building environment by tweaking `CMAKE_INCLUDE_PATH` and `LIB`. The instruction [here](https://github.com/pytorch/pytorch/blob/main/docs/source/notes/windows.rst#building-from-source) is an example for setting up both MKL and Intel OpenMP. Without these configurations for CMake, Microsoft Visual C OpenMP runtime (vcomp) will be used.
@ -329,7 +335,6 @@ Additional libraries such as
You can refer to the [build_pytorch.bat](https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/win-test-helpers/build_pytorch.bat) script for some other environment variables configurations
```cmd
cmd
@ -349,8 +354,7 @@ for /f "usebackq tokens=*" %i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\
:: [Optional] If you want to override the CUDA host compiler
set CUDAHOSTCXX=C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64\cl.exe
// is possible for the backward function to return an undefined grad for some
// grad_input for each example. In that case, we return an undefined grad.
//
// It is theoretically posssible for *some* of the examples to produce an
// It is theoretically possible for *some* of the examples to produce an
// undefined grad (a kernel could peek at the gradient values and return an
// undefined tensor if it determines the gradient is full of zeros). We
// could handle this by treating the undefined grad as a zero-filled tensor
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.