Compare commits

298 Commits

Author SHA1 Message Date
e44ca7305f vllm setup
Signed-off-by: Yang Wang <elainewy@meta.com>
2025-07-21 17:40:32 -07:00
2bb684304d Fix the typos in the right nav by pulling the latest theme (#158746)
This will fix broken links in the right nav.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158746
Approved by: https://github.com/malfet
2025-07-21 22:51:07 +00:00
f09a484b81 Remove is_arvr_mode() from xnnpack.buck.bzl (#158682)
Summary:
**Changes**
*   Deleted function import from build definition utilities
    *   Removed `load("//tools/build_defs:fbsource_utils.bzl", "is_arvr_mode")`
*   Replaced is_arvr_mode() function calls with direct references to configuration flags
    *  Changed from `is_arvr_mode()` to `"ovr_config//build_mode:arvr_mode"`
*   Changed conditional expressions to Buck `select()` statements

Test Plan:
Check if CI passes

Rollback Plan:

Differential Revision: D78520947

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158682
Approved by: https://github.com/malfet
2025-07-21 22:49:26 +00:00
feaa02f9ad Revert "[build] pin setuptools>=77 to enable PEP 639 (#158104)"
This reverts commit a78fb63dbdf98a1db219095293de1a11005e0390.

Reverted https://github.com/pytorch/pytorch/pull/158104 on behalf of https://github.com/malfet due to It still breaks inductor-perf-nightly, see https://github.com/pytorch/pytorch/actions/runs/16425364208/job/46417088208, I'm going to dismiss all previous reviews ([comment](https://github.com/pytorch/pytorch/pull/158104#issuecomment-3099706457))
2025-07-21 22:46:53 +00:00
b3c868d603 [vllm]Add vllm.txt for pinned commit (#158754)
It seems nightly.yml won't auto-generate the txt file when it does not exist, so this adds the file with the latest merged commit from vLLM:

[vllm commit](https://github.com/vllm-project/vllm/commits/main)

Error:
https://github.com/pytorch/pytorch/actions/runs/16405915719/job/46351847504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158754
Approved by: https://github.com/huydhn
2025-07-21 22:41:07 +00:00
cab28330f8 Setup TorchBench in Docker (#158613)
This reduces the time spent setting up TorchBench on A100/H100 by another half an hour.

### Testing

* H100 benchmark https://github.com/pytorch/pytorch/actions/runs/16396172453.  Once this is done, I will review the results on [HUD](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2011%20Jul%202025%2023%3A01%3A24%20GMT&stopTime=Fri%2C%2018%20Jul%202025%2023%3A01%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/huydhn/6/head&lCommit=14a38c719b29a19f518239b5edb084838ac5d2fb&rBranch=main&rCommit=0a99b026d6bd0f67dc2c0a20fe3228ddc4144854) to confirm that all models are there
* A100 benchmark https://github.com/pytorch/pytorch/actions/runs/16396173932

Signed-off-by: Huy Do <huydhn@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158613
Approved by: https://github.com/janeyx99
2025-07-21 22:34:08 +00:00
4366610f5a [c10d] block_current_stream: correctness fixes (#158757)
This fixes a number of issues that were present in https://github.com/pytorch/pytorch/pull/156883 as pointed out by @ngimel

Test plan:

Expanded tests to cover use after free behavior + non-default stream

```
pytest test/distributed/test_c10d_pypg.py -v -k block_current_stream
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158757
Approved by: https://github.com/ngimel
2025-07-21 22:23:44 +00:00
dd0adc9386 [SymmMem] Add NVSHMEM broadcast support into Triton (#158514)
Adds broadcast collective operation for distributing data from root PE to all other PEs in NVSHMEM Triton kernels.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_broadcast`
<details>
<summary> Quick debug print for sanity check </summary>

```markdown
============================================================
[Rank 0] Starting broadcast test with world_size=2
============================================================
[Rank 0] Configuration:
  - nelems: 4
  - dtype: torch.int64, element_size: 8 bytes
  - nelems_bytes: 32
============================================================
[Rank 1] Starting broadcast test with world_size=2
============================================================
[Rank 1] Configuration:
  - nelems: 4
  - dtype: torch.int64, element_size: 8 bytes
  - nelems_bytes: 32
[Rank 1] Non-root source data: [-1, -1, -1, -1]
[Rank 0] Root source data: [100, 101, 102, 103]
[Rank 1] Initial destination: [-999, -999, -999, -999]
[Rank 0] Initial destination: [-999, -999, -999, -999]
[Rank 0] Executing broadcast operation...
[Rank 1] Executing broadcast operation...
[Rank 0] Broadcast operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 1] Broadcast operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 1] Results after broadcast:
[Rank 0] Results after broadcast:
[Rank 1] Destination buffer: [100, 101, 102, 103]
[Rank 1] Expected: [100, 101, 102, 103]
[Rank 0] Destination buffer: [100, 101, 102, 103]
[Rank 0] Expected: [100, 101, 102, 103]
[Rank 1] Match: ✓
[Rank 0] Match: ✓
[Rank 1] ============================================================
[Rank 1] Broadcast test PASSED ✓
[Rank 1] Summary: Root PE 0 broadcasted [100, 101, 102, 103] to all PEs
[Rank 1] ============================================================
[Rank 0] ============================================================
[Rank 0] Broadcast test PASSED ✓
[Rank 0] Summary: Root PE 0 broadcasted [100, 101, 102, 103] to all PEs
[Rank 0] ============================================================
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158514
Approved by: https://github.com/fduwjj, https://github.com/mandroid6
ghstack dependencies: #158511, #158512, #158513
2025-07-21 22:23:26 +00:00
734826d88e Revert "[AOTI] windows package load dev (#158671)"
This reverts commit d42c40976727fed4c9908d4194f26917d0a3da66.

Reverted https://github.com/pytorch/pytorch/pull/158671 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @angelayi can you please help them validate the fixes internally? You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158671#issuecomment-3099570374))
2025-07-21 22:20:46 +00:00
5a56e6a72b Revert "[AOTI] fix extract file failed on Windows. (#158702)"
This reverts commit 7cc1a9546c135f8e7635e0d38aa2bba797f8907d.

Reverted https://github.com/pytorch/pytorch/pull/158702 on behalf of https://github.com/ZainRizvi due to Sorry but I had to revert this PR in order to revert https://github.com/pytorch/pytorch/pull/158671 ([comment](https://github.com/pytorch/pytorch/pull/158702#issuecomment-3099556215))
2025-07-21 22:18:19 +00:00
e8af168ee0 Revert "[AOTI] normalize path and process model files. (#158705)"
This reverts commit ff0da08f4bc5ee135b495926cd58a36a1c0e1a5b.

Reverted https://github.com/pytorch/pytorch/pull/158705 on behalf of https://github.com/ZainRizvi due to Sorry but I had to revert this PR in order to revert https://github.com/pytorch/pytorch/pull/158671 ([comment](https://github.com/pytorch/pytorch/pull/158705#issuecomment-3099532516))
2025-07-21 22:16:03 +00:00
97d7dc197f Revert "[AOTI] Convert C-struct zip handling to RAII container (#158687)"
This reverts commit 8ed5e1844c77d952bcea89ca7d0225d876fec4e8.

Reverted https://github.com/pytorch/pytorch/pull/158687 on behalf of https://github.com/ZainRizvi due to Sorry but I had to revert this PR in order to revert https://github.com/pytorch/pytorch/pull/158671 ([comment](https://github.com/pytorch/pytorch/pull/158687#issuecomment-3099515618))
2025-07-21 22:13:26 +00:00
9498d95b9c [Dynamo][BetterEngineering] Type trace_rules.py (#158679)
As part of Better Engineering week, we would like to improve our type support to improve the dev experience in Dynamo.

This PR adds strict typing support to a core file, `trace_rules.py`
Running
```
mypy torch/_dynamo/trace_rules.py   --linecount-report /tmp/coverage_log
```
| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2564 | 3997 | 64.15% | 34 | 53 | 64.15% |
| This PR | 4022 | 4022 | 100.00% | 53 | 53 | 100.00% |
| Delta    | +1458 | +25 | +35.85% | +19 | 0 | +35.85% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158679
Approved by: https://github.com/williamwen42
2025-07-21 22:12:59 +00:00
0e46f54286 [ROCm][CI] update HIP patch for 6.4.1 (#158651)
The patch is intended to fix hipGraph capture for some MIOpen kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158651
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-21 22:09:36 +00:00
216ba6e5f2 Fix MaskedTensor to device ignored mask (#151205)
Fixes #147140

## Changes

- Add a `to` implementation in `MaskedTensor` so that the `mask` is also moved to the target device

## Test Result

```python
In [1]: import torch
   ...: from torch.masked import as_masked_tensor
   ...: data = torch.tensor([1,2,3])
   ...: mask = torch.tensor([True,False,True])
   ...: mt = as_masked_tensor(data, mask).to('cuda')
   ...: mt.get_data().device, mt.get_mask().device
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  return MaskedTensor(data, mask)
/home/zong/code/pytorch/torch/masked/maskedtensor/_ops_refs.py:354: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  return MaskedTensor(new_data, _maybe_get_mask(args[0]))
Out[1]: (device(type='cuda', index=0), device(type='cuda', index=0))

In [2]: mt.sum(dim=0)
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
  return MaskedTensor(data, mask)
Out[2]: MaskedTensor(4, True)

```

```bash
pytest test/test_maskedtensor.py -vv
```

![image](https://github.com/user-attachments/assets/640b809c-b4f0-4aca-a09e-04049017a745)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151205
Approved by: https://github.com/ezyang
2025-07-21 21:44:49 +00:00
c774180e59 Bump requests from 2.32.2 to 2.32.4 in /tools/build/bazel (#158006)
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/releases">requests's releases</a>.</em></p>
<blockquote>
<h2>v2.32.4</h2>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file. (<a href="https://redirect.github.com/psf/requests/issues/6965">#6965</a>)</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
<li>Dropped support for pypy 3.9 following its end of support. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
</ul>
<h2>v2.32.3</h2>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/blob/main/HISTORY.md">requests's changelog</a>.</em></p>
<blockquote>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS.</li>
<li>Dropped support for pypy 3.9 following its end of support.</li>
</ul>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="021dc729f0"><code>021dc72</code></a> Polish up release tooling for last manual release</li>
<li><a href="821770e822"><code>821770e</code></a> Bump version and add release notes for v2.32.4</li>
<li><a href="59f8aa2adf"><code>59f8aa2</code></a> Add netrc file search information to authentication documentation (<a href="https://redirect.github.com/psf/requests/issues/6876">#6876</a>)</li>
<li><a href="5b4b64c346"><code>5b4b64c</code></a> Add more tests to prevent regression of CVE 2024 47081</li>
<li><a href="7bc45877a8"><code>7bc4587</code></a> Add new test to check netrc auth leak (<a href="https://redirect.github.com/psf/requests/issues/6962">#6962</a>)</li>
<li><a href="96ba401c12"><code>96ba401</code></a> Only use hostname to do netrc lookup instead of netloc</li>
<li><a href="7341690e84"><code>7341690</code></a> Merge pull request <a href="https://redirect.github.com/psf/requests/issues/6951">#6951</a> from tswast/patch-1</li>
<li><a href="6716d7c9f2"><code>6716d7c</code></a> remove links</li>
<li><a href="a7e1c745dc"><code>a7e1c74</code></a> Update docs/conf.py</li>
<li><a href="c799b8167a"><code>c799b81</code></a> docs: fix dead links to kenreitz.org</li>
<li>Additional commits viewable in <a href="https://github.com/psf/requests/compare/v2.32.2...v2.32.4">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=requests&package-manager=pip&previous-version=2.32.2&new-version=2.32.4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158006
Approved by: https://github.com/Skylion007

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-21 21:35:38 +00:00
a991e285ae [AOTI] Add more default options to compile_standalone (#158560)
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
2025-07-21 21:16:48 +00:00
9e0473b566 removed zero dim cpu logic from fake_tensor.py (#147501)
Fixes #144748
In #144748, an inconsistency between eager mode and inductor mode was reported as a bug.
The root cause is that fake_tensor.py's find-common-device method, 0b0da81021/torch/_subclasses/fake_tensor.py (L833), takes zero-dim CPU tensors into account, but the device check in adaption.h doesn't.

This fix adds a list of ops that bypass the zero-dim-CPU-tensor check, to align with eager mode.
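
As a hedged illustration of the eager-mode rule being matched here (the example below is mine, not from the PR): a zero-dim CPU tensor is allowed to mix with a CUDA tensor in eager, and the result stays on CUDA.

```python
import torch

if torch.cuda.is_available():
    gpu = torch.randn(4, device="cuda")
    scalar_cpu = torch.tensor(2.0)   # zero-dim CPU tensor
    out = gpu * scalar_cpu           # allowed in eager; no device-mismatch error
    assert out.device.type == "cuda"
```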

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147501
Approved by: https://github.com/ezyang
2025-07-21 21:11:10 +00:00
5e17932c22 [DCP] Add support for ShardedTensor to PgTransport (#158573)
Add support for ShardedTensors when PGTransport is used to send/recv checkpoints.

Test is pulled from https://github.com/pytorch/pytorch/pull/157963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158573
Approved by: https://github.com/meetv18
2025-07-21 21:04:23 +00:00
6b0526a2c4 ban fusion of large amount of reads (#158667)
This is a reland attempt of https://github.com/pytorch/pytorch/pull/157563, but instead of introducing the `realize_acc_reads_size_threshold` config with a default value, we set it to `None` for now to unblock an internal use case. We will dig into the issue and harden the logic in later PRs.
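
A minimal sketch of how the knob might be set once a default is chosen, assuming it is exposed like other Inductor options under `torch._inductor.config` (the exact path is not confirmed here):

```python
import torch._inductor.config as inductor_config

# None (the current value per this PR) disables the fusion ban;
# a byte-size threshold would re-enable it.
inductor_config.realize_acc_reads_size_threshold = None
```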

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158667
Approved by: https://github.com/yf225
2025-07-21 21:00:40 +00:00
bc379aebe2 Revert "Still run TritonBundler with BundledAOTAutogradCache, save autotune results (#158048)"
This reverts commit 8e57cdb746b4ab28865fdf01532f87b0d21700e9.

Reverted https://github.com/pytorch/pytorch/pull/158048 on behalf of https://github.com/jeffdaily due to rocm failures due to unit test introduced in this PR, but no pre-merge signal available ([comment](https://github.com/pytorch/pytorch/pull/158048#issuecomment-3098746624))
2025-07-21 20:45:21 +00:00
b1a0c34dd3 [pt2 event logging] add configurable prefix (#157678)
Summary:
# Why

make experiments easier to find

# What

- dynamo config to provide a prefix
- use the prefix when sending data to scuba through the self.id_ field

Test Plan:
```
# code edited to set the prefix as `coconutruben-02`
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx040
```

on scuba

```
| autotune_dtypes | autotune_offset | autotune_shape | autotune_strides | event | run_id |
| -----| -----| -----| -----| -----| ----- |
| "torch.float16, torch.float16" | "0, 0" | "4096x3008, 3008x2048" | "[3008, 1], [2048, 1]" | "mm_template_autotuning" | "coconutruben-02-e6bdccc5-6dcf-4d68-9a04-b34f2c6d94fd" |
| "torch.float16, torch.float16" | "0, 0" | "4096x3008, 3008x2048" | "[3008, 1], [2048, 1]" | "mm_template_autotuning" | "coconutruben-02-14165153-5842-4eaa-9e6c-3b0cbc016375" |

```

Rollback Plan:

Differential Revision: D77837550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157678
Approved by: https://github.com/stashuk-olek
2025-07-21 20:41:03 +00:00
851e953f68 ci: Only run lint jobs on relevant files (#158773)
Conditionally run lint jobs on relevant files. This is mainly targeted at clang-tidy, since it takes a long time, but it also includes mypy, since that saves an additional 4 minutes of runtime.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158773
Approved by: https://github.com/malfet
2025-07-21 20:21:34 +00:00
b66f429827 Fix torch.randint, torch.mul param missing description (#158731)
The wrong separator caused the param description to be truncated.

- Change the separator between a param and its description
- Remove quotes so `torch.dtype` is displayed as a reference to the class
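
A small sketch of the docstring convention assumed here (an illustrative stub, not the actual patch): each argument line uses `name (type): description`, so the renderer keeps the description attached to the parameter.

```python
def randint_stub(high, size, dtype=None):
    """Return random integers in ``[0, high)`` (illustrative stub).

    Args:
        high (int): the exclusive upper bound for the random integers.
        size (tuple): the shape of the output tensor.
        dtype (torch.dtype, optional): the desired data type of the result.
    """
```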

## Test Result

### Before

<img width="1092" height="784" alt="image" src="https://github.com/user-attachments/assets/e8d96b26-07e9-40ff-9392-fa6665d4bbe4" />
<img width="1111" height="457" alt="image" src="https://github.com/user-attachments/assets/a3c2e333-f861-4aeb-b4fb-05c8d880ae81" />

### After

<img width="897" height="820" alt="image" src="https://github.com/user-attachments/assets/d1b5cefa-717a-4223-84b0-4346b7eecf44" />
<img width="872" height="409" alt="image" src="https://github.com/user-attachments/assets/96223c37-cd9d-4656-9e55-032d09cbe5c1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158731
Approved by: https://github.com/ngimel
2025-07-21 20:17:27 +00:00
ea5b06ed5b [Dynamo][BetterEngineering] Type side_effects.py (#158605)
As part of Better Engineering week, we would like to improve our type support to improve the dev experience in Dynamo.

This PR adds strict typing support to a core file, `side_effects.py`
Running
```
mypy torch/_dynamo/side_effects.py   --linecount-report /tmp/coverage_log
```
| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  365 | 1166 | 31.30% | 16 | 51 | 31.37% |
| This PR | 1185 | 1185 | 100.00% | 51 | 51 | 100.00% |
| Delta    | +820 | +19 | +68.70% | +35 | 0 | +68.63% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158605
Approved by: https://github.com/StrongerXi
2025-07-21 19:34:14 +00:00
25fbf09d5f Use more fine-grained locks in sym mem kernels (#158523)
Summary: Use only acquire at the beginning of the kernel, and only release at the end

Test Plan:
Existing tests

Rollback Plan:

Differential Revision: D78458020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158523
Approved by: https://github.com/drisspg, https://github.com/kwen2501
2025-07-21 19:23:47 +00:00
22920c9138 Grab bag of (mostly) typing improvements (#158075)
Collects some scattershot improvements made while attempting to enable training for AOTInductor. Non-typing changes are:

1. Swapping a few custom searches for the output node in an FX graph for calling `graph.output_node()` (see the sketch after this list).
2. Removing two unused parameters from `torch.export._unlift._unlift`.
3. Switching handles to constants in `cpp_wrapper_cpu` to use C++ references for memory efficiency.
4. Cleaning out unused, unexported imports from `torch/export/__init__.py`, and adding one missing export to `__all__`.
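
A small sketch of change 1 above, assuming `Graph.output_node()` behaves as the commit describes; the function and variable names here are illustrative.

```python
import torch
from torch import fx

def f(x):
    return x + 1

gm = fx.symbolic_trace(f)

# Old style: manually search for the output node
out = next(n for n in gm.graph.nodes if n.op == "output")

# New style referenced above
out2 = gm.graph.output_node()
assert out is out2
```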

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158075
Approved by: https://github.com/Skylion007
2025-07-21 19:17:01 +00:00
ad2dec1997 [SymmMem] Add NVSHMEM alltoall support into Triton (#158513)
Implements collective alltoall operation for NVSHMEM Triton kernels. Enables data exchange where each PE sends unique data to every other PE in the team.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_alltoall`

<details>
<summary>Quick debug print for sanity check</summary>

```markdown
============================================================
[Rank 0] Starting alltoall test with world_size=2
============================================================
[Rank 0] Configuration:
  - nelems_per_pe: 2
  - dtype: torch.int64, element_size: 8 bytes
  - nelems_bytes: 16
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:1653: NULL value get_device_list failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:1653: NULL value get_device_list failed
[Rank 0] Preparing source data:
[Rank 1] Preparing source data:
  - Data for PE 0: [0, 0] (indices 0-1)
  - Data for PE 1: [1, 1] (indices 2-3)
[Rank 0] Complete source buffer: [0, 0, 1, 1]
  - Data for PE 0: [100, 100] (indices 0-1)
  - Data for PE 1: [101, 101] (indices 2-3)
[Rank 1] Complete source buffer: [100, 100, 101, 101]
[Rank 1] Initial destination buffer: [-1, -1, -1, -1]
[Rank 0] Initial destination buffer: [-1, -1, -1, -1]
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[rank0]:[W716 15:30:06.215666766 ProcessGroupNCCL.cpp:5064] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
[rank1]:[W716 15:30:06.215752786 ProcessGroupNCCL.cpp:5064] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
NCCL version 2.27.5+cuda12.4
[Rank 1] Executing alltoall operation...
[Rank 0] Executing alltoall operation...
[Rank 1] alltoall operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 0] alltoall operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[Rank 0] Results after alltoall:
[Rank 1] Results after alltoall:[Rank 0] Destination buffer: [0, 0, 100, 100]
[Rank 0] Verifying results:
  - From PE 0 (indices 0-1):
    Expected: [0, 0]
    Actual:   [0, 0]
[Rank 1] Destination buffer: [1, 1, 101, 101]
[Rank 1] Verifying results:
  - From PE 0 (indices 0-1):
    Expected: [1, 1]
    Actual:   [1, 1]
    Match:    ✓
    Match:    ✓
  - From PE 1 (indices 2-3):
    Expected: [100, 100]
  - From PE 1 (indices 2-3):
    Expected: [101, 101]
    Actual:   [100, 100]
    Actual:   [101, 101]
    Match:    ✓
    Match:    ✓
[Rank 0] ============================================================
[Rank 0] Summary: ALL TESTS PASSED ✓
[Rank 0] Data flow explanation:
  - Each rank sends 2 elements to every other rank
[Rank 1] ============================================================
[Rank 1] Summary: ALL TESTS PASSED ✓
  - Rank 0 sent: [0, 0, 1, 1]
[Rank 1] Data flow explanation:
  - Each rank sends 2 elements to every other rank
  - Rank 0 received: [0, 0, 100, 100]
  - My data for PE 0 (0) went to PE 0's buffer
  - I received PE 0's data for me (0)
  - My data for PE 1 (1) went to PE 1's buffer
  - Rank 1 sent: [100, 100, 101, 101]
  - I received PE 1's data for me (100)
[Rank 0] ============================================================
  - Rank 1 received: [1, 1, 101, 101]
  - My data for PE 0 (100) went to PE 0's buffer
  - I received PE 0's data for me (1)
  - My data for PE 1 (101) went to PE 1's buffer
  - I received PE 1's data for me (101)
[Rank 1] ============================================================
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158513
Approved by: https://github.com/fduwjj, https://github.com/mandroid6
ghstack dependencies: #158511, #158512
2025-07-21 19:14:47 +00:00
662dd7db5b [cutlass backend] cache maybe_append_choices (#156781)
This PR attempts to cache:
* codegen for the cutlass backend for the same kernel, even if the runtime params are different.

From some profiling, most of the time is spent in render, so we only target caching that part for now.

The output of render is `code`, and we are able to cache that easily. Also, I have to cache size_args, since it depends on `kernel.get_dynamic_shape_args()`, which depends on the state of self when we call render.

make_key is doing most of the work here: We are hashing on input node layouts, output node layout and op.configuration_name() (this is what hash(op) would do anyway).
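
A generic sketch of the caching idea (not the actual Inductor code): key the expensive render step on the pieces that determine its output, and reuse the generated code across calls that share them.

```python
_render_cache: dict[tuple, str] = {}

def cached_render(input_layouts, output_layout, op_name, render_fn):
    # Key mirrors make_key as described above: input node layouts,
    # output node layout, and the op's configuration name.
    key = (tuple(input_layouts), output_layout, op_name)
    if key not in _render_cache:
        _render_cache[key] = render_fn()  # expensive codegen happens only once
    return _render_cache[key]

code1 = cached_render(["row", "row"], "row", "gemm_128x128", lambda: "kernel body")
code2 = cached_render(["row", "row"], "row", "gemm_128x128", lambda: "never called")
assert code1 == code2
```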

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156781
Approved by: https://github.com/ColinPeppler
2025-07-21 19:02:39 +00:00
72db0a98a3 Revert "[DTensor] Assert DTensorSpec has valid placements (#158133)"
This reverts commit 1839e8d04b81ee6eda0cff6fbfc218a7a600f6f7.

Reverted https://github.com/pytorch/pytorch/pull/158133 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D78496151 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158133#issuecomment-3097994857))
2025-07-21 18:54:07 +00:00
8ed5e1844c [AOTI] Convert C-struct zip handling to RAII container (#158687)
Attempts to fix a memory leak reported in #158614 by wrapping manually managed MiniZ C-structs in an RAII container. I have been unable to reproduce the reported leak, but this seems like the most likely candidate.

Fixes #158614 (hopefully)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158687
Approved by: https://github.com/desertfire
2025-07-21 18:53:14 +00:00
393fecb2cc [Optimus][Unit test] clean up the unit test (#158696)
Summary: We should only patch the specific pattern(s) for each unit test.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```

Buck UI: https://www.internalfb.com/buck2/f8d37674-91c4-4244-90fa-f24fc3f91e4b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275088644915
Network: Up: 100KiB  Down: 233KiB  (reSessionID-92039f44-bc6f-4e78-87b1-93bca1bd1c66)
Analyzing targets. Remaining     0/296
Executing actions. Remaining     0/20196                                                                    5.8s exec time total
Command: test.     Finished 2 local, 2 cache (50% hit)                                                      4.6s exec time cached (79%)
Time elapsed: 3:55.1s
Tests finished: Pass 13. Fail 0. Fatal 0. Skip 0. Build failure 0

Rollback Plan:

Differential Revision: D78598127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158696
Approved by: https://github.com/Skylion007, https://github.com/masnesral
2025-07-21 18:05:09 +00:00
9285b8245c [BE][testing] fix test_cat_max_autotune_triton (#158589)
Summary: This test often fails internally -- it looks like autotuning sometimes chooses not to do the epilogue tuning. Turning off `benchmark_epilogue_fusion` seems to fix it.
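
A minimal sketch of the knob being disabled in a test, assuming it is the top-level Inductor config flag of the same name (the test name here is illustrative):

```python
import torch._inductor.config as inductor_config

@inductor_config.patch(benchmark_epilogue_fusion=False)
def test_cat_max_autotune_triton_stub():
    ...  # body of the real test goes here
```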

Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -- --exact 'caffe2/test/inductor:max_autotune - test_cat_max_autotune_triton (caffe2.test.inductor.test_max_autotune.TestMaxAutotune)' --run-disabled`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158589
Approved by: https://github.com/eellison
2025-07-21 18:02:18 +00:00
637e75433c [BE] always use uv pip if possible in pip_init.py for lintrunner init (#157199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157199
Approved by: https://github.com/ezyang, https://github.com/ZainRizvi
2025-07-21 17:56:05 +00:00
a78fb63dbd [build] pin setuptools>=77 to enable PEP 639 (#158104)
For reference here is the link PEP 639: [peps.python.org/pep-0639](https://peps.python.org/pep-0639/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158104
Approved by: https://github.com/rgommers, https://github.com/Skylion007, https://github.com/atalman
2025-07-21 17:46:40 +00:00
7205458b85 [Easy] Show some clear error when torch.ops.load_library fails. (#157524)
**Background**:

```Shell
torch       2.5.1+cpu
torchvision 0.20.1
```

```Python
import torch
import torchvision

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```

**Cause**:

torchvision's .so file lacks some symbol definitions because those symbols come from CUDA, but the current environment has neither CUDA nor a GPU. The error message above is very confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
2025-07-21 17:32:31 +00:00
35f1b4ad9e Revert "Fused RMSNorm implementation (#153666)"
This reverts commit 15ef4f28df0a14e9f0d55a57a4e2db415a303be7.

Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking tests internally. @albanD can you please help land this change?You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts.  See D78599667 for more info ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3097690935))
2025-07-21 17:31:42 +00:00
cbe1cb7018 [CMake] Move xpu flag to xpu.cmake (#158542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158542
Approved by: https://github.com/gujinghui, https://github.com/ezyang
2025-07-21 17:19:59 +00:00
9894d43b6c [AOTI] explicit aoti wrapper functions for Windows. (#158713)
On Windows, we need explicit declarations for the exported APIs, because the package loader calls these APIs via GetProcAddress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158713
Approved by: https://github.com/desertfire
2025-07-21 15:59:44 +00:00
f168cf49a8 [BE] Always use python 3.9 for pre-push hook's lintrunner (#158693)
A follow up to https://github.com/pytorch/pytorch/pull/158389

Sets up the pre-push lintrunner to always use python 3.9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158693
Approved by: https://github.com/atalman
2025-07-21 15:19:27 +00:00
393377d215 Revert "[CI] update flake8 and mypy lint dependencies (#158720)"
This reverts commit a527e816935957a164d74dd7c5069310b2857695.

Reverted https://github.com/pytorch/pytorch/pull/158720 on behalf of https://github.com/malfet due to This broke lint, see 8e57cdb746/1 ([comment](https://github.com/pytorch/pytorch/pull/158720#issuecomment-3096893256))
2025-07-21 13:58:50 +00:00
8e57cdb746 Still run TritonBundler with BundledAOTAutogradCache, save autotune results (#158048)
When running BundledAOTAutogradCache with precompile, we still need to run Triton bundling so that the precompiled CompiledFxGraph has Triton CUDA kernels. We also pre-save the autotune results in the precompile artifact.

It would be even better to pre-trim the CUDA kernels on save and apply them, which we can work on later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158048
Approved by: https://github.com/zhxchen17
2025-07-21 13:35:46 +00:00
d5a29fc58a De-abstract premature generalization with InductorWrapper (#158528)
See the docblock on InductorWrapper for the distinction. This will matter in a later refactor PR where I will change the signature for one of these but not the other.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158528
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158449
2025-07-21 13:27:07 +00:00
979fae761c Rename modules in AOTAutograd (#158449)
Fixes https://github.com/pytorch/pytorch/issues/158382

```
renamed:    torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py -> torch/_functorch/_aot_autograd/graph_capture.py
renamed:    torch/_functorch/_aot_autograd/traced_function_transforms.py -> torch/_functorch/_aot_autograd/graph_capture_wrappers.py
renamed:    torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py -> torch/_functorch/_aot_autograd/graph_compile.py
```

Everything else is ONLY import changes. I did not rename any functions
even if we probably should have.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158449
Approved by: https://github.com/jamesjwu
2025-07-21 13:27:07 +00:00
1eb6b2089f [Inductor] Set the default value of min_chunk_size to 512 (#150762)
Change the default value of min_chunk_size from 4096 to 512 to allow more for loops to be parallelized.
I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement.
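
A hedged sketch of overriding the knob per run, assuming it is exposed under `torch._inductor.config.cpp` (the exact path is not confirmed here):

```python
import torch._inductor.config as inductor_config

# Revert to the previous default if a particular workload regresses with 512.
inductor_config.cpp.min_chunk_size = 4096
```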

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-07-21 12:46:05 +00:00
bbc32d680f [SymmMem] Add NVSHMEM sync_all support into Triton (#158512)
Adds `sync_all()` function for local store visibility synchronization in NVSHMEM Triton kernels. Provides memory ordering for local operations without remote completion guarantees.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_sync`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158512
Approved by: https://github.com/fduwjj
ghstack dependencies: #158511
2025-07-21 10:27:59 +00:00
a527e81693 [CI] update flake8 and mypy lint dependencies (#158720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158720
Approved by: https://github.com/Skylion007
2025-07-21 09:24:29 +00:00
1c6328a588 [EZ][BE] Fix compilation warning in Pooling.metal (#158729)
This one
```
Compiling /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal to Pooling_30.air
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:172:1: warning: non-void function does not return a value in all control paths [-Wreturn-type]
}
^
1 warning generated.
```
Although functionally one is never supposed to hit this code path, it's still better not to emit the warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158729
Approved by: https://github.com/Skylion007
2025-07-21 04:34:14 +00:00
70b4a8880b [SymmMem] Add NVSHMEM barrier_all, my_pe, n_pes support into Triton (#158511)
Adds device-side barrier synchronization and PE identification functions for NVSHMEM Triton integration. Includes `barrier_all()` for collective synchronization and `my_pe()`/`n_pes()` for PE identification within kernels.

We are launching with cooperative grid launch (for all the PRs in this stack) because the `nvshmemx_collective_launch` function must be used to launch kernels on the GPU when the kernels use NVSHMEM synchronization or collective APIs, and `nvshmemx_collective_launch` essentially boils down to a CUDA cooperative group launch.

Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_barrier`

Also tested that if you remove the barrier, you get an assertion error/race conditions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158511
Approved by: https://github.com/fduwjj
2025-07-21 02:37:33 +00:00
5e1232871b Revert "[build] pin setuptools>=77 to enable PEP 639 (#158104)"
This reverts commit a4ec381302f8acd279033707b182bed30ffd2091.

Reverted https://github.com/pytorch/pytorch/pull/158104 on behalf of https://github.com/malfet due to This break inductor-perf-nighly-macos by failing to build torchvision, see https://github.com/pytorch/pytorch/issues/158728 ([comment](https://github.com/pytorch/pytorch/pull/158104#issuecomment-3095048940))
2025-07-21 02:24:11 +00:00
ff0da08f4b [AOTI] normalize path and process model files. (#158705)
Continuation of https://github.com/pytorch/pytorch/pull/158702: split `zip_filename_str` from the real file path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158705
Approved by: https://github.com/desertfire
2025-07-21 01:08:59 +00:00
2cdafab0bd [BE] Raise ValueError from torch.cat meta func (#158249)
Followup after https://github.com/pytorch/pytorch/pull/155460

From [Python documentation](https://docs.python.org/3/library/exceptions.html#ValueError):
> Raised when an operation or function receives an argument that has the right type but an inappropriate value, and the situation is not described by a more precise exception such as IndexError.

Raise [`TypeError`](https://docs.python.org/3/library/exceptions.html#TypeError) when input-output types are incompatible with each other
> Raised when an operation or function is applied to an object of inappropriate type. The associated value is a string giving details about the type mismatch.

> This exception may be raised by user code to indicate that an attempted operation on an object is not supported, and is not meant to be. If an object is meant to support a given operation but has not yet provided an implementation, [NotImplementedError](https://docs.python.org/3/library/exceptions.html#NotImplementedError) is the proper exception to raise.
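
An illustrative plain-Python sketch of the distinction described above (not torch's code): the right type with a bad value gets `ValueError`, while the wrong type gets `TypeError`.

```python
def cat_like(tensors):
    if not isinstance(tensors, (list, tuple)):
        raise TypeError(f"expected a sequence of tensors, got {type(tensors).__name__}")
    if len(tensors) == 0:
        raise ValueError("expected a non-empty sequence of tensors")
    # ... concatenate ...
```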

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158249
Approved by: https://github.com/jbschlosser, https://github.com/Skylion007, https://github.com/albanD
2025-07-20 23:49:18 +00:00
4b02bd76d3 DCP safetensors test fix (#158685)
https://github.com/pytorch/pytorch/pull/158069 removed the consolidated output path argument without updating the test. Reported by a user here https://github.com/pytorch/pytorch/pull/156705#issuecomment-3090748034.
Adding back the logic from the original PR https://github.com/pytorch/pytorch/pull/158069 and fixing the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158685
Approved by: https://github.com/teja-rao
2025-07-20 22:52:54 +00:00
2e038793ef [inductor][templates] Finalize all registered hooks (#157270)
This refactor ensures all registered template hooks have been finalised before accessing the code object of the template. In `simd.SimdScheduling.codegen_template` the template hooks are finalised manually with `template.finalize_hook(hook_name)` calls, so it is the responsibility of the caller to finalise all the template hooks. This PR adds:
- `RenderPartial.finalize_remaining`, a function that can be called at the end to finalise the remaining active hooks after a selection of hooks have been finalised manually.
- A test with a custom template implementation that registers custom hooks that the scheduler needs to finalise. This test should fail if the scheduler does not finalise the registered custom hook.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157270
Approved by: https://github.com/eellison
2025-07-20 22:07:32 +00:00
5e149a6482 Add deprecation warning (#158203)
Summary: export_for_training exists because we couldn't migrate internal usages of export to the final IR. Now that we have completed the migration, we should deprecate and delete this API.

Test Plan:
CI

Rollback Plan:

Differential Revision: D78240836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158203
Approved by: https://github.com/JacobSzwejbka
2025-07-20 17:02:01 +00:00
badf002014 [Reland] Add warning about removed sm50 and sm60 arches (#158700)
Related to https://github.com/pytorch/pytorch/issues/157517

Detect when users are running a torch build with CUDA 12.8/12.9 on Maxwell or Pascal architectures.
We would like to include a reference to the issue https://github.com/pytorch/pytorch/issues/157517 and ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.

Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
    Found <GPU Name> which is of cuda capability 5.0.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability supported by this library is 7.0.

  warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
                        Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
                        Please see https://github.com/pytorch/pytorch/issues/157517
                        Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```

Please note that I reverted the original PR https://github.com/pytorch/pytorch/pull/158301 because it broke internal users. This is a reland that adds a check for a non-empty torch.cuda.get_arch_list().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158700
Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/eqy
2025-07-20 14:57:46 +00:00
4869f71170 don't set CUDA_MODULE_LOADING (#158712)
If needed, it'll be set in `_C._cuda_init()`. setenv is not threadsafe, so this can cause segfaults due to getenv/setenv races.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158712
Approved by: https://github.com/eqy
2025-07-20 01:36:26 +00:00
b4abf41425 Raise BufferError for DLPack buffer-related errors. (#150691)
This PR addresses the Array API documentation for [`__dlpack__`][1] and
[`from_dlpack`][2] by making some buffer-related errors `BufferError`
instead of `RuntimeError`, e.g. incompatible dtype, strides, or device.

[1]: https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html
[2]: https://data-apis.org/array-api/latest/API_specification/generated/array_api.from_dlpack.html#from-dlpack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150691
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #150216, #150217, #150218
2025-07-20 00:46:21 +00:00
a10f15718d [DLPack] Add support for missing keyword-arguments. (#150218)
This PR introduces the rest of the keyword-arguments added in DLPack
version 2023.12: `dl_device` and `copy`.

In summary, we handle these arguments in the C++ implementation of
`to_dlpack(...)` at _torch/csrc/Module.cpp_, by calling the
`maybeCopyTensor` function at _aten/src/ATen/DLConvertor.cpp_. It also
introduces the following changes:

- Add a new Python API `torchDeviceToDLDevice()`, which is simply a
  refactoring of the `getDLDevice()` function at
  _aten/src/ATen/DLConvertor.cpp_.
- Add both keyword-arguments to the `from_dlpack()` function at
  _torch/utils/dlpack.py_ and to the `Tensor.__dlpack__()` dunder
  method.
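
A hedged usage sketch of the new keyword arguments described above; the tuple form of `dl_device` follows `__dlpack_device__()` and the DLPack spec, and the exact accepted values here are assumptions rather than confirmed behavior.

```python
import torch

x = torch.arange(4)

dev = x.__dlpack_device__()                   # e.g. (kDLCPU, 0) for a CPU tensor
cap = x.__dlpack__(dl_device=dev, copy=True)  # request an explicit copy on that device
y = torch.from_dlpack(cap)
print(y)
```
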
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150218
Approved by: https://github.com/albanD
ghstack dependencies: #150216, #150217
2025-07-20 00:46:20 +00:00
1d526fe78f Fix DLPack stream logic. (#150217)
This PR fixes the logic for dealing with CUDA and ROCm streams whenever
we are trying to create a DLPack capsule from a tensor.

In summary, this PR:

- Uses the legacy default stream if `tensor.__dlpack__(stream=None)` is
  called for a CUDA tensor.
- Errors if `tensor.__dlpack__(stream=2)` is called for a CUDA tensor:
  PyTorch doesn't support the per-thread default stream.
- Errors if `tensor.__dlpack__(stream=stream)`, where `stream` is 1 or
  2, is called for a CUDA tensor using ROCm.

For more details, see [the documentation][1].

[1]: https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html
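
A hedged sketch of the CUDA-side rules listed above (the exact exception type is not specified here, so a broad `except` is used):

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(4, device="cuda")
    cap = x.__dlpack__(stream=None)   # now maps to the legacy default stream
    try:
        x.__dlpack__(stream=2)        # per-thread default stream: rejected
    except Exception as e:
        print(type(e).__name__, e)
```
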
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150217
Approved by: https://github.com/msaroufim, https://github.com/albanD
ghstack dependencies: #150216
2025-07-20 00:46:20 +00:00
b64f338da4 [DLPack] add NumPy exchange tests. (#150216)
This PR resolves an old TODO that requested NumPy DLPack exchange tests
once version 1.22 was required.
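
A small round-trip example of the exchange being tested (requires NumPy >= 1.22 for `np.from_dlpack`):

```python
import numpy as np
import torch

a = np.arange(6, dtype=np.float32)
t = torch.from_dlpack(a)       # NumPy -> PyTorch
b = np.from_dlpack(t)          # PyTorch -> NumPy
print(np.shares_memory(a, b))  # expected True: the round trip is zero-copy
```
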
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150216
Approved by: https://github.com/msaroufim, https://github.com/albanD
2025-07-20 00:46:20 +00:00
a1cfe7f1df [nativert] benchmark util (#158678)
Differential Revision: D78514241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158678
Approved by: https://github.com/SherlockNoMad, https://github.com/georgiaphillips
2025-07-20 00:28:09 +00:00
d36afac83b Build domain libraries for all workflows with TorchBench config (#158601)
They are expensive GPU runners and should not spend time building packages

Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158601
Approved by: https://github.com/ZainRizvi
2025-07-19 21:51:39 +00:00
7cc1a9546c [AOTI] fix extract file failed on Windows. (#158702)
Changes:
1. Rename the zip index name, and keep it out of path normalization.
2. Normalize the output path for file extraction.

Files are extracted successfully:
<img width="683" height="247" alt="image" src="https://github.com/user-attachments/assets/72dff7b9-5ec0-4523-a6ee-7768b37bbe63" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158702
Approved by: https://github.com/angelayi
2025-07-19 08:58:42 +00:00
7cc5d03dfc Document the rest of the specific optimizer module APIs (#158669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158669
Approved by: https://github.com/albanD
ghstack dependencies: #158483
2025-07-19 07:27:15 +00:00
f73594164a [BE] document Adadelta and Adagrad APIs properly (#158483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158483
Approved by: https://github.com/albanD
2025-07-19 07:27:15 +00:00
a9f84021fb [CI] Fixes CI for CUDA Version > 12.9 (#157385)
Compute capabilities up to and including Volta are no longer supported in CUDA versions > 12.9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157385
Approved by: https://github.com/eqy
2025-07-19 06:51:57 +00:00
22d82222c6 GenAI Layer Benchmark (#158536)
This PR adds a GenAI layer benchmark. It compares PyTorch eager, the PyTorch compiler, Liger, and Quack.

It covers all kernels supported by [quack](https://github.com/Dao-AILab/quack?tab=readme-ov-file#kernels-) (CrossEntropy Fwd/Bwd, Softmax Fwd/Bwd, RMSNorm Fwd/Bwd, LayerNorm Fwd) and LayerNormBwd.

## Motivations

- Many OSS users have asked how to properly benchmark torch.compile-generated kernels. One common error is to compile a kernel/layer for one shape (e.g., batch size = 1) and benchmark it with another shape (e.g., batch size = 1024), which leads to bad performance. This provides a simple and clear example of proper benchmarking.
- We recently added a GenAI model benchmark (based on [vLLM](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm)). But it's usually hard to optimize models directly due to their complexity. Layer benchmarks are easier to reason about and optimize.

## Key Settings

- Avoid reusing a kernel specialized for one shape when benchmarking another shape.
```python
torch._dynamo.config.automatic_dynamic_shapes = False
# Needed since changing args to function causes recompiles
torch._dynamo.config.recompile_limit = 1000000
```

- For forward, people may mark batch size as dynamic to avoid runtime recompilation. We respect the setting in this kernel-level benchmark.
```
torch._dynamo.mark_dynamic(x, 0)
```

GPU: H100 (devvm006.dkl0)

Results: [P1874246170](https://www.internalfb.com/phabricator/paste/view/P1874246170)

Note: for numerical accuracy, we use the default tolerance of torch.testing.assert_close (i.e., for `torch.bfloat16`, use rtol `1.6e-2` and atol `1e-5`). It shows numerical issues for some backends and kernels.

Next steps are to add roofline analysis, add this to CI for regression checking, cover more GenAI kernels, and include GenAI layers for common fusion patterns.

<img width="3564" height="2368" alt="CrossEntropyBackward_bench" src="https://github.com/user-attachments/assets/7aa77ad1-83eb-41ea-a27d-50fd5b1dd6be" />
<img width="3564" height="2368" alt="CrossEntropyForward_bench" src="https://github.com/user-attachments/assets/a26ec028-3791-4a41-a12a-05e10f60e9aa" />
<img width="3564" height="2368" alt="LayerNormBackward_bench" src="https://github.com/user-attachments/assets/cc6673ed-c148-4dd2-a729-5f02e717ab3e" />
<img width="3564" height="2368" alt="LayerNormForward_bench" src="https://github.com/user-attachments/assets/f71f9f9d-7b45-4ce7-89d0-e9bce727efae" />
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/e012821a-b7e6-4e83-a24c-c97fa8cd37b5" />
<img width="3564" height="2368" alt="RMSNormForward_bench" src="https://github.com/user-attachments/assets/2d52ee1e-9a8c-4bd1-a180-97b93f07171d" />
<img width="3564" height="2368" alt="SoftmaxBackward_bench" src="https://github.com/user-attachments/assets/02aad056-3ce1-4b40-8cfe-adae81fd017a" />
<img width="3564" height="2368" alt="SoftmaxForward_bench" src="https://github.com/user-attachments/assets/779f6b0d-a102-4164-8300-86fff0329ddf" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158536
Approved by: https://github.com/yf225, https://github.com/eellison
2025-07-19 05:41:01 +00:00
5cde34473c Fix MakeTensor::computeStorageSize() (#158690)
For a tensor with a non-zero storage offset, the offset must be multiplied by the element size.

Add a regression test that creates a Tensor over an array of 6 elements with offset 3, which before the fix crashed with
```
C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15
Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib)
frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib)
frame #3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib)
frame #4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib)
frame #5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib)
frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib)
frame #7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib)
frame #8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib)
frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib)
frame #10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib)
frame #11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib)
frame #12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic)
frame #13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic)
```
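
A minimal sketch of the storage-size arithmetic behind that check (illustrative Python, not the actual ATen code; empty tensors are ignored):
```python
def required_storage_bytes(sizes, strides, storage_offset, itemsize):
    # highest element index touched, relative to the storage offset
    max_index = sum((size - 1) * stride for size, stride in zip(sizes, strides))
    # the offset is in elements, so it must be scaled by itemsize as well
    return (storage_offset + max_index + 1) * itemsize

# sizes [3, 3], strides [0, 1], offset 3, itemsize 4  ->  (3 + 2 + 1) * 4 = 24 bytes
print(required_storage_bytes([3, 3], [0, 1], 3, 4))  # 24
```
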
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158690
Approved by: https://github.com/angelayi
2025-07-19 05:21:33 +00:00
fac0be7b9c [async-TP] Turn asserts back into silent skips (#158572)
https://github.com/pytorch/pytorch/pull/149946 modified some checks that verify whether async-TP is "applicable" to a given collective operation in a graph. Before, the pattern-matching+replacement would just be skipped, but now these are asserts that fail and raise.

This is causing concrete issues in some graphs where 2-dimensional device meshes are being used (e.g., TP + CP) but only one dimension has symm-mem enabled. See #158569.

This PR is turning these asserts back into harmless early-exits. Note that this only needed to be done for reduce-scatters, as it was already the case for all-gathers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158572
Approved by: https://github.com/danielvegamyhre, https://github.com/atalman
2025-07-19 04:54:38 +00:00
64dabb2cf5 only fail regressions>10% on pr_time benchmarks (#158577)
Maintaining the pr_time benchmark test is hard and it often breaks right now, so we are moving to a new framework:
1. Only fail PRs with regressions >10%.
2. Post-land, monitor with the pr_time benchmarks dashboard (oncall) and update expected results frequently or on big changes
(we are supposed to already be doing this: https://www.internalfb.com/unidash/dashboard/pt2_diff_time_metrics).
3. Set up detectors/warnings that trigger on regressions and notify internally post-land:
https://www.internalfb.com/monitoring/detector/1140915271179237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158577
Approved by: https://github.com/xmfan, https://github.com/janeyx99
2025-07-19 04:35:31 +00:00
ab557421a4 [cca] [c10d] Refactor CUDAEventCache into separate files (#158616)
Summary:
Refactored CUDAEventCache from ProcessGroupNCCL.hpp/.cpp into dedicated header and implementation files for better code organization and maintainability.

Split out CUDAEventCache into:
- New header file: CUDAEventCache.hpp
- New implementation file: CUDAEventCache.cpp
- Updated build_variables.bzl to include the new file

This change improves code maintainability, readability, and follows better code organization practices.
---
> Generated by [Confucius Code Assist (CCA)](https://www.internalfb.com/wiki/Confucius/Analect/Shared_Analects/Confucius_Code_Assist_(CCA)/)
[Session](https://www.internalfb.com/confucius?session_id=61b9029a-636b-11f0-9d9a-f1bcc55be1ce&tab=Chat), [Trace](https://www.internalfb.com/confucius?session_id=61b9029a-636b-11f0-9d9a-f1bcc55be1ce&tab=Trace)

Test Plan:
Verified build with:
```
buck build //caffe2/test/distributed:c10d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158616
Approved by: https://github.com/fduwjj
2025-07-19 02:51:28 +00:00
90b082e207 enable_caching_generated_triton_templates=True by default (#158592)
There is some risk, but it is good to catch issues if there are any, and this is a single flag flip that is easy to revert.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158592
Approved by: https://github.com/eellison
2025-07-19 02:19:34 +00:00
a741094159 Build domain libraries on the build job (#158600)
By setting the names of the domain libraries to build via the `BUILD_ADDITIONAL_PACKAGES` environment variable, the build job will build them and make them available as artifacts in the same way as the PyTorch CI wheel. To ensure that this doesn't break CI, the test job will still build them as usual if the wheels are not there. Building dependencies like FBGEMM on the test job is bad, especially for GPU jobs, because it leaves the GPU resource idle.

Fixes https://github.com/pytorch/pytorch/issues/152024

Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158600
Approved by: https://github.com/yangw-dev
ghstack dependencies: #158598, #158599
2025-07-19 02:03:50 +00:00
2955acaed6 Clean up some unused build env variables (#158599)
* The build-with-debug parameter isn't needed; it isn't even passed into Docker. Debug builds are detected via the build environment name
* AWS_DEFAULT_REGION is a leftover from ARC and isn't used anywhere in .ci/pytorch nor .github

Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158599
Approved by: https://github.com/cyyever, https://github.com/ZainRizvi
ghstack dependencies: #158598
2025-07-19 01:59:00 +00:00
2c16eb9f3d [dynamo] Support more basic output types for nonstrict_trace (#157969)
Fixes #157397 and improves the user-facing error message for remaining
unsupported cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157969
Approved by: https://github.com/zou3519
2025-07-19 00:59:54 +00:00
c2c88846a9 Revert "[Easy] Show some clear error when torch.ops.load_library fails. (#157524)"
This reverts commit 555f3562541992b66a550eca8e8740884b1247f8.

Reverted https://github.com/pytorch/pytorch/pull/157524 on behalf of https://github.com/wdvr due to reverting for now to reopen the discussion ([comment](https://github.com/pytorch/pytorch/pull/157524#issuecomment-3091317252))
2025-07-19 00:45:31 +00:00
5b40f6581e Revert "Add warning about removed sm50 and sm60 arches (#158301)"
This reverts commit fb731fe371cb1b5bf95de84b19c213590526acb2.

Reverted https://github.com/pytorch/pytorch/pull/158301 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158301#issuecomment-3091307023))
2025-07-19 00:32:04 +00:00
d42c409767 [AOTI] windows package load dev (#158671)
Changes:
1. Add a handler for file-extraction failures during Windows development.
2. Normalize more file paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158671
Approved by: https://github.com/angelayi
2025-07-19 00:06:40 +00:00
a3aacd6cb2 [DTensor] fix copy_ strategy (#158538)
The previous strategy directly used 'self' input strategy for 'src'
input.  The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.

E.g. for this program, broadcasting will occur on dims 0,1,3 of self.

```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```

These are the correct sharding combinations:

|   self   |     src |
|-------|------|
| Shard(0)  |   Replicate() |
| Shard(1)  |   Replicate() |
| Shard(2)  |   Shard(0) |
| Shard(3)  |   Shard(1) |
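
A minimal sketch of the trailing-dim alignment implied by the table above (an illustration only, not the actual DTensor strategy code):
```python
# Map each dim of self to the src dim it aligns with under broadcasting
# (trailing dims are aligned; leading self dims have no src counterpart).
def map_self_dim_to_src_dim(self_ndim, src_ndim):
    offset = self_ndim - src_ndim
    return {d: d - offset for d in range(self_ndim) if d - offset >= 0}

# self is 4-D (2,3,4,5), src is 2-D (4,1):
# self dims 0,1 have no src dim (src stays Replicate there);
# self dim 2 -> src dim 0 and self dim 3 -> src dim 1, matching the table above.
print(map_self_dim_to_src_dim(4, 2))  # {2: 0, 3: 1}
```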

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158490
2025-07-18 23:44:43 +00:00
36bddcd18c [DTensor] Fix default_strategy and rename for clarity (#158490)
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
  by mixing input_specs that were frozen from
  select_strategy.strategies[0] with output_specs that varied across
  select_strategy.strategies[0..N] (e.g. we could create a nonsense
  strategy like input:Shard(0) output:Replicate() for an op like clone)
- fixes the redistribute costs: they should not actually be 0, they
  should be the cost of redistributing our single input from another
  strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
  input strategy instead of creating this new object
- Currently, using default_strategy is incorrect because it maps 'self'
  tensor's strategies directly onto 'src' tensor without accounting for
  the fact that copy_ supports broadcasting a smaller rank tensor into a
  larger one.

Separates out copy_  op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR

Renames to `propagate_single_input_strategy` since that's more
descriptive

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2025-07-18 23:44:42 +00:00
15ef4f28df Fused RMSNorm implementation (#153666)
Relevant #72643

Benchmarked versus the unfused torch implementation and a torch.compile implementation: around a 9x speedup vs the unfused implementation on CUDA, and slightly faster than the inductor-compiled version on a 5090.
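
For reference (not part of the original PR text), the standard RMSNorm being fused computes the following, where $g$ is the learnable per-feature scale and $d$ is the size of the normalized dimension; note that the eager reference below adds `eps` to the RMS outside the square root, a minor variant of the usual definition:
```math
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^2 + \varepsilon}}\, g_i
```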

```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)

    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]

    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```

Here are the triton compile tests ran on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)

device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i,device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/albanD
2025-07-18 23:24:21 +00:00
60b9b06a53 [caffe2] Fix Missing override in get_buffer of NCCLSymmetricMemory (#158597)
Summary:
Fix the error that occurs in the devarm environment when compiling with Clang:
```
caffe2/torch/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu:97:20: error: 'get_buffer' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
97 | virtual at::Tensor get_buffer(int
| ^
caffe2/torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp:56:20: note: overridden virtual function is here
56 | virtual at::Tensor get_buffer(int rank, c10::IntArrayRef sizes, c10::ScalarType dtype, int64_t storage_offset) = 0;
| ^
1 error generated.
```

Test Plan:
See D78520305

Rollback Plan:

Differential Revision: D78517953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158597
Approved by: https://github.com/janeyx99
2025-07-18 23:12:29 +00:00
a835dbc096 [c10d][ez] Fix error message to reflect the correct API name (#158668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158668
Approved by: https://github.com/VieEeEw
2025-07-18 23:10:47 +00:00
f76f4abf3f Track monitor (#156907)
Track GPU memory allocation. Previously we were tracking GPU memory bandwidth; memory allocation is the metric that reflects whether the GPU is OOM or not. A UI fix is upcoming.

UI fix: https://github.com/pytorch/test-infra/pull/6878/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156907
Approved by: https://github.com/huydhn
2025-07-18 22:54:13 +00:00
be483a5481 setup pinned commit for vllm in pytorch ci (#158591)
Set up pinned commit for vllm in nightly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158591
Approved by: https://github.com/seemethere, https://github.com/huydhn
2025-07-18 22:30:20 +00:00
bc7b1f5252 [AOTI] Use libstdc++ only for fbcode cpu case (#158659)
Differential Revision: D78567218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158659
Approved by: https://github.com/kflu, https://github.com/zoranzhao
2025-07-18 22:27:10 +00:00
07c4c2a792 [dynamo][be] hide warnings without invalidating warnings cache (#158520)
I feel uneasy about touching `__warningregistry__` since it is an undocumented, private surface. The only public API hook that doesn't increment the warnings version seems to be https://docs.python.org/3/library/warnings.html#warnings.showwarning.

So we can whack-a-mole all the warning muters in compile to just not display warnings, without invalidating the warnings cache. This PR adds that for torch/_dynamo; I didn't find any warnings-version mutation in torch/_inductor.
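
For illustration, a minimal sketch of muting warnings via the `showwarning` hook mentioned above (a simplification, not the actual torch/_dynamo code):
```python
import warnings

# Swap out only the display hook; filters and __warningregistry__ (and hence the
# warnings "version") are left untouched.
_original_showwarning = warnings.showwarning

def _muted_showwarning(message, category, filename, lineno, file=None, line=None):
    pass  # drop the warning instead of printing it

warnings.showwarning = _muted_showwarning
try:
    warnings.warn("silently dropped while the hook is installed")
finally:
    warnings.showwarning = _original_showwarning
```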

There is a behavior change if someone calls a compiled graph with simplefilter("error"):
```python
# e.g. test/dynamo_expected_failures/TestAutogradFallback.test_no_autograd_kernel_inplace_mode_nothing
with warnings.catch_warnings():
    warnings.simplefilter("error")  # turns all warnings into errors
    compiled_fn()  # will throw if any of the muted warnings fire
```

FIXES https://github.com/pytorch/pytorch/issues/128427

A note for the future: the warnings module doesn't offer a thread-safe way of using it. Even regular filters have this problem; directly editing `__warningregistry__` would be very bad, and this PR mutes warnings for all threads. Someone will need to build a thread-safe warnings interface.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158520
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-07-18 22:02:31 +00:00
89850bbc07 [Dynamo] Use proper sources for constructing dataclass defaults (#157993)
Partially fixes https://github.com/pytorch/pytorch/issues/154009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157993
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2025-07-18 21:51:40 +00:00
3bb729df97 Revert "Fix test consolidate hf safetensors (#157386)"
This reverts commit fa1c20ae9285f7994a73d2d06025065f96b67a57.

Reverted https://github.com/pytorch/pytorch/pull/157386 on behalf of https://github.com/jithunnair-amd due to Need to revert this so we can revert PR 156705, which introduced errors on ROCm CI. These errors were not seen on CUDA CI because CUDA CI docker images do not have safetensors installed and the test silently passes ([comment](https://github.com/pytorch/pytorch/pull/157386#issuecomment-3090706074))
2025-07-18 21:00:12 +00:00
e3351b3ddf Revert "[DCP][HF] [ez]Change where sharded tensors are saved (#158069)"
This reverts commit 627ba411366bcc15019c49756d3f22fd3914bd50.

Reverted https://github.com/pytorch/pytorch/pull/158069 on behalf of https://github.com/jithunnair-amd due to Didn't remove reference to `consolidated_output_path` in test_hf_safetensor_e2e.py; CUDA runs do not surface issue because safetensors is not installed and the test silently passes ([comment](https://github.com/pytorch/pytorch/pull/158069#issuecomment-3090692336))
2025-07-18 20:54:19 +00:00
1ab1ab38a0 Use linux.12xlarge.memory to build for H100/sm_90 (#158598)
Use a bigger runner here because CUDA_ARCH 9.0 is only built for H100 or newer GPUs, so it doesn't benefit much from existing compiler cache from trunk. Also use a memory-intensive runner here because memory is usually the bottleneck

Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158598
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2025-07-18 20:31:56 +00:00
8b2a650572 pt2_remote_cache: Log sample for failures, and log the explicit reason we're faling. (#156874)
Summary: This allows us to start alerting on cache failures, based on scuba data

Test Plan:
Added new tests explicitly for the Remote Cache API.

Note that we have existing tests for memcache, but not for manifold AFAICT.

There are two potential wrinkles. One: we're adding a new field (and everything uses ScubaData AFAICT, so this should just work).

The other one is the implicit api contract that if the sample is None, then it will be ignored (and not crash). I believe the second one is implemented correctly (and tested). The first one is a little more nebulous, but I think won't cause any breakages.

Also manually ran a compile and made sure it didn't break - P1851504490 as well as forcing it to break and checking we didn't screw up the exception handling - P1851504243

Rollback Plan:

Differential Revision: D77054339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156874
Approved by: https://github.com/oulgen, https://github.com/masnesral
2025-07-18 20:28:27 +00:00
ec0b538961 [inductor] Make times and repeat parameters command line args (#158590)
Summary: Small change to make the `times` and `repeat` variables controllable as command line args.

Test Plan:
Execute:
```
buck2 run <run params> <path>:inductor_benchmark -- --times=1 --repeat=1
```
Only runs once, and without passing the args it runs with default values of 10.

Rollback Plan:

Reviewed By: malfet

Differential Revision: D78458680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158590
Approved by: https://github.com/FindHao, https://github.com/malfet
2025-07-18 20:07:55 +00:00
599f94e7b9 [AOTI] add Windows file ext to package loader. (#158578)
Add `object` and `extension` file type for Windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158578
Approved by: https://github.com/angelayi
2025-07-18 19:57:12 +00:00
04ac258cf6 [BE][testing] Fix test_cudacodecache.py (#158259)
Summary: According to internal test failures, looks like we're missing a check for cuda: https://fburl.com/testinfra/eznzkyha

Test Plan: `buck test`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158259
Approved by: https://github.com/exclamaforte, https://github.com/BoyuanFeng
2025-07-18 19:56:13 +00:00
1b5fdb23b9 [BE] Add pre-push hook for lintrunner to the PyTorch repo (#158389)
Adds a pre-commit hook (technically a pre-push hook) to the PyTorch repo.
**This is currently an opt-in feature**, which one can opt into by running `python scripts/setup_hooks.py` locally.

### Features
- **Run Lintrunner Before Push**: Before every `git push`, automatically runs lintrunner on your changes.
  - Really need to skip the checks? Run `git push --no-verify`
- **Consistent, Isolated Lintrunner Environment**: During pre-push, Lintrunner runs in its own virtual environment that contains all lintrunner dependencies in a consistent, isolated environment.  No more lintrunner failures because you created a new .venv. (Did you know you needed to run `lintrunner init` every time you make a new .venv?)
- **Dependencies Automatically Updated**: If .lintrunner.toml is updated, this will automatically re-run `lintrunner init` to ensure you install the latest dependencies specified

### Installation
- Run `python scripts/setup_hooks.py`. Now every `git push` will first run lintrunner.

### Additional details
- The lintrunner used by the pre-push hook runs in a special per-repo virtual environment managed by the commit-hook tool located under `$USER/.cache/pre-commit`
- Does not affect your regularly used lintrunner
  - Manual invocations of lintrunner will continue to depend on your local environment instead of the special pre-push one. If there's enough interest, we could explore consolidating them.
- Does not run `lintrunner -a` for you.
  - You still need to manually run that (can be changed later though!)
- Have staged/unstaged changes? No worries
  - This runs `git stash` before running the pre-commit hooks and pops back your changes afterwards, so only the changes actually being pushed will be tested

### Downsides
- No streaming UI updates
  - While you still get the same output from lintrunner that you're used to, the commit-hook framework doesn't show any output while lintrunner is actually running. Instead, it shows the entire output after linter has completed execution, which could be a few minutes (especially if it has to run `lintrunner init` first)
- `uv` installation is required to run the setup script. The setup script will ask users to install uv if it's not available.
  - This is required to be able to install the pre-commit package in a safe way that's available no matter what .venv you are running in.

### Opting out
- Disable hook for a single push: Run `git push --no-verify`
- Disable hook permanently: If something goes wrong and you need to wipe your setup:
  - Delete the `$USER/.cache/pre-commit` folder and the `.git/hooks/pre-push` file in your local repo.
  - You can now rerun `python scripts/setup_hooks.py` to setup your git push hook again if you want.

### Potential Future Changes
Things that could be done to make this even better if folks like these ideas:
- Automatic setup
  - Our `CONTRIBUTING.md` file tells devs to run `make setup-env`.  That could be a good entry point to hook the installation into
- Fix the console output streaming
- Make every lintrunner invocation (including manual ones) use the same repo-specific venv that the commit-hook uses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158389
Approved by: https://github.com/seemethere
2025-07-18 19:55:35 +00:00
75e2628782 Add lower bounds for fsspec and networkx dependencies (#158565)
Fixes #156587

This sets lower bounds for fsspec and networkx in both setup.py and requirements.txt.

- fsspec>=0.8.5 (released December 15, 2020)
- networkx>=2.5.1 (released April 3, 2021)

These are the first stable versions released after Python 3.9 came out on October 5, 2020. Since Python 3.8 is no longer maintained, setting these minimums helps ensure PyTorch won't be installed alongside unexpectedly old versions of these packages.

Tested with these versions locally to make sure they don't break anything. Adding CI for lower-bound testing could be a follow up later if need.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158565
Approved by: https://github.com/janeyx99
2025-07-18 19:42:09 +00:00
79e49efadd Pull latest Sphinx theme (#158595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158595
Approved by: https://github.com/albanD
2025-07-18 18:46:47 +00:00
b87e50db5e [BE][testing] Fix internal test failures in test/dynamo/test_unspec (#158485)
Summary: These tests fail internally because the number of underlying calls to the RNG differs by virtue of various library initializations that get pulled in with an internal build.

Test Plan:
```
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_random_object' --run-disabled
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_random_values_with_graph_break' --run-disabled
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_feed_random_values_into_graph_only' --run-disabled
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158485
Approved by: https://github.com/williamwen42
2025-07-18 18:41:03 +00:00
656885b614 [Dynamo][Better Engineering] Type devices, resume_execution and testing utils (#158593)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to a set of utilities in dynamo, `device_interface.py`, `resume_execution.py`, `tensor_version_ops.py`, `test_case.py`, and `test_minifier_common.py`

Running
```
mypy torch/_dynamo/device_interface.py torch/_dynamo/resume_execution.py torch/_dynamo/tensor_version_op.py torch/_dynamo/test_case.py torch/_dynamo/test_minifier_common.py  --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  976 | 1672 | 58.37% | 76 | 112 | 67.86% |
| This PR | 1719 | 1719 | 100.00% | 112 | 112 | 100.00% |
| Delta    | +743 | +47 | +41.63% | +36 | 0 | +32.14% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158593
Approved by: https://github.com/mlazos
2025-07-18 18:22:06 +00:00
6e07d6a0ff [Dynamo][Better Engineering] Add typing support for _dynamo/repro and debug_utils (#158504)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to an important set of utilities in dynamo, `repro/` and the base `debug_utils.py`

Running
```
mypy torch/_dynamo/repro/ torch/_dynamo/debug_utils.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  905 | 3268 | 27.69% | 22 | 81 | 27.16% |
| This PR | 3368 | 3368 | 100.00% | 81 | 81 | 100.00% |
| Delta    | +2463 | +100 | +72.31% | +59 | 0 | +72.84% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158504
Approved by: https://github.com/mlazos
2025-07-18 18:15:55 +00:00
b4358c5e87 [inductor] Explicitly link c10 in inductor. (#158622)
MSVC reports "unresolved external symbol" errors when compiling inductor. Explicitly link c10 in inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158622
Approved by: https://github.com/desertfire

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-07-18 18:00:50 +00:00
86675af3f0 Revert "[ROCm][CI] update fbgemm_gpu hash used by inductor tests (#158602)"
This reverts commit 9308261a2afb69d807ea06508bb8582b066d9ccd.

Reverted https://github.com/pytorch/pytorch/pull/158602 on behalf of https://github.com/ZainRizvi due to The lint job failure was hiding a real lint failure. See here for more details: [GH job link](https://github.com/pytorch/pytorch/actions/runs/16375911199/job/46275682191) [HUD commit link](6f73e06796) ([comment](https://github.com/pytorch/pytorch/pull/158602#issuecomment-3090209891))
2025-07-18 17:46:11 +00:00
725cdb218e Name threads in caffe2/torch/distributed/checkpoint AsyncCheckpointExecutor (#158612)
Differential Revision: D78493333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158612
Approved by: https://github.com/d4l3k
2025-07-18 17:33:12 +00:00
8c3f84908b [aot] fix greater_than_max build fail on Windows. (#158479)
Error snapshot:
<img width="937" height="110" alt="image" src="https://github.com/user-attachments/assets/10195f84-83c4-42db-af3c-76f875a6a983" />

Reason:
`std::numeric_limits<T>::max` conflicts with the `max(a, b)` macro from windef.h. (A common workaround, presumably what the fix below applies, is to parenthesize the call as `(std::numeric_limits<T>::max)()` or to define `NOMINMAX`.)

Fix code:
<img width="488" height="269" alt="image" src="https://github.com/user-attachments/assets/3328c37b-7c89-435e-944c-4ca7c9b6c5b6" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158479
Approved by: https://github.com/desertfire
2025-07-18 17:18:10 +00:00
6f73e06796 [iter] exhaust ListIterator when unpack_var_sequence is called (#156370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156370
Approved by: https://github.com/zou3519
ghstack dependencies: #156369
2025-07-18 16:48:27 +00:00
acffd1a297 [iter] Update some of the tests to not call pickle (#156369)
Some tests in test_iter only fail because of pickle. I'm skipping the pickle section as Dynamo doesn't support it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156369
Approved by: https://github.com/zou3519
2025-07-18 16:48:27 +00:00
bf4aa78279 Revert "[DTensor] Fix default_strategy and rename for clarity (#158490)"
This reverts commit d8b084312b54e97bdbaf6a178fe2fc628a23243b.

Reverted https://github.com/pytorch/pytorch/pull/158490 on behalf of https://github.com/clee2000 due to broke lint? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16361950974/job/46231492581) [HUD commit link](d8b084312b) ([comment](https://github.com/pytorch/pytorch/pull/158490#issuecomment-3090042448))
2025-07-18 16:45:32 +00:00
50f33a6fca Revert "[DTensor] fix copy_ strategy (#158538)"
This reverts commit 7b05bdd925f0f4b49e68662f9761fabaa27f2faf.

Reverted https://github.com/pytorch/pytorch/pull/158538 on behalf of https://github.com/clee2000 due to broke lint? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16361950974/job/46231492581) [HUD commit link](d8b084312b) ([comment](https://github.com/pytorch/pytorch/pull/158490#issuecomment-3090042448))
2025-07-18 16:45:32 +00:00
35df895d05 [AOTI] package loader normalize path separator (#158630)
Add `normalize_path_separator` to simplify Windows path handling.

This solution is working well on `torch/_inductor/cpp_builder.py`: a00cd8cf25/torch/_inductor/cpp_builder.py (L406-L409)

Let's copy it to package loader.
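
A minimal sketch of that helper (modeled on the linked cpp_builder.py version; simplified, not a verbatim copy):
```python
import sys

_IS_WINDOWS = sys.platform == "win32"

def normalize_path_separator(orig_path: str) -> str:
    # Windows accepts forward slashes, so normalizing avoids escaping/quoting issues
    if _IS_WINDOWS:
        return orig_path.replace("\\", "/")
    return orig_path
```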

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158630
Approved by: https://github.com/angelayi
2025-07-18 15:55:24 +00:00
193b29ee0c [BE][EZ] Minor doc fixes (#158574)
[BE] Minor doc fixes
2025-07-18 10:34:55 -05:00
036eb1f65d [precompile] Filter out ID_MATCH family of guards with caching_precompile. (#158368)
Summary: For cases like caching_precompile, we almost always want to drop ID_MATCH-type guards since they will block serialization. This diff adds this behavior when the global flag is toggled on, so that ID_MATCH guards are excluded from compilation and serialization.

Test Plan:
test_dynamo -- -k test_id_match_with_config

Rollback Plan:

Differential Revision: D78363609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158368
Approved by: https://github.com/jamesjwu
2025-07-18 14:47:11 +00:00
e882c761dd Add STD_TORCH_CHECK to headeronly (#158377)
Differential Revision: [D78366519](https://our.internmc.facebook.com/intern/diff/D78366519/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158377
Approved by: https://github.com/albanD
2025-07-18 14:35:20 +00:00
0eae6b68f4 Unify torch.tensor and torch.ops.aten.scalar_tensor behavior (#158537)
Fixes #158376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158537
Approved by: https://github.com/atalman
2025-07-18 14:05:52 +00:00
a4ec381302 [build] pin setuptools>=77 to enable PEP 639 (#158104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158104
Approved by: https://github.com/rgommers, https://github.com/Skylion007, https://github.com/atalman
2025-07-18 11:49:54 +00:00
27af877f84 [ATen][CUDA][SDPA] Flash Attention: Refactor sm version checks (#158558)
The architecture version checks are unnecessarily fine-grained in PyTorch. Considering that PyTorch's Flash Attention works on all `sm_80+` machines, it makes more sense to just check against the lower bound.
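
A minimal sketch of the lower-bound check (illustrative Python only, not the actual ATen C++ guard):
```python
def flash_attention_arch_supported(major: int, minor: int) -> bool:
    # Flash Attention works on all sm_80+ devices, so a lower-bound check is enough,
    # rather than enumerating sm_80, sm_86, sm_89, sm_90, ...
    return (major, minor) >= (8, 0)
```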

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158558
Approved by: https://github.com/eqy
2025-07-18 09:59:41 +00:00
7b05bdd925 [DTensor] fix copy_ strategy (#158538)
The previous strategy directly used 'self' input strategy for 'src'
input.  The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.

E.g. for this program, broadcasting will occur on dims 0,1,3 of self.

```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```

These are the correct sharding combinations:

|   self   |     src |
|-------|------|
| Shard(0)  |   Replicate() |
| Shard(1)  |   Replicate() |
| Shard(2)  |   Shard(0) |
| Shard(3)  |   Shard(1) |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158495, #158490
2025-07-18 09:59:37 +00:00
ead80f3202 Fix s390x CI: ensure that all python dependencies are installed when building pytorch for tests on s390x (#158552)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158552
Approved by: https://github.com/huydhn
2025-07-18 09:13:41 +00:00
32aade9d8d Revert "Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)"
This reverts commit 39ac189808c61588f3594dbc2fc1d69bb6194c47.

Reverted https://github.com/pytorch/pytorch/pull/158037 on behalf of https://github.com/jithunnair-amd due to Ignored ROCm failures while ROCm was unstable, but HUD clearly shows this PR introduced failures on trunk ([comment](https://github.com/pytorch/pytorch/pull/158037#issuecomment-3087982975))
2025-07-18 07:47:46 +00:00
be896d6b41 Revert "Forward-fix unused variables warning/error (#158549)"
This reverts commit eeda1a75ace75ce8a6763050fb91d236a6d3287b.

Reverted https://github.com/pytorch/pytorch/pull/158549 on behalf of https://github.com/jithunnair-amd due to Sorry, need to revert this first, so we can revert PR 158037, which broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/158549#issuecomment-3087942475))
2025-07-18 07:44:14 +00:00
a3396a9b85 [hop] set capture_scalar_outputs=True by default for compiled hops (#158480)
We want to do it for two reasons:
1. It's tedious for users to manually turn on capture_scalar_outputs=True when compiling map and scan with inductor, where we decompose them into while_loop and use the idx tensor.item() to select a slice of the output buffer and write into it. This PR turns the flag on by default (see the sketch below).
2. A graph break caused by capture_scalar_outputs=False would cause the hop to fail, so we should turn it on by default so that the error message is more meaningful.
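
For context, a minimal sketch of what previously had to be set by hand (this is the existing dynamo config flag):
```python
import torch._dynamo

# Previously users compiling map/scan through inductor had to set this manually;
# the PR described above flips it on by default for compiled HOPs.
torch._dynamo.config.capture_scalar_outputs = True
```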

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158480
Approved by: https://github.com/zou3519
2025-07-18 07:16:50 +00:00
fda3f3b2ec [while_loop] fix constant tensor used as carried inputs (#158381)
Address the second part of #158366, where torch.tensor(0) is treated as a constant tensor and its .item() gets specialized to 0, which causes a silent specialization. The fix is to unspecialize the constant carries and make them non-constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158381
Approved by: https://github.com/zou3519
2025-07-18 07:08:11 +00:00
a00cd8cf25 Add a way to disable compile for debugging flex-attention (#158534)
Finally got around to doing this; this flag lets us do:

```Python

#!/usr/bin/env python3
"""
FlexAttention Debug: Using breakpoints and unwrap
"""

import torch
import torch.nn.attention.flex_attention as fa

unwrap = torch._C._functorch.get_unwrapped

def score_mod(score, batch, head, q_idx, kv_idx):
    # Set breakpoint here to debug
    breakpoint()

    # In debugger, unwrap to see actual tensor values:
    # >>> actual_score = unwrap(unwrap(unwrap(unwrap(score))))
    # >>> actual_batch = unwrap(batch)
    # >>> actual_head = unwrap(head)
    # >>> actual_q_idx = unwrap(q_idx)
    # >>> actual_kv_idx = unwrap(kv_idx)
    # >>> print(actual_score)
    # >>> print(f"q_idx: {actual_q_idx}, kv_idx: {actual_kv_idx}")

    return torch.where(q_idx >= kv_idx, score, torch.tensor(float('-inf')))

def main():
    # Enable debug mode
    fa._FLEX_ATTENTION_DISABLE_COMPILE_DEBUG = True

    # Small example
    B, H, S, D = 1, 2, 4, 8
    q = torch.randn(B, H, S, D)
    k = torch.randn(B, H, S, D)
    v = torch.randn(B, H, S, D)

    # Run - will hit breakpoint
    output = fa.flex_attention(q, k, v, score_mod=score_mod)

    # Disable debug mode
    fa._FLEX_ATTENTION_DISABLE_COMPILE_DEBUG = False

if __name__ == "__main__":
    main()

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158534
Approved by: https://github.com/Chillee, https://github.com/zou3519
2025-07-18 05:33:45 +00:00
eb73650723 [BE] Make PyObjectSlot use a global PyInterpreter and remove (#158427)
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global PyInterpreter, since there is only one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.

For the reviewer: sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable number of changes there. Fortunately, the only actual change in the file is the following, which just removes `getPyInterpreter()` from the `check_pyobj` call.

```
 mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
-    // fast case: tensor is live in python
-    std::optional<PyObject*> mb_obj =
-        t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
-    if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
-        return *mb_obj;
-    }
-    return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+  // fast case: tensor is live in python
+  std::optional<PyObject*> mb_obj =
+      t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+          /*ignore_hermetic_tls=*/false);
+  if (mb_obj.has_value() &&
+      !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+    return *mb_obj;
+  }
+  return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
2025-07-18 05:23:00 +00:00
9308261a2a [ROCm][CI] update fbgemm_gpu hash used by inductor tests (#158602)
fbgemm_gpu build started failing with asmjit errors.  Moving to latest tip of fbgemm for inductor tests resolves the build failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158602
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-18 05:02:31 +00:00
9a7c2f1f64 Revert "Add torch compile force disable caches alias (#158072)"
This reverts commit 2ecf083b7247f265a03ec296ba9d7b795f035118.

Reverted https://github.com/pytorch/pytorch/pull/158072 on behalf of https://github.com/jeffdaily due to fails on rocm, signal ignored while rocm was unstable ([comment](https://github.com/pytorch/pytorch/pull/158072#issuecomment-3086740829))
2025-07-18 04:58:24 +00:00
d8b084312b [DTensor] Fix default_strategy and rename for clarity (#158490)
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
  by mixing input_specs that were frozen from
  select_strategy.strategies[0] with output_specs that varied across
  select_strategy.strategies[0..N] (e.g. we could create a nonsense
  strategy like input:Shard(0) output:Replicate() for an op like clone)
- fixes the redistribute costs: they should not actually be 0, they
  should be the cost of redistributing our single input from another
  strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
  input strategy instead of creating this new object
- Currently, using default_strategy is incorrect because it maps 'self'
  tensor's strategies directly onto 'src' tensor without accounting for
  the fact that copy_ supports broadcasting a smaller rank tensor into a
  larger one.

Separates out copy_  op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR

Renames to `propagate_single_input_strategy` since that's more
descriptive

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #158495
2025-07-18 04:09:32 +00:00
1e86fa2e5b Add stack trace to Inductor IR nodes if inductor.config.trace.provenance_tracing=True (#158576)
Summary:
- Split `create_mapping` to `create_mapping_pre_post_grad_nodes` and  ` create_node_mapping_kernel_to_post_grad`
- Store a mapping from pre_grad graph node names to stack traces in `_inductor_pre_grad_node_stack_trace`
- Add `stack_traces` member to ir.Node and add it to the string representation of ir.Node
- When we create an IR node, if `inductor.config.trace.provenance_tracing=True`, we populate `stack_traces` from `origins`. The nodes in `origins` are post-grad graph nodes. If a node has `node.stack_trace`, we store that stack trace directly; this is particularly important for backward graph nodes because they don't have a mapping to pre-grad graph nodes. If a node doesn't have `.stack_trace` (such as `linear` -> `addmm` nodes), we use the stack traces of the pre-grad graph nodes that it maps to (see the sketch after this list).
  - A post-grad graph node might not have a stack trace if it corresponds to multiple pre-grad graph nodes, e.g. [GroupLinearFusion](a00442421a/torch/_inductor/fx_passes/group_batch_fusion.py (L299))
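
A minimal sketch of the fallback logic in the bullet above (illustrative only; names like `pre_grad_nodes` are hypothetical, not actual torch/_inductor attributes):
```python
def collect_stack_traces(origins, pre_grad_node_stack_trace):
    # origins: post-grad FX nodes attached to the IR node
    # pre_grad_node_stack_trace: dict of pre-grad node name -> stack trace
    traces = []
    for node in origins:
        if getattr(node, "stack_trace", None):
            traces.append(node.stack_trace)  # e.g. backward nodes carry their own trace
        else:
            # fall back to the pre-grad node(s) this post-grad node maps to (if any)
            for name in getattr(node, "pre_grad_nodes", []):
                if name in pre_grad_node_stack_trace:
                    traces.append(pre_grad_node_stack_trace[name])
    return traces
```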

Example:

```
scheduling ExternKernelOut(
  python_kernel_name='extern_kernels.mm',
  name=buf0,
  layout=FixedLayout('cuda:0', torch.float32, size=[8, 16], stride=[16, 1]),
  inputs=[InputBuffer(name='arg2_1', layout=FixedLayout('cuda:0', torch.float32, size=[8, 10], stride=[10, 1])), ReinterpretView(
    StorageBox(
      ConstantBuffer(name='fc1_weight', layout=FixedLayout('cuda:0', torch.float32, size=[16, 10], stride=[10, 1]))
    ),
    FixedLayout('cuda:0', torch.float32, size=[10, 16], stride=[1, 10]),
    origins=OrderedSet([mm_default_1]),
    stack_traces = {,
    File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
        x = self.fc1(x),
      File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
        return F.linear(input, self.weight, self.bias),
    }
  )],
  constant_args=(),
  kwargs={},
  output_view=None,
  python_kernel_name=extern_kernels.mm,
  cpp_kernel_name=at::mm_out,
  ordered_kwargs_for_cpp_kernel=(),
  op_overload=None,
  arg_properties=[{}, {}],
  allarg_properties={},
  kwarg_properties=None,
  unbacked_bindings={},
  mutation_outputs=[],
  origin_node=mm_default_1,
  origins=OrderedSet([mm_default_1]),
  stack_traces = {,
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
      x = self.fc1(x),
    File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
      return F.linear(input, self.weight, self.bias),
  }
)
```

Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```

Rollback Plan:

Differential Revision: D78365534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158576
Approved by: https://github.com/angelayi
2025-07-18 04:05:17 +00:00
86dbc0ef67 [NativeRT] Remove makeProxyExecutor from ModelRunner interface (#158587)
Summary: makeProxyExecutor shouldn't be exposed to ModelRunner Interface.

Test Plan:
CI

Rollback Plan:

Differential Revision: D78501011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158587
Approved by: https://github.com/yiming0416, https://github.com/henryoier
2025-07-18 03:20:40 +00:00
89d842fec5 Make torch.distributed.breakpoint() set a long timeout (#158481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158481
Approved by: https://github.com/d4l3k
ghstack dependencies: #158469
2025-07-18 02:18:43 +00:00
ce4554352b Shunt fx_interpreter graphmodule print on error into tlparse (#158469)
Include both the error stacktrace and the graphmodule in a new
structured trace artifact.  Log the shortened version to the console,
and also log a hint to look at the tlparse for more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158469
Approved by: https://github.com/ezyang
2025-07-18 02:18:43 +00:00
583138d170 [Dynamo][Better Engineering] Add typing for comptime, cache, and convert_frame (#158379)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to a critical tracing point for dynamo, primarily for`comptime.py` but also `cache_size.py` and `convert_frame.py`.

Running
```
mypy torch/_dynamo/comptime.py torch/_dynamo/cache_size.py torch/_dynamo/convert_frame.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  1837 | 2215 | 82.93% | 45 | 82 | 54.88% |
| This PR | 2230 | 2230 | 100.00% | 82 | 82 | 100.00% |
| Delta    | +393 | +15 | +17.07% | +37 | 0 | +45.12% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158379
Approved by: https://github.com/mlazos
2025-07-18 02:11:57 +00:00
6fd6fc418d [B200] Fix flex-attention heuristic for test_tma_with_customer_kernel_options_cuda (#158494)
Otherwise fails with
```
torch._inductor.exc.InductorError: RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_tem_fused__to_copy_ones_sort_sum_zeros_2 Required: 264224 Hardware limit: 232448 Reducing block sizes or `num_stages` may help.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158494
Approved by: https://github.com/drisspg
2025-07-18 02:03:49 +00:00
ddbecdfb66 [DTensor] Document redistribute_costs (#158495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158495
Approved by: https://github.com/zpcore, https://github.com/XilunWu
2025-07-18 01:43:38 +00:00
ef38edb284 Add stride check for attn_mask on non-cpu device (#158424)
Fixes #158374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158424
Approved by: https://github.com/Valentine233, https://github.com/drisspg, https://github.com/atalman
2025-07-18 01:10:58 +00:00
6673ac746c Fix test linalg for MKL upgrading (#158312)
Fixes #158054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158312
Approved by: https://github.com/albanD
2025-07-18 01:08:33 +00:00
7b72e5b3ad Fix Pandas version mismatch upon reinstalling numpy (#158584)
If you reinstall numpy after having installed pandas, it will error out sometimes if the versions are different enough (see below snippet). This change forces pandas to be reinstalled when installing numpy. It doesn't work in a separate pip call, because then pip takes the version of numpy requested by pandas as the one to install, undoing the command in the first place.
```
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip list
Package            Version
------------------ -----------
attrs              25.3.0
build              1.2.2.post1
certifi            2025.7.14
charset-normalizer 3.4.2
cmake              4.0.3
exceptiongroup     1.3.0
expecttest         0.3.0
filelock           3.18.0
fsspec             2025.5.1
hypothesis         6.135.32
idna               3.10
importlib_metadata 8.7.0
Jinja2             3.1.6
lintrunner         0.12.7
MarkupSafe         2.1.5
mpmath             1.3.0
networkx           3.2.1
ninja              1.11.1.4
opt-einsum         3.3.0
optree             0.16.0
packaging          25.0
pip                25.1
psutil             7.0.0
pyproject_hooks    1.2.0
python-dateutil    2.9.0.post0
pytz               2025.2
PyYAML             6.0.2
requests           2.32.4
setuptools         78.1.1
six                1.17.0
sortedcontainers   2.4.0
sympy              1.14.0
tomli              2.2.1
typing_extensions  4.14.0
tzdata             2025.2
urllib3            2.5.0
uv                 0.7.21
wheel              0.45.1
zipp               3.23.0
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install numpy==1.22.4
Collecting numpy==1.22.4
  Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)
Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Installing collected packages: numpy
Successfully installed numpy-1.22.4
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install pandas==2.0.3
Collecting pandas==2.0.3
  Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: tzdata>=2022.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: numpy>=1.20.3 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (1.22.4)
Requirement already satisfied: six>=1.5 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas==2.0.3) (1.17.0)
Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
Installing collected packages: pandas
Successfully installed pandas-2.0.3
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install --pre numpy==2.0.2
Collecting numpy==2.0.2
  Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.4
    Uninstalling numpy-1.22.4:
      Successfully uninstalled numpy-1.22.4
Successfully installed numpy-2.0.2
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ python
Python 3.9.23 (main, Jun  5 2025, 13:40:20)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import is_numpy_dev as _is_numpy_dev  # pyright: ignore # noqa:F401
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/__init__.py", line 25, in <module>
    from pandas.compat.numpy import (
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/numpy/__init__.py", line 4, in <module>
    from pandas.util.version import Version
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/__init__.py", line 2, in <module>
    from pandas.util._decorators import (  # noqa:F401
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/_decorators.py", line 14, in <module>
    from pandas._libs.properties import cache_readonly
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/_libs/__init__.py", line 13, in <module>
    from pandas._libs.interval import Interval
  File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
```
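For reference, a minimal sketch of the single-call idea described above (the real change lives in the CI requirements scripts; the helper name here is illustrative):

```python
# A minimal sketch (not the exact CI change): reinstall numpy and pandas in a
# single pip invocation so the resolver sees both pins at once, instead of
# letting a later pandas install drag numpy back to an older version.
import subprocess
import sys

def reinstall_numpy_and_pandas(numpy_spec: str, pandas_spec: str) -> None:
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", numpy_spec, pandas_spec]
    )

if __name__ == "__main__":
    reinstall_numpy_and_pandas("numpy==2.0.2", "pandas==2.0.3")
```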
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158584
Approved by: https://github.com/huydhn
2025-07-18 00:14:16 +00:00
33c9b414aa [CI][MPS] Enable test_indexing on MPS (#158582)
- Skip `test_index_put_accumulate_large_tensor_mps` as it crashes with
```
/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
```
while running `torch.ones([2**31+5], dtype=torch.int8, device='mps')`

- Adjust types for `test_index_put_src_datatype` as index_put on MPS is not implemented for complex (yet)
- Adjust `test_index` to avoid using DoubleTensors for MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158582
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/manuelcandales
2025-07-17 23:33:52 +00:00
b0e325c2c8 [Dynamo][Better Engineering] Add type coverage to decorators (#158509)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to an important file in dynamo, `decorators.py`

NOTE: The remaining untyped functions are due to a conflict with `__init__.py` in compiler, so we can't type them at this time

Running
```
mypy torch/_dynamo/decorators.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  209 | 908 | 23.02% | 9 | 39 | 23.08% |
| This PR | 870 | 943 | 100.00% | 36 | 39 | 100.00% |
| Delta    | +661 | +35 | +76.98% | +27 | 0 | +76.92% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158509
Approved by: https://github.com/williamwen42
2025-07-17 23:31:26 +00:00
f63988ae00 [BE]Clean up old APIs in AOTI c shim (#158400)
Summary:
The shims for aten ops are now generated by torchgen. But there are some still old APIs in `aoti_torch/c/shim.h`

This diff moves the old to-be-deprecated APIs for aten ops to a separate header file `shim_deprecated.h`

The to-be-deprecated APIs are determined by comparing APIs in `shim.h` and ops in `fallback_ops.py`

Test Plan:
CI

Rollback Plan:

Differential Revision: D78378373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158400
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-07-17 23:24:50 +00:00
2df2e3bb51 [ROCm][CI] Last known good HIP patch (#158596)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158596
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-17 22:52:16 +00:00
0ecfb93a0b Avoid globally modifying torch.testing._internal.common_methods_invocations.wrapper_set_seed (#158548)
Test modules that depend on the original definition of `wrapper_set_seed` are inadvertently affected if they import from test_torchinductor_opinfo.py. Additionally, running `pytest test_torchinductor_opinfo.py test_other_module.py` in the same process may change the behaviour of `test_other_module.py` if its tests depend on `wrapper_set_seed`.
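A minimal sketch of one way to scope such an override (assumed pattern using `mock.patch.object`; not the actual patch):

```python
# A minimal sketch (assumed pattern, not the exact patch) of overriding
# wrapper_set_seed only for the duration of a test run, instead of mutating
# the shared attribute on common_methods_invocations globally.
from unittest import mock

import torch
import torch.testing._internal.common_methods_invocations as cmi

def seeded_wrapper(op, *args, **kwargs):
    torch.manual_seed(42)  # fixed seed for reproducible opinfo samples
    return op(*args, **kwargs)

with mock.patch.object(cmi, "wrapper_set_seed", seeded_wrapper):
    pass  # run the opinfo-based tests that rely on the overridden seeding here
```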

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158548
Approved by: https://github.com/janeyx99
2025-07-17 22:31:59 +00:00
74f4cf4bd5 Add missing <vector> in c10/util/WaitCounter.h (#158354)
It seems that `#include <vector>` is being pulled in indirectly, but it is being used directly, so it is best to explicitly include it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158354
Approved by: https://github.com/janeyx99
2025-07-17 22:23:05 +00:00
cyy
1b91954b9f Suppress volatile type error (#158435)
Fixes
```
/var/lib/jenkins/workspace/torch/csrc/dynamo/guards.cpp:5320:10:
error: compound assignment to object of volatile-qualified type 'volatile char' is deprecated [-Werror,-Wdeprecated-volatile]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158435
Approved by: https://github.com/janeyx99
2025-07-17 22:21:04 +00:00
41b2c4d119 Reduce random reads for offset metadata when calling torch.load under FakeTensorMode (#157931)
We already test the `_get_offset` functionality via the TORCH_SERIALIZATION_DEBUG flag that is set in CI, so I didn't add more testing specifically for FakeTensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157931
Approved by: https://github.com/albanD
2025-07-17 22:17:52 +00:00
af6624023e [dynamo] Skip training flag check if already guarding on nn modules (#158492)
This might help some legacy models that still have
inline_inbuilt_nn_modules False for some reason.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158492
Approved by: https://github.com/StrongerXi
2025-07-17 21:42:19 +00:00
a00442421a [CI][TD] Enable TD on all test configs (#158163)
I think the main one that was missing is dynamo_wrapped

There's also slow and inductor, but the filter later for workflows stops TD from running on those anyways

dynamo_wrapped is the second longest job for pull right now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158163
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2025-07-17 21:05:25 +00:00
ced5cf042d Revert "Cleanup old caffe2 scripts (#158475)"
This reverts commit 94d7f0c1ef9a4cb4db0eb5d6b1ffc55941cbeab1.

Reverted https://github.com/pytorch/pytorch/pull/158475 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158475#issuecomment-3085447409))
2025-07-17 20:58:34 +00:00
1b88da1cac [MPS] Improve performance of max_pool3d (#157875)
To check how the changes from this PR affect performance, I wrote a script here: 55ef32a127/max_pool_mps/perf.py.

Before this PR, I get this:

```
===================
max_pool3d
===================
0: 0.013105 ms, max_pool3d, (3, 2, 2, 2), {'kernel_size': 2}
1: 0.038003 ms, max_pool3d, (3, 10, 10, 10), {'kernel_size': 5}
2: 0.212963 ms, max_pool3d, (3, 100, 100, 100), {'kernel_size': 5}
3: 1.224645 ms, max_pool3d, (3, 200, 200, 200), {'kernel_size': 5}
4: 7.317867 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 4, 'padding': 1}
5: 34.679233 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20}
6: 34.626383 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20, 'dilation': 1}
7: 44.835892 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20, 'dilation': 1, 'stride': 40}
8: 0.083579 ms, max_pool3d, (10, 10, 10, 10, 10), {'kernel_size': 2}
9: 0.936575 ms, max_pool3d, (10, 10, 30, 30, 30), {'kernel_size': 2}
10: 5.329883 ms, max_pool3d, (10, 10, 50, 50, 50), {'kernel_size': 2}
11: 11.713617 ms, max_pool3d, (10, 10, 70, 70, 70), {'kernel_size': 2}
12: 25.450454 ms, max_pool3d, (10, 10, 90, 90, 90), {'kernel_size': 2}
13: 0.058375 ms, max_pool3d, (10, 10, 10, 10, 10), {'kernel_size': 2, 'dilation': 2}
14: 3.757558 ms, max_pool3d, (10, 10, 50, 50, 50), {'kernel_size': 2, 'dilation': 2}
15: 33.451588 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 2, 'dilation': 2}
```

After this PR, I get this:

```
===================
max_pool3d
===================
0: 0.007202 ms, max_pool3d, (3, 2, 2, 2), {'kernel_size': 2}
1: 0.018596 ms, max_pool3d, (3, 10, 10, 10), {'kernel_size': 5}
2: 0.130717 ms, max_pool3d, (3, 100, 100, 100), {'kernel_size': 5}
3: 0.966795 ms, max_pool3d, (3, 200, 200, 200), {'kernel_size': 5}
4: 4.095804 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 4, 'padding': 1}
5: 12.833446 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20}
6: 12.859346 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20, 'dilation': 1}
7: 14.080529 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 50, 'padding': 20, 'dilation': 1, 'stride': 40}
8: 0.029283 ms, max_pool3d, (10, 10, 10, 10, 10), {'kernel_size': 2}
9: 0.175700 ms, max_pool3d, (10, 10, 30, 30, 30), {'kernel_size': 2}
10: 0.742750 ms, max_pool3d, (10, 10, 50, 50, 50), {'kernel_size': 2}
11: 1.939596 ms, max_pool3d, (10, 10, 70, 70, 70), {'kernel_size': 2}
12: 4.074821 ms, max_pool3d, (10, 10, 90, 90, 90), {'kernel_size': 2}
13: 0.028425 ms, max_pool3d, (10, 10, 10, 10, 10), {'kernel_size': 2, 'dilation': 2}
14: 0.384375 ms, max_pool3d, (10, 10, 50, 50, 50), {'kernel_size': 2, 'dilation': 2}
15: 2.623346 ms, max_pool3d, (10, 10, 100, 100, 100), {'kernel_size': 2, 'dilation': 2}
```

Every case is improved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157875
Approved by: https://github.com/malfet
2025-07-17 20:34:12 +00:00
66c9bc5062 [export] Add runnable code to export docs (#158506)
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/158506/export.html

Yay I can add runnable code to export docs now
Also moved export API reference to a different file.

With these changes, we can start to consolidate the [export tutorial](https://docs.pytorch.org/tutorials/intermediate/torch_export_tutorial.html) with the docs on pytorch docs. We just need to move the section on DDE and 0/1 specialization, and then I think we can delete the export tutorial.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158506
Approved by: https://github.com/pianpwk, https://github.com/svekars
2025-07-17 20:15:22 +00:00
80ac73c057 [ca] reset between tests (#158418)
CA reset is much faster than dynamo reset, so it's probably okay to run it every time. I'm not sure if this will fix the flaky autograd tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158418
Approved by: https://github.com/jansel
2025-07-17 20:14:29 +00:00
eeb0783fe6 [simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062)
Differential Revision: [D78159013](https://our.internmc.facebook.com/intern/diff/D78159013)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158062
Approved by: https://github.com/wconstab
2025-07-17 20:04:42 +00:00
ef256ad17b Make Inductor imports TYPE_CHECKING only (#158524)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158524
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-07-17 19:55:19 +00:00
fd51bcdd21 check if USE_ROCM is defined (#158571)
Summary:
check if USE_ROCM is defined

D78424375 broke some builds: see T231304402

Test Plan:
rerunning failed builds

Rollback Plan:

Reviewed By: Camyll

Differential Revision: D78493019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158571
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-07-17 19:48:26 +00:00
7ebbf2cae7 Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)" (#158550)
This reverts commit 8554c8007ddaa8029e7e01bb1af12f358bf597c2 #157563 due to causing a few breakages on ROCm

Reverted expected_results.csv to 26807dcf277feb2d99ab88d7b6da526488baea93

> @xuanzhang816 Sorry, but I have to revert this PR yet again because it clearly reintroduced failures on ROCm after the remerge: f4d8bc46c7/2
and the failures are still showing up on tip-of-tree on HUD

Context
https://github.com/pytorch/pytorch/pull/157563#issuecomment-3083350857

Needs to be relanded in non bc-breaking way, or sanity checked for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158550
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-07-17 19:47:41 +00:00
8dcebaa7b0 [AOTI] add WIN32 implement for create_temp_dir (#158570)
Add a Windows implementation of `create_temp_dir`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158570
Approved by: https://github.com/angelayi
2025-07-17 19:22:59 +00:00
7e34f9c292 Add torch._C._log_api_usage_once to datapipes (mapper) (#155489)
This is to get a better understanding of how datapipes is used right now.
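For illustration, a minimal sketch of where such a logging call typically goes (the pipe class name and log key below are hypothetical):

```python
# Illustrative sketch (hypothetical pipe name and log key) of adding an
# API-usage log call once, in the datapipe constructor.
import torch
from torch.utils.data import IterDataPipe

class LoggedMapperPipe(IterDataPipe):
    def __init__(self, source_datapipe, fn):
        torch._C._log_api_usage_once("python.data_pipes.map")  # hypothetical key
        self.source_datapipe = source_datapipe
        self.fn = fn

    def __iter__(self):
        for item in self.source_datapipe:
            yield self.fn(item)
```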

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155489
Approved by: https://github.com/ramanishsingh
2025-07-17 19:01:49 +00:00
25f4d7e482 Use new type statement to fix public API of types (#158487)
Since the `type` statement breaks older Python versions, this tries to find equivalent behavior without the `type` statement mechanics.
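A small illustration of the compatibility concern (the alias below is purely illustrative, not the actual public API in question):

```python
# Sketch of the issue: the PEP 695 `type` statement requires Python 3.12+,
# while a TypeAlias annotation also works on older interpreters (3.10+).
from typing import TypeAlias, Union

# Python 3.12+ only:
#     type IntOrStr = Union[int, str]

# Equivalent spelling that older Python versions accept:
IntOrStr: TypeAlias = Union[int, str]
```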
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158487
Approved by: https://github.com/andrewor14
2025-07-17 18:46:44 +00:00
ad223a6c5f Add FP8 Types (#158430)
Summary: Add FP8 Types

Test Plan:
sandcastle

Rollback Plan:

Differential Revision: D78395110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158430
Approved by: https://github.com/henryoier
2025-07-17 18:09:56 +00:00
f92a2035e4 ci: Update lint workflow to only run on changed files for PRs (#158518)
This modifies the lint workflow to use the new get-changed-files
workflow to optimize lint execution by only running on files
that have actually changed in pull requests.

This more closely mirrors the type of behavior that users
expect when running lint locally on their PRs.

This also leaves the default behavior as a fallback for when
you're not running on a pull request.

Since lint runs on the pull_request event I'm not really worried about
any type of ciflow shenanigans in this.

This also splits mypy into its own job since mypy needs to run on all-files all the time.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158518
Approved by: https://github.com/huydhn
ghstack dependencies: #158517
2025-07-17 18:00:44 +00:00
bff69f25c2 [BE][testing] fix test/dynamo/test_repros:test_longtensor_list (#158458)
Summary: This test is failing internally because the number of underlying calls to the rng differs by virtue of various library initializations that get sucked in with an internal build.

Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_longtensor_list' --run-disabled`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158458
Approved by: https://github.com/jansel
2025-07-17 17:27:00 +00:00
6d31d38965 recovering node source from dict (#158373) (#158473)
Summary:

this diff recovers a NodeSource object from its dict representation, which is crucial for NodeSource serde.

Test Plan:
ci

Rollback Plan:

Differential Revision: D78434648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158473
Approved by: https://github.com/angelayi
2025-07-17 17:00:19 +00:00
bfe5674e22 Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 0797b2b6a80cf70a7accc3d5413186e7693d4451.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/wdvr due to reverting as discussed with @drisspg - @eqy please reach out to @drisspg for more info  ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-3084759671))
2025-07-17 16:55:55 +00:00
94d7f0c1ef Cleanup old caffe2 scripts (#158475)
Testing on this one is grep based: if I could find no reference to a script, I deleted it.
We can easily add any of these back if needed!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158475
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/cyyever
2025-07-17 16:50:06 +00:00
23550ab735 Revert "DDE-Free select with unbacked index. (#157605)"
This reverts commit 79d7c754ab8ae0e5c3a614521632d2cfbfa0fdba.

Reverted https://github.com/pytorch/pytorch/pull/157605 on behalf of https://github.com/laithsakka due to fail pr time benchmarks  ([comment](https://github.com/pytorch/pytorch/pull/157605#issuecomment-3084663020))
2025-07-17 16:20:02 +00:00
16b21fa8b2 [AOTI] skip ld and objcopy on Windows. (#158545)
Skip `ld` and `objcopy` on Windows; they are not supported there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158545
Approved by: https://github.com/desertfire
2025-07-17 15:43:24 +00:00
2ecf083b72 Add torch compile force disable caches alias (#158072)
A bunch of people keep thinking the current alias only disables the Inductor cache because it has "inductor" in its name. Let's globalize the name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072
Approved by: https://github.com/ezyang
2025-07-17 15:40:36 +00:00
813c76b98d Revert "Unify torch.tensor and torch.ops.aten.scalar_tensor behavior (#158537)"
This reverts commit 58c7cf9ede6311da5533dbcaf238a912176a6a85.

Reverted https://github.com/pytorch/pytorch/pull/158537 on behalf of https://github.com/albanD due to This broke C++ tests ([comment](https://github.com/pytorch/pytorch/pull/158537#issuecomment-3084425920))
2025-07-17 15:06:43 +00:00
288bf54a23 Revert "Move off of deprecated API in 2.9 (#158527)"
This reverts commit 9636e2cfd3e995ef977f670ad47e8e895296d992.

Reverted https://github.com/pytorch/pytorch/pull/158527 on behalf of https://github.com/albanD due to breaks trunk ([comment](https://github.com/pytorch/pytorch/pull/158527#issuecomment-3084385585))
2025-07-17 14:55:28 +00:00
da4c7b4ced [AOTI] align signature to model_base.h (#158554)
Remove the `const` keyword and align the signature to `model_base.h` (eeda1a75ac/torch/csrc/inductor/aoti_runtime/model_base.h, L51-L53)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158554
Approved by: https://github.com/desertfire
2025-07-17 14:44:32 +00:00
a04bd11895 [AOTI] Use format_consts_to_cpp on Windows. (#158543)
`format_consts_to_asm` is not supported on Windows, so force the use of `format_consts_to_cpp` there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158543
Approved by: https://github.com/desertfire
2025-07-17 14:40:34 +00:00
58c7cf9ede Unify torch.tensor and torch.ops.aten.scalar_tensor behavior (#158537)
Fixes #158376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158537
Approved by: https://github.com/atalman
2025-07-17 13:39:25 +00:00
38c04415a9 [oss][hf][bug fix] Remove buggy consolidation logic (#158380)
Summary: I tried to add some logic to handle the non-row-wise sharded case more efficiently, but it has some bugs, so I am removing it for now. I will find a better algorithm for the non-row-wise sharded case that determines the maximum number of bytes we can write at a time.

Test Plan:
ensure tests pass

Rollback Plan:

Differential Revision: D78366701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158380
Approved by: https://github.com/Saiteja64
2025-07-17 13:05:06 +00:00
7892f5a007 [inductor][triton] Update HAS_WARP_SPEC to check triton.Config params. Update Triton Hash to top of release/3.4.x stack (#158459)
Update triton commit hash to `11ec6354315768a85da41032535e3b7b99c5f706`, which is the new release/3.4.x branch in triton-lang/triton.

Also, update HAS_WARP_SPEC handling: In triton 3.4, warp spec will have a different interface: num_consumer_groups will be determined automatically by the compiler. This breaks the current Inductor integration, so for now, update HAS_WARP_SPEC to check whether triton.Config takes num_consumer_groups and num_buffers_warp_spec as parameters.
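A minimal sketch of the described check (assumed shape of the Inductor-side logic, not the exact code):

```python
# Warp spec is considered available only if triton.Config still accepts the
# two constructor parameters the current Inductor integration relies on.
import inspect

def has_warp_spec() -> bool:
    try:
        import triton
    except ImportError:
        return False
    params = inspect.signature(triton.Config.__init__).parameters
    return "num_consumer_groups" in params and "num_buffers_warp_spec" in params
```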

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158459
Approved by: https://github.com/atalman
2025-07-17 12:50:46 +00:00
d5af0eca8d [BE][3/5] fix typos in aten/ (aten/src/ATen/native/) (#157552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157552
Approved by: https://github.com/albanD
ghstack dependencies: #156605, #157637, #157550, #157551
2025-07-17 12:08:34 +00:00
f57ef62ebc [BE][2/5] fix typos in aten/ (aten/src/ATen/native/) (#157551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157551
Approved by: https://github.com/albanD
ghstack dependencies: #156605, #157637, #157550
2025-07-17 12:08:33 +00:00
4c8b408d16 [BE][1/5] fix typos in aten/ (#157550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157550
Approved by: https://github.com/albanD
ghstack dependencies: #156605, #157637
2025-07-17 12:08:33 +00:00
c8d43cbc6e [BE][3/6] fix typos in test/ (#157637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157637
Approved by: https://github.com/yewentao256, https://github.com/albanD
ghstack dependencies: #156605
2025-07-17 12:08:33 +00:00
3f8e2e91ad [BE][15/16] fix typos in torch/ (torch/distributed/tensor/) (#156605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156605
Approved by: https://github.com/wanchaol, https://github.com/albanD
2025-07-17 12:08:33 +00:00
eeda1a75ac Forward-fix unused variables warning/error (#158549)
Introduced in https://github.com/pytorch/pytorch/pull/158037, didn't seem to trigger on PR, but trunk CI is failing in some `linux-jammy-cpu-py3.12-gcc11-inductor-*` jobs where this warning is turned into an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158549
Approved by: https://github.com/danthe3rd
2025-07-17 09:44:19 +00:00
f4d8bc46c7 Enable TF32 as fp32 internal precision for matmul/linear/conv (#157520)
### Description

This PR is to enable TF32 as fp32 internal precision for matmul/linear/conv in `mkldnn backend`. Since we have refined fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend the API to support TF32 for `mkldnn backend`.

```
torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
```

The related kernel updates and UT updates are done. The wrapper `bf32_on_and_off` is updated to `reduced_f32_on_and_off`, and it can run tests 3 times: one with reduced_f32 OFF, and the other two with reduced_f32 ON (`bf32` ON and `tf32` ON).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-07-17 08:57:34 +00:00
39ac189808 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them; the hardest part is allowing the lhs and rhs operands to have different scaling types, as that changes the whole call stack.

The scaling format is still detected from the sizes of the scale tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-17 08:26:27 +00:00
d76323d417 [NativeRT] Remove normalizeDevice (#158489)
Summary:
In PyTorch, tensor.to("cuda") behaves differently from tensor.to("cuda:0").

tensor.to("cuda") will read from the thread-local DeviceGuard, aka cuda::current_device(), to infer the device index.

TBEPermute relies on this behavior to route the output tensor to the device specified by the current thread.

For this reason, we remove normalizeDevice() and disallow index-less cuda devices in Placement.

Device-to-device mapping must be done between concrete devices!
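An illustration of the behavior described above (requires a multi-GPU machine):

```python
# .to("cuda") resolves the device index from the current thread-local device,
# while .to("cuda:0") pins it explicitly.
import torch

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    x = torch.randn(2)
    with torch.cuda.device(1):
        print(x.to("cuda").device)    # cuda:1 -- follows the current device
        print(x.to("cuda:0").device)  # cuda:0 -- explicit index always wins
```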

Test Plan:
CI

Rollback Plan:

Differential Revision: D78443109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158489
Approved by: https://github.com/henryoier
2025-07-17 06:48:25 +00:00
04349f9ee5 [PT2]: Skip AOTI Weight Loading during Init (#158416)
Summary: AOTI already has the weights embedded in the .so file, so for the initial load there is no need to load the weights again. This allows lowered modules to have different sets of weights on different hardware.

Test Plan:
```
MODEL_TYPE=ads_mtml_offsite_cvr_oba_optout_dedicated_model
MODEL_ENTITY_ID=895279202
SNAPSHOT_ID=0
MODULE=merge

buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true fbcode//caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE} --moduleName ${MODULE} --predictor-hardware-type 1 --submodToDevice ""  --benchmarkDontRebatchSamples=true --benchmarkNumIterations 1000
```

Rollback Plan:

Differential Revision: D78383881

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158416
Approved by: https://github.com/henryoier, https://github.com/SherlockNoMad
2025-07-17 06:47:47 +00:00
09db3a22e8 [BE] Get rid of final mentions of BUILD_SPLIT_CUDA (#158453)
BUILD_SPLIT_CUDA logic has been removed for a while

Differential Revision: [D78418191](https://our.internmc.facebook.com/intern/diff/D78418191/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158453
Approved by: https://github.com/albanD
ghstack dependencies: #158358, #158365
2025-07-17 06:47:10 +00:00
a38f433be2 [Docker builds] Move from Miniconda to Miniforge (#158370)
This is related to: https://www.anaconda.com/legal/terms/terms-of-service

Trying to fix outage with docker builds.
https://github.com/pytorch/pytorch/actions/runs/16298993712/job/46033590799

Rocm and XPU builds since they use Miniforge are not affected

```
#22 ERROR: process "/bin/sh -c bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt" did not complete successfully: exit code: 1
------
 > [base 14/42] RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt:
11.93 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
11.93     • https://repo.anaconda.com/pkgs/main
11.93     • https://repo.anaconda.com/pkgs/r
11.93
11.93 To accept a channel's Terms of Service, run the following and replace `CHANNEL` with the channel name/URL:
11.93     ‣ conda tos accept --override-channels --channel CHANNEL
```
Hence the solution is either:
1. run `` conda tos accept --override-channels --channel defaults``, or
2. use Miniforge instead of Miniconda.

Using solution 2.

Solutions tried that don't work:
1. Using ``CONDA_ALWAYS_YES=true``

2. Using an older version of Miniconda
```
[Miniconda3-py310_25.5.1-0-Linux-x86_64.sh](https://repo.anaconda.com/miniconda/Miniconda3-py310_25.5.1-0-Linux-x86_64.sh)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158370
Approved by: https://github.com/seemethere

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-07-17 06:33:08 +00:00
9f37cce693 Revert "[Docker builds] Move from Miniconda to Miniforge (#158370)"
This reverts commit 0a99b026d6bd0f67dc2c0a20fe3228ddc4144854.

Reverted https://github.com/pytorch/pytorch/pull/158370 on behalf of https://github.com/laithsakka due to this fail pr time benchmarks ([comment](https://github.com/pytorch/pytorch/pull/158370#issuecomment-3082744071))
2025-07-17 06:28:49 +00:00
9636e2cfd3 Move off of deprecated API in 2.9 (#158527)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158527
Approved by: https://github.com/danielvegamyhre
2025-07-17 06:18:13 +00:00
d9426a81d2 [BE] Modify PyObjectSlot to assume only a single interpreter is in use (#158407)
This PR makes some less risky changes to PyObjectSlot, as there is a lot of stuff we do not need since there is only one interpreter. Specifically, `check_interpreter` and `has_pyobj_nonhermetic` are removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290, #158291
2025-07-17 05:56:26 +00:00
0b9fb91f17 [BE] Remove __reduce_deploy__ (#158291)
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).

Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
2025-07-17 05:56:26 +00:00
a6de309ca1 [BE] Remove torch deploy | remove torch deploy specific files (#158290)
This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290
Approved by: https://github.com/albanD
ghstack dependencies: #158288
2025-07-17 05:56:18 +00:00
1a4268b811 [BE] remove torch deploy - conditionals (#158288)
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always

Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
2025-07-17 05:56:07 +00:00
79d7c754ab DDE-Free select with unbacked index. (#157605)
When select has a data-dependent index, we can't tell whether the actual index should be index + size or index.
To avoid throwing a DDE, we allocate a new unbacked symbol to represent the storage offset of the
output view, and we compute its value dynamically at runtime when Inductor lowers it.
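A small illustration of the data-dependent case (eager shown; under compile/export `.item()` produces an unbacked SymInt):

```python
# The select index comes from tensor data, so whether the storage offset is
# index*stride or (index+size)*stride cannot be decided at trace time.
import torch

def f(x, idx):
    i = idx.item()         # becomes an unbacked SymInt under compile/export
    return x.select(0, i)  # offset depends on the (unknown) sign of i

x = torch.arange(12.0).reshape(4, 3)
print(f(x, torch.tensor(-1)))  # eager: selects the last row
```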

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
2025-07-17 05:08:11 +00:00
415dfabe9b [Easy] Fix the format (#158450)
When I modified the code located in test/cpp_extensions/open_registration_extension/torch_openreg/torch_openreg,
some unrelated format errors occurred.

```Python
Lint for torch/_inductor/fx_passes/fuse_attention.py:

  Error (CODESPELL) spelling error
    Failed due to ValueError:
    /pytorch/pytorch/torch/_inductor/fx_passes/fuse_attention.py:587: differnt
    ==> different

    Please either fix the error or add the word(s) to the dictionary file.
    HINT: all-lowercase words in the dictionary can cover all case variations.

Lint for torch/fx/traceback.py:

  Error (MYPY) [assignment]
    Incompatible types in assignment (expression has type "str", variable has
    type "None")

        101  |
        102  |    def _get_action_string(self):
        103  |        if self._action_string is None:
        104  |            self._action_string = "+".join([a.name.lower() for a in self.action])
        105  |        return self._action_string
        106  |
        107  |    def print_readable(self, indent=0):

  Error (MYPY) [assignment]
    Incompatible types in assignment (expression has type "dict[str, Any]",
    variable has type "None")

        121  |        if self._dict is None:
        122  |            # Convert the object to a dictionary
        123  |            action_string = self._get_action_string()
        124  |            self._dict = {
        125  |                "name": self.name,
        126  |                "target": self.target,
        127  |                "graph_id": self.graph_id,

  Error (MYPY) [return-value]
    Incompatible return value type (got "None", expected "dict[Any, Any]")

        130  |                "from_node": [node.to_dict() for node in self.from_node],
        131  |            }
        132  |
        133  |        return self._dict
        134  |
        135  |    def __eq__(self, other: object):
        136  |        if not isinstance(other, NodeSource):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158450
Approved by: https://github.com/Skylion007
2025-07-17 04:56:10 +00:00
8eaa9f2701 Fix mask construction when dispatching index_put to masked_fill (#158472)
Fixes #158413
Previously trailing Nones in the index were incorrectly handled as implicit broadcasting dims in the mask, whereas they should just be ignored.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158472
Approved by: https://github.com/ezyang
2025-07-17 04:21:43 +00:00
ebf83b8b77 [audio hash update] update the pinned audio hash (#158402)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158402
Approved by: https://github.com/pytorchbot
2025-07-17 04:19:06 +00:00
24b49b9881 [Fix] Rework CUDA error explanation framework to be less destructive … (#158484)
…in fbsource

Fix-forward for #158395

Added `std::string c10::cuda::get_cuda_error_help(const char* error_string)` to provide a framework for appending clarifying messages to CUDA errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158484
Approved by: https://github.com/aorenste
2025-07-17 03:36:47 +00:00
1839e8d04b [DTensor] Assert DTensorSpec has valid placements (#158133)
This helped identify buggy sharding rules during debugging, why not
check it in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158133
Approved by: https://github.com/XilunWu, https://github.com/zpcore
ghstack dependencies: #158132
2025-07-17 02:32:26 +00:00
2ad5c25cfc Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following APIs will be put under torch.accelerator; a short usage sketch follows the list.
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats
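
A usage sketch of the new namespace, using only the APIs listed above (assumes an accelerator backend such as CUDA or XPU is available):

```python
import torch

if torch.accelerator.is_available():
    dev = torch.accelerator.current_accelerator()
    x = torch.randn(1024, 1024, device=dev)
    print(torch.accelerator.memory_allocated())
    del x
    torch.accelerator.empty_cache()
    print(torch.accelerator.max_memory_allocated())
```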

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-07-17 01:56:01 +00:00
1179e33323 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with the RFC [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace Accelerate, which contains [many if-else conditional code paths](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code). We would like to introduce a generic API set under the torch.accelerator namespace to generalize these use cases.

| Device-specific memory APIs `torch.xxx.foo` | Device-agnostic memory APIs `torch.accelerator.foo` |
| --- | --- |
| `torch.xxx.empty_cache` | `torch.accelerator.empty_cache` |
| `torch.xxx.reset_peak_memory_stats` | `torch.accelerator.reset_peak_memory_stats` |
| `torch.xxx.reset_accumulated_memory_stats` | `torch.accelerator.reset_accumulated_memory_stats` |
| `torch.xxx.memory_stats` | `torch.accelerator.memory_stats` |
| `torch.xxx.memory_allocated` | `torch.accelerator.memory_allocated` |
| `torch.xxx.max_memory_allocated` | `torch.accelerator.max_memory_allocated` |
| `torch.xxx.memory_reserved` | `torch.accelerator.memory_reserved` |
| `torch.xxx.max_memory_reserved` | `torch.accelerator.max_memory_reserved` |

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD, https://github.com/Camyll
2025-07-17 01:56:01 +00:00
f6d138807f Always disable ShardingPropagation cache if compiling (#156868)
Fixes #151106

Addresses issue (2) in #152963 for the DTensor sharding propagation cache being brittle under compile. The existing `_are_we_tracing` from `distributed._functional_collectives`, which mostly determines whether we are currently tracing based on the Fake Tensor dispatch mode, is reused here.
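A minimal sketch of the idea (the cached/uncached propagate callables below are placeholders, not the actual method names):

```python
# Skip the sharding-propagation cache whenever we are tracing/compiling, as
# detected by _are_we_tracing() from the functional collectives module.
from torch.distributed._functional_collectives import _are_we_tracing

def propagate(op_schema, cached_propagate, uncached_propagate):
    if _are_we_tracing():
        # Under compile, fake tensors / dynamic shapes make cached results
        # brittle, so recompute the sharding decision instead.
        return uncached_propagate(op_schema)
    return cached_propagate(op_schema)
```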

**Test Plan**:
There are already tests for DTensor + Compile with dynamic shape ([test_dtensor_dynamic](https://github.com/pytorch/pytorch/blob/main/test/distributed/tensor/test_dtensor_compile.py#L260),
[test_dynamo_dtensor_from_local_dynamic_shapes](https://github.com/pytorch/pytorch/blob/main/test/distributed/tensor/test_dtensor_compile.py#L402)) that cover the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156868
Approved by: https://github.com/xmfan
2025-07-17 01:33:53 +00:00
c09eba877f [Device] Add support for PrivateUse1 device type in parse_type function (#157609)
This pull request refactors the `parse_type` function in `c10/core/Device.cpp` to improve the handling of the `PrivateUse1` device type. The main change involves reordering the logic to check for the `PrivateUse1` device type earlier in the function for better clarity and efficiency.

This helps existing backends migrate to PrivateUse1 smoothly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157609
Approved by: https://github.com/jgong5, https://github.com/albanD
2025-07-17 01:27:44 +00:00
2179afd714 [easy][guards] Add developer comment for posterity (#158471)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158471
Approved by: https://github.com/StrongerXi
2025-07-17 01:17:04 +00:00
d7e1b8b11d [dynamo] Constant fold torch.autograd._profiler_enabled (#158482)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158482
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
2025-07-17 01:07:42 +00:00
b6454a9058 [AOT_inductor] model_base.h add Windows include files. (#158477)
Add Windows include files to model_base.h.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158477
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-07-17 00:57:48 +00:00
e9367a7a42 ci: Add reusable workflow to get changed files in PRs (#158517)
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158517
Approved by: https://github.com/huydhn
2025-07-17 00:57:43 +00:00
clr
e78f2ac92b inductor: Fix crash in split_cat when tensors is a Node (#157155)
If there is only one node passed to aten::cat, the argument is a single node,
rather than a list of nodes with a valid length.

Example stack
```
  File "/dev/shm/uid-99/be3468a8-seed-nspid4026546656_cgpid14993614-ns-4026546628/torch/_inductor/pattern_matcher.py", line 1115, in apply
    self.handler(match, *match.args, **match.kwargs)
  File "/dev/shm/uid-99/be3468a8-seed-nspid4026546656_cgpid14993614-ns-4026546628/torch/_inductor/fx_passes/split_cat.py", line 1786, in merge_split_cat_aten
    if len(cat_inputs) < threshold_to_cat:
torch._inductor.exc.InductorError: TypeError: object of type 'Node' has no len()
```

This has failed about 7 internal jobs in the last week, running pytorch trunk code from 06/15

I've attached a test which reproduces this issue.
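A minimal sketch of the kind of guard described above (hypothetical helper, not the exact patch):

```python
# aten.cat may receive a single fx.Node instead of a list of nodes, so
# normalize before taking len().
import torch.fx as fx

def normalize_cat_inputs(cat_inputs):
    if isinstance(cat_inputs, fx.Node):
        return [cat_inputs]
    return list(cat_inputs)
```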

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157155
Approved by: https://github.com/jansel
2025-07-17 00:57:38 +00:00
82a1ee1135 Refactor Provenance Tracking (#158399)
Summary:
As Inductor provenance tracking is getting more use cases, we want to separate the provenance-tracking gating flag from the general `trace.enabled`, so we can enable provenance tracking without all the overhead of `trace.enabled`.

- change the guard flag from `trace.enabled` to `trace.provenance_tracking`.  It is turned on by either `TORCH_COMPILE_DEBUG=1` or `INDUCTOR_PROVENANCE=1`.
- Move the provenance tracking logic and variables out of DebugContext, because DebugContext is only enabled with `trace.enabled`. Since the variables are now global variables, added `reset_provenance_globals()` context manager to reset them for each `compile_fx()` call.
- Move `set_kernel_post_grad_provenance_tracing` from `util.py` to `debug.py` so now all provenance related logic is in `debug.py`.

In the future, if we want to enable it further, we can change the provenance tracking flag to be enabled when `TORCH_TRACE` is set. I think we should do that in a separate PR, so it's easier to revert if this flag change creates any problem.
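
A usage sketch based on the flags described above (illustrative only):

```python
# Enable provenance tracking via INDUCTOR_PROVENANCE without turning on the
# full trace.enabled debugging machinery.
import os

os.environ["INDUCTOR_PROVENANCE"] = "1"  # set before importing torch

import torch

@torch.compile
def f(x):
    return torch.relu(x) + 1.0

f(torch.randn(8))
```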

See more motivation in internal Diff

Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r graph_provenance
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```

Differential Revision: D78287976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158399
Approved by: https://github.com/angelayi
2025-07-17 00:23:00 +00:00
306dd19216 update expected results (#158497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158497
Approved by: https://github.com/xmfan
2025-07-17 00:02:52 +00:00
1d58476162 [PP] Add eval() API to schedule (#157795)
These change add an `eval()` API to PP schedules

## Context

Currently, you can run "Forward only" for a schedule in two ways:
1. Use a custom schedule `_ScheduleForwardOnly`
2. Do not pass in `loss_fn` in schedule constructor, and no backward computations will be executed.

However, this is still limiting because we may want to run forward through the pipeline / calculate the loss, but without backward, e.g. during validation. These changes allow for this.

```python
if self.rank == 0:
    schedule.eval(x)
elif self.rank == self.world_size - 1:
    losses = []
    schedule.eval(target=target, losses=losses)
else:
    schedule.eval()
```

TODO:
- in later PRs, we will deprecate the `_ScheduleForwardOnly`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157795
Approved by: https://github.com/wconstab
2025-07-16 23:48:45 +00:00
a4d753295e [Dynamo][Better Engineering] Add enhanced typing support to _dynamo/eval_frame.py (#158276)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to the main entrypoint for dynamo, `eval_frame.py`

Running
```
mypy torch/_dynamo/eval_frame.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  623 | 2232 | 27.91% | 19 | 68 | 27.94% |
| This PR | 2285 | 2285 | 100.00% | 68 | 68 | 100.00% |
| Delta    | +1662 | +63 | +72.09% | +49 | 0 | +72.06% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158276
Approved by: https://github.com/williamwen42

Co-authored-by: William Wen <williamwen@meta.com>
2025-07-16 23:31:10 +00:00
a9f902add0 [CUDA] Use runtime driver API for cuStreamWriteValue32 (#158295)
Reopen https://github.com/pytorch/pytorch/pull/156097

Fixes https://github.com/pytorch/pytorch/issues/154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR https://github.com/pytorch/pytorch/pull/156097 and https://github.com/pytorch/pytorch/pull/154097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-07-16 23:14:36 +00:00
e311886e3d Add transpose to torch/csrc/stable (#158160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158160
Approved by: https://github.com/janeyx99
2025-07-16 22:50:57 +00:00
3cb11877aa [aoti][mps] Enable test_aot_inductor.py tests (#155598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155598
Approved by: https://github.com/yushangdi
2025-07-16 22:26:57 +00:00
5951fcd50a [Dynamo][Better Engineering] Support typing in codegen.py (#158386)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to a critical tracing point for dynamo, primarily for `codegen.py` but also `config.py`

Running
```
mypy torch/_dynamo/codegen.py torch/_dynamo/config.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  347 | 1330 | 26.09% | 24 | 50 | 48.00% |
| This PR | 1334 | 1334 | 100.00% | 50 | 50 | 100.00% |
| Delta    | +987 | +4 | +73.91.% | +26 | 0 | +52.00% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158386
Approved by: https://github.com/StrongerXi
2025-07-16 22:09:01 +00:00
ada44e5ba7 [Dynamo][Better Engineering] Add typing to bytecode analysis and transform (#158293)
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo

This PR adds strict typing support to a critical tracing point for dynamo, `bytecode_transformation.py` and by extension, `bytecode_analysis.py`

Running
```
mypy torch/_dynamo/bytecode_transformation.py torch/_dynamo/bytecode_analysis.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  1422 | 1920 | 74.06% | 73 | 93 | 78.49% |
| This PR | 1968 | 1968 | 100.00% | 93 | 93 | 100.00% |
| Delta    | +546 | +48 | +25.94% | 20 | 0 | +21.51% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158293
Approved by: https://github.com/StrongerXi, https://github.com/Skylion007
2025-07-16 21:50:55 +00:00
9df0176408 [BE][testing] Disable test_static_cuda_launcher:test_floats internally (#158296)
Summary: it seems the check for 'Offd' vs. 'Offf' doesn't work

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158296
Approved by: https://github.com/davidberard98
2025-07-16 21:27:40 +00:00
94c746bb43 [DTensor][BE] add document to ShardingPropagator.register_op_strategy (#158362)
**Summary**
Add document to `ShardingPropagator.register_op_strategy` on how to draft
`strategy_func` and when to use `schema_info`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158362
Approved by: https://github.com/zpcore
2025-07-16 21:08:59 +00:00
473208cb18 [ez][lint] Add pr_time_benchmarks to merge conflictless csv linter (#158353)
Discovered this when looking at a PR I was trying to revert and was surprised that the PR got rid of the spaces but didn't trigger the linter. It turns out the file was following the rule but wasn't actually being checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158353
Approved by: https://github.com/seemethere, https://github.com/Camyll
2025-07-16 20:31:07 +00:00
fb731fe371 Add warning about removed sm50 and sm60 arches (#158301)
Related to https://github.com/pytorch/pytorch/issues/157517

Detect when users are executing a torch build with CUDA 12.8/12.9 while running on Maxwell or Pascal architectures.
We would like to include a reference to the issue https://github.com/pytorch/pytorch/issues/157517, as well as ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.

Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
    Found <GPU Name> which is of cuda capability 5.0.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability supported by this library is 7.0.

  warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
                        Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
                        Please see https://github.com/pytorch/pytorch/issues/157517
                        Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158301
Approved by: https://github.com/nWEIdia, https://github.com/albanD
2025-07-16 20:11:18 +00:00
a9ee4250d5 [4/n] Remove references to TorchScript in PyTorch docs (#158317)
Summary: jit.rst

Test Plan:
CI

Rollback Plan:

Differential Revision: D78309840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158317
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-16 20:01:34 +00:00
14ecc03361 Revert "recovering node source from dict (#158373)"
This reverts commit 4d055982e38f59fdb2a4c9d8855e58548bc42c12.

Reverted https://github.com/pytorch/pytorch/pull/158373 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158373#issuecomment-3080093479))
2025-07-16 19:55:21 +00:00
1cc62c2cb9 [export] Update docs (#157750)
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/157750/export.html

Changes:
* Rename draft_export.md -> export.draft_export.md for consistency.
* Removed non-strict section in export, instead pointed to programming model doc.
* Extended "Expressing Dynamism" section to include Dim hints, ShapeCollection, and AdditionalInputs.
* Removed Specialization section in favor of programming model doc
* Added pt2 archive doc
* Cleaned up sidebar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157750
Approved by: https://github.com/pianpwk
2025-07-16 19:53:12 +00:00
f58a680d09 [c10d]Prototype of remote_group_merge (#158287)
Tentative implementation of merge_remote_group per the proposal here: [docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89](https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158287
Approved by: https://github.com/d4l3k
ghstack dependencies: #157716
2025-07-16 19:33:57 +00:00
944a140e90 Revert "[cuda][cupy] Improve cupy device placement when device is provided (#158320)"
This reverts commit 59f9b25f3cfc635053843372ea29ff4bf754da3f.

Reverted https://github.com/pytorch/pytorch/pull/158320 on behalf of https://github.com/wdvr due to reverting because most likely causing test/test_numba_integration.py::TestNumbaIntegration::test_from_cuda_array_interface_inferred_strides to fail ([comment](https://github.com/pytorch/pytorch/pull/158320#issuecomment-3079960616))
2025-07-16 19:15:33 +00:00
cyy
79ab84e9b8 Fix invalid formatting (#158436)
It causes errors under C++20
```
/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:330:40:
error: call to consteval function 'fmt::fstring<>::fstring<std::string, 0>' is not a constant expression
```
Indeed, the printed value is treated as a format string and it may contain special chars in some cases. While this is not true in our case, it can't be determined at compile time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158436
Approved by: https://github.com/Skylion007
2025-07-16 18:47:09 +00:00
2b0f9b1f61 Move c10/macros/Macros.h to headeronly (#158365)
^

Differential Revision: [D78361893](https://our.internmc.facebook.com/intern/diff/D78361893/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158365
Approved by: https://github.com/swolchok
ghstack dependencies: #158358
2025-07-16 18:46:52 +00:00
b40f48d191 Move the rest of c10/macros/Export.h (#158358)
Differential Revision: [D78356975](https://our.internmc.facebook.com/intern/diff/D78356975/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158358
Approved by: https://github.com/swolchok
2025-07-16 18:46:52 +00:00
4d055982e3 recovering node source from dict (#158373)
Summary: this diff recovers a NodeSource object from its dict representation, which is crucial for NodeSource serde.

Test Plan:
ci

Rollback Plan:

Differential Revision: D78363882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158373
Approved by: https://github.com/yushangdi
2025-07-16 18:46:09 +00:00
bc9091a524 Fix indexing with multi-dimensional boolean mask (#158369)
Fixes #71673

This fixes a bug in PyTorch indexing, that shows up when mixing multi-dimensional boolean masks with other forms of indexing. Examples:
```python
>>> import torch
>>> x = torch.ones([2, 2, 3])
>>> m = torch.tensor(((True, False), (False, False)))  # (2x2 boolean mask)

>>> x[m].shape  # this works fine (the boolean mask acts on the 2x2 subspace selecting one row)
torch.Size([1, 3])

>>> x[m, 0]  # this should produce a tensor of shape (1,)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 3] at index 1

>>> x[m, ::2]  # this should produce a tensor of shape (1, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 1, 3] at index 1

>>> x[m, None]  # this should produce a tensor of shape (1, 1, 3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 1, 2, 3] at index 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158369
Approved by: https://github.com/ngimel
2025-07-16 18:30:57 +00:00
a26bf38927 Don't need to handle PyTrace_EXCEPTION in pyProfileFn (#154392)
According to the [document](https://python.readthedocs.io/fr/stable/c-api/init.html#c.PyTrace_EXCEPTION) and [comment](https://github.com/python/cpython/blob/3.9/Modules/_lsprof.c#L407), we don't need to handle PyTrace_EXCEPTION in pyProfileFn.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154392
Approved by: https://github.com/sraikund16, https://github.com/cyyever
2025-07-16 18:00:11 +00:00
da05b7fb94 [cond] add _FlopCounterMode support for cond (#158067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158067
Approved by: https://github.com/zou3519
ghstack dependencies: #158077
2025-07-16 17:26:20 +00:00
82b1c48292 [hop] add supports_higher_order_operators flag to TorchDispatchMode (#158077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158077
Approved by: https://github.com/zou3519
2025-07-16 17:26:20 +00:00
a369350065 enable compiled autograd on CPU windows (#158432)
Compiled autograd on Windows was disabled in PR #144707 because this code cannot be compiled for CUDA on Windows. However, the code can be compiled on CPU. This PR enables it on CPU Windows builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158432
Approved by: https://github.com/jansel, https://github.com/xmfan

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-07-16 17:22:37 +00:00
ff611d971f [ROCm] check stream graph capture status in memcpy_and_sync inline function (#158165)
Check for stream graph capture when using hipMemcpyWithStream.

Fixes https://github.com/pytorch/pytorch/issues/155684, https://github.com/pytorch/pytorch/issues/155231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158165
Approved by: https://github.com/jeffdaily
2025-07-16 17:17:34 +00:00
4805a6ead6 [aot][XPU] switch xpu to use consts cpp build. (#158425)
The Intel compiler does not support `format_consts_to_asm`, so let's use `format_consts_to_cpp`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158425
Approved by: https://github.com/jansel
2025-07-16 16:19:33 +00:00
a8b9736737 [BE][testing] disable test_custom_op_square internally (#158367)
Summary: test is failing with `ld.lld: error: unable to find library -laoti_custom_ops`

Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:test_aot_inductor_custom_ops -- --exact 'caffe2/test/inductor:test_aot_inductor_custom_ops - test_custom_op_square_cuda (caffe2.test.inductor.test_aot_inductor_custom_ops.AOTInductorTestABICompatibleCuda)' --run-disabled`

Differential Revision: [D78364617](https://our.internmc.facebook.com/intern/diff/D78364617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158367
Approved by: https://github.com/desertfire
2025-07-16 16:16:14 +00:00
4b11428cb5 [BE][testing] Skip test_repeated_masked_load internally (#158355)
Summary: Test is failing internally because of the import from functorch.einops. _Maybe_ there's a way to get this dependency into the TARGETS file, but the obvious things didn't work. I'm wondering whether this test is important enough to have running both in OSS and internally anyway.

Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:cuda_repro -- --exact 'caffe2/test/inductor:cuda_repro - test_repeated_masked_load (caffe2.test.inductor.test_cuda_repro.CudaReproTests)' --run-disabled`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158355
Approved by: https://github.com/eellison
2025-07-16 16:15:44 +00:00
a04a13c449 [BE][testing] Skip test_triton_interpret internally (#158260)
Summary: Subprocesses in fbcode are tricky because of .par files. I'm thinking it's not an important enough test to get it running and skipping is fine.

Test Plan: `buck test`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158260
Approved by: https://github.com/eellison
2025-07-16 16:14:44 +00:00
a23f4471b9 [ROCm][Windows] Fix finding ROCm/HIP version (#156486)
This commit fixes Windows build issue related to trying to use rocm-core (rocm-core doesn't exist on HIP SDK)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156486
Approved by: https://github.com/jeffdaily, https://github.com/stellaraccident
2025-07-16 15:31:43 +00:00
06a67a8948 Fix sha256 for aotriton ROCm7.0 tarball (#158420)
Fixes the following issue when building PyTorch with ROCm 7.0:
```
-- verifying file...
       file='/var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz'
-- SHA256 hash of
    /var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz
  does not match expected value
    expected: '7e29c325d5bd33ba896ddb106f5d4fc7d715274dca7fe937f724fffa82017838'
      actual: '1e9b3dddf0c7fc07131c6f0f5266129e83ce2331f459fa2be8c63f4ae91b0f5b'
-- Hash mismatch, removing...
CMake Error at aotriton_external-prefix/src/aotriton_external-stamp/download-aotriton_external.cmake:163 (message):
  Each download failed!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158420
Approved by: https://github.com/jeffdaily
2025-07-16 15:24:20 +00:00
9513b9d03f Revert "Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)"
This reverts commit bc65253369933160a2da3fc786d027a572faf6b7.

Reverted https://github.com/pytorch/pytorch/pull/158037 on behalf of https://github.com/lw due to OSX failures are real ([comment](https://github.com/pytorch/pytorch/pull/158037#issuecomment-3079042171))
2025-07-16 15:04:10 +00:00
0b19d463d9 forward fix lint (#158448)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158448
Approved by: https://github.com/adamomainz
2025-07-16 14:55:33 +00:00
5763ec5f8d [BE] Replace lib with TORCH_INSTALL_LIB_DIR (#158235)
Their values are actually the same. Just staying in line with other `INSTALL` commands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158235
Approved by: https://github.com/Skylion007
ghstack dependencies: #158234
2025-07-16 14:20:19 +00:00
2043f6911e [BE] Rename libnvshmem_extension to libtorch_nvshmem (#158234)
`libnvshmem_extension.so` creates the illusion that it is a shared library from NVSHMEM. In fact it is built from torch source code, for the symmetric tensor infrastructure and operations, though it leverages NVSHMEM APIs. Thus this PR renames `libnvshmem_extension.so` to `libtorch_nvshmem.so`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158234
Approved by: https://github.com/albanD
2025-07-16 14:20:19 +00:00
bc65253369 Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037)
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them; the hardest part is allowing the lhs and rhs operands to have different scaling types, as that changes the whole call stack.

The scaling format is still detected from the sizes of the scale tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
2025-07-16 13:54:09 +00:00
51a708ffc6 [nativert] libtorch kernel registry (#157150)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D77451703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157150
Approved by: https://github.com/georgiaphillips, https://github.com/henryoier
2025-07-16 12:36:55 +00:00
55d888a616 Add framework for explanations for common CUDA errors (#158395)
As popularly requested in user groups.

Test plan:
```
import torch

a = torch.randn(10000)
device = torch.device('cuda:1')
a = a.to(device)
```

Before:
```
Traceback (most recent call last):
  File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
    a = a.to(device)
        ^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

After:
```
Traceback (most recent call last):
  File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
    a = a.to(device)
        ^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
GPU device may be out of range, do you have enough GPUs?
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158395
Approved by: https://github.com/aorenste

Co-authored-by: Aaron Orenstein <aorenste@fb.com>
2025-07-16 12:31:18 +00:00
0a99b026d6 [Docker builds] Move from Miniconda to Miniforge (#158370)
This is related to: https://www.anaconda.com/legal/terms/terms-of-service

Trying to fix outage with docker builds.
https://github.com/pytorch/pytorch/actions/runs/16298993712/job/46033590799

ROCm and XPU builds are not affected since they already use Miniforge.

```
#22 ERROR: process "/bin/sh -c bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt" did not complete successfully: exit code: 1
------
 > [base 14/42] RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt:
11.93 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
11.93     • https://repo.anaconda.com/pkgs/main
11.93     • https://repo.anaconda.com/pkgs/r
11.93
11.93 To accept a channel's Terms of Service, run the following and replace `CHANNEL` with the channel name/URL:
11.93     ‣ conda tos accept --override-channels --channel CHANNEL
```
Hence the solution is either:
1. using ``conda tos accept --override-channels --channel defaults``
2. using Miniforge instead of Miniconda.

Using solution 2.

Solutions tried that don't work:
1. Using ``CONDA_ALWAYS_YES=true``
2. Using an older version of Miniconda:
```
[Miniconda3-py310_25.5.1-0-Linux-x86_64.sh](https://repo.anaconda.com/miniconda/Miniconda3-py310_25.5.1-0-Linux-x86_64.sh)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158370
Approved by: https://github.com/seemethere

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-07-16 10:52:47 +00:00
ac706bfc7f disable multi kernel rocm (#158299)
Fixes https://github.com/pytorch/pytorch/issues/158274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158299
Approved by: https://github.com/huydhn
2025-07-16 10:20:09 +00:00
9d184bda2f add device generalization support for distributed tests (#156796)
MOTIVATION
To generalize Distributed test cases for non-CUDA devices

CHANGES

- test/distributed/checkpoint/test_fsspec.py
- test/distributed/checkpoint/test_state_dict.py
- test/distributed/test_multi_threaded_pg.py

Replaced hard-coded device names with torch.accelerator.current_accelerator (see the sketch after this list)

- torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py

Added support for the hccl backend
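A small sketch of the generalization, assuming a recent PyTorch where `torch.accelerator` is available; the CPU fallback below is illustrative:
```python
import torch

# Derive the device type at runtime instead of hard-coding "cuda";
# assumes that any reported accelerator is actually usable.
acc = torch.accelerator.current_accelerator()
device_type = acc.type if acc is not None else "cpu"
x = torch.ones(4, device=device_type)
```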

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156796
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-07-16 09:37:03 +00:00
ea74fdd24a [Inductor][Triton] Update TMA Compatibility Requirements (#157881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157881
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-07-16 09:31:44 +00:00
e71bb021b9 Add a periodic test for older NVIDIA driver (#158300)
This is needed because of the botched landing of https://github.com/pytorch/pytorch/pull/156097, which crashed on older NVIDIA drivers `525.*`.  I add a periodic job to install `525.105.17` on CI, then run:

1. A smoke test to make sure that CUDA can be initialized
2. The whole test suite on the older driver
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158300
Approved by: https://github.com/ngimel
2025-07-16 08:18:18 +00:00
fb9a5d248f Fix torch._numpy to match NumPy when empty ellipsis causes advanced indexing separation (#158297)
Fixes #141563

In NumPy, an ellipsis always acts as a separator between advanced indices, even when the ellipsis doesn't actually match any dimensions. In PyTorch an empty ellipsis doesn't cause a separation. This leads to differing behavior between NumPy and PyTorch in this edge case.

This difference in behavior leads to a bug when using torch.compile:
```python
>>> import numpy as np
>>> f = lambda x: x[:,(0,1),...,(0,1)].shape
>>> a = np.ones((3, 4, 5))
>>> f(a)
(2, 3)
>>> torch.compile(f)(a)
(3, 2)
```
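A plain-NumPy sketch of the separator rule described above: an ellipsis, even one matching zero dimensions, separates the two advanced indices and moves the broadcast advanced dims to the front:
```python
import numpy as np

a = np.ones((3, 4, 5))
print(a[:, (0, 1), ..., (0, 1)].shape)  # (2, 3): separated advanced indices
print(a[:, (0, 1), (0, 1)].shape)       # (3, 2): adjacent advanced indices
```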

Similarly to #157676, this PR doesn't change PyTorch's behavior, but it fixes the translation layer, ensuring torch._numpy compatibility with NumPy. I am marking this PR as fixing #141563, even though PyTorch behavior isn't modified.

Notice that there are still some other bugs in PyTorch's advanced indexing that need to be fixed (mainly regarding proper accounting of dimensions when multidimensional boolean masks are present). But those need to be fixed at the ATen operator level. Examples:
- #71673
- #107699
- #158125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158297
Approved by: https://github.com/soumith
2025-07-16 08:11:53 +00:00
ddf502c988 [AOTI] add -lstdc++ into aoti link cmd for Meta internal (#158325)
Differential Revision: D78123716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158325
Approved by: https://github.com/desertfire
2025-07-16 07:55:08 +00:00
555f356254 [Easy] Show some clear error when torch.ops.load_library fails. (#157524)
**Background**:

```Shell
torch       2.5.1+cpu
torchvision 0.20.1
```

```Python
import torch
import torchvision

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
  File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```

**Cause**:

```
torchvision's .so file lacks some symbol definitions because those symbols come from CUDA, but the current environment has neither CUDA nor a GPU. The above error message is very confusing.
```
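A hedged sketch (not the PR's actual change) of how a caller could attach such a hint around `torch.ops.load_library`; the wrapper name is made up for illustration:
```python
import torch

def load_library_with_hint(path: str) -> None:
    # Wrap the load so that a dlopen failure carries a hint about the
    # most common cause: a CUDA-built extension on a CPU-only install.
    try:
        torch.ops.load_library(path)
    except OSError as e:
        raise OSError(
            f"Failed to load {path}: {e}\n"
            "If the error mentions undefined symbols, the extension was "
            "likely built against a CUDA-enabled torch that is not installed."
        ) from e
```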
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
2025-07-16 07:33:22 +00:00
59f9b25f3c [cuda][cupy] Improve cupy device placement when device is provided (#158320)
This is an improvement over https://github.com/pytorch/pytorch/pull/132595 . That PR improves the case where `device` is not given. This PR tries to improve the case where `device` is given but the first step of auto-inferring the device from `cudaPointerGetAttributes` can be wrong (undesired). See https://github.com/pytorch/pytorch/issues/158316 for more details on when this can happen.

I think this is a reasonable improvement, as people expect `torch.as_tensor` + cupy should be zero-copy as much as possible. However, it does change some behaviors, because previously it might incur a device-to-device copy.

I will leave it to pytorch developers to see if the improvement is worthwhile.
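A hedged illustration of the intended usage, assuming CuPy is installed and at least two GPUs are visible; the device index is only for the example:
```python
import cupy as cp
import torch

with cp.cuda.Device(1):
    a = cp.arange(10)

# With an explicit device, the conversion is expected to stay zero-copy on
# that device instead of relying first on pointer-attribute inference.
t = torch.as_tensor(a, device="cuda:1")
```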

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158320
Approved by: https://github.com/ezyang
2025-07-16 07:12:36 +00:00
fedbd1a48e Enable ROCm 7.0 Alpha docker builds for PyTorch CI (#158390)
This PR adds ROCm 7.0 alpha docker builds to start testing latest ROCm in PyTorch CI and enable new MI350x hardware.

Highlights:
* Stop building `pytorch-linux-jammy-rocm-n-1-py3` docker images, as they're not currently used in any CI workflows
* Add `pytorch-linux-noble-rocm-alpha-py3` docker images that will use ROCm alpha (newer than latest official release) builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158390
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-07-16 06:09:37 +00:00
5484890539 Add better typing to available kernel options for flex attention (#158383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158383
Approved by: https://github.com/joydddd, https://github.com/BoyuanFeng
2025-07-16 06:06:29 +00:00
61a7b09ef3 [BE][Easy] split build system requirements.txt to a separate file (#158111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158111
Approved by: https://github.com/ezyang
2025-07-16 05:03:30 +00:00
e92e3eaf4e [Profiler] the doc of _ExperimentalConfig is incorrectly truncated by commas (#156586)
Hi team,

Please help review this trivial fix.

Without this change:

``` python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None

    capture_overload_names (bool) : whether to include ATen overload names in the profile
```

With this change:

```python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None

An experimental config for Kineto features. Please note thatbackward compatibility is not guaranteed.
    profiler_metrics : a list of CUPTI profiler metrics used
       to measure GPU performance events.
       If this list contains values Kineto runs in CUPTI profiler mode
    profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
       or for the entire measurement duration.
    verbose (bool) : whether the trace file has `Call stack` field or not.
    performance_events : a list of profiler events to be used for measurement.
    enable_cuda_sync_events : for CUDA profiling mode, enable adding CUDA synchronization events
       that expose CUDA device, stream and event synchronization activities. This feature is new
       and currently disabled by default.
    adjust_profiler_step (bool) : whether to adjust the profiler step to
       match the parent python event duration. This feature is new and currently disabled by default.
    disable_external_correlation (bool) : whether to disable external correlation
    profile_all_threads (bool) : whether to profile all threads
    capture_overload_names (bool) : whether to include ATen overload names in the profile

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156586
Approved by: https://github.com/sraikund16, https://github.com/cyyever
2025-07-16 04:10:49 +00:00
0a9d450168 [DTensor] implement histc (#158298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158298
Approved by: https://github.com/zpcore, https://github.com/XilunWu
2025-07-16 04:10:32 +00:00
e265b719bd Extract out prepare_aot_module_simplified for use in next PR (#158319)
Also a small amount of extra code cleanup.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158319
Approved by: https://github.com/jingsh
ghstack dependencies: #158149, #158150, #158173, #158176, #158213, #158251
2025-07-16 03:59:41 +00:00
7637c9718a Move functions from torch._functorch.aot_autograd that are not frontend functions to frontend_utils (#158251)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158251
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173, #158176, #158213
2025-07-16 03:59:41 +00:00
49d0332cef Introduce stages to aot_dispatch (#158213)
The starting point for this refactor is that I need access to the fully
general joint graph representation in an export-like interface, but I
then subsequently need a way to feed this joint graph into the rest of
the compilation pipeline so I can get an actual callable that I can run
once I've finished modifying it.  Previously, people had added export
capabilities to AOTAutograd by having an export flag that toggled what
exactly the functions return and triggering aot_dispatch to go to a
different "export" implementation, but I've found this difficult to
understand, and it has led to a bit of duplicate code for the export path.

So the idea here is to reorganize the structure of the function calls in AOTAutograd. Here, it is helpful to first describe how things used to work:

* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These call:
  * create_aot_dispatcher_function. This does a bunch of stuff (forward metadata collection) and adds many context managers. This calls:
    * One of aot_dispatch_base, aot_dispatch_export or aot_dispatch_autograd, which:
      * Call aot_dispatch_autograd_graph or aot_dispatch_base_graph to actually do the graph capture
      * Do some base/export/autograd specific post-processing on the graph

Notice that the pattern of nested function invocations means there is no way to easily get the graph capture result from the autograd case; furthermore, the export path is "bolted on" to force the entire chain of functions to have a different return result than normal, with no way to *resume* the rest of the post-processing to actually get a callable.

Here is the new structure:

* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These now orchestrate this top level flow:
  * Start a context manager (stack); this stateful context block takes care of all of the nested context managers which originally necessitated the nested call structure
  * Call create_aot_state to do initial setup and setup all the context managers on stack. These context managers do NOT exit upon return of this.
  * Call aot_stage1_graph_capture to do the graph capture
  * Call aot_stage2_compile or aot_stage2_export depending on what postprocessing you want

With this new structure, it's now possible (although not done in this PR) to return the graph after aot_stage1_graph_capture and do something with it, before running aot_stage2_compile to finish the job.
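A schematic, self-contained sketch of that call structure; the stage names come from the description above, while the stubs below only mimic the control flow and are not the real implementations:
```python
from contextlib import ExitStack

def create_aot_state(fn, args, stack):    # stub: setup, metadata, context managers
    return {"fn": fn, "args": args, "stack": stack}

def aot_stage1_graph_capture(state):      # stub: joint graph capture
    return (state["fn"], state["args"])

def aot_stage2_compile(state, graph):     # stub: finish compilation to a callable
    fn, args = graph
    return lambda: fn(*args)

def aot_stage2_export(state, graph):      # stub: hand back the captured graph
    return graph

def aot_function_schematic(fn, args, *, export=False):
    with ExitStack() as stack:
        state = create_aot_state(fn, args, stack)
        graph = aot_stage1_graph_capture(state)
        if export:
            return aot_stage2_export(state, graph)
        return aot_stage2_compile(state, graph)

print(aot_function_schematic(lambda x: x + 1, (41,))())  # 42
```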

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158213
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173, #158176
2025-07-16 03:59:32 +00:00
84dec060b7 Hoist choose_dispatcher to top level, remove unnecessary returns (#158176)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158176
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173
2025-07-16 03:56:25 +00:00
5b0df2565e Pipeline _create_aot_dispatcher_function (#158173)
Two main things of note:

- Review this diff without whitespace changes
- To ensure that context managers correctly propagate to later pipeline
  stages, I am using the ExitStack trick: there is an ExitStack which is
  in scope for the entire pipeline, and inside of the individual
  pipeline stages we push context managers onto this stack when we want
  them to survive into the next pipeline stage.  This is not obviously
  what the best final form of the code is, but
  create_aot_dispatcher_function is called from multiple locations so I
  can't just inline the context managers into the call site (see the sketch after this list).
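A minimal, self-contained illustration of the ExitStack trick: a context manager entered inside an early stage stays active for later stages because it is pushed onto a stack owned by the caller:
```python
from contextlib import ExitStack, contextmanager

@contextmanager
def stage_cm(name):
    print(f"enter {name}")
    yield
    print(f"exit {name}")

def stage1(stack):
    # Survives past stage1's return because it lives on the caller's stack.
    stack.enter_context(stage_cm("stage1"))

def stage2():
    print("stage2 runs while stage1's context is still active")

with ExitStack() as stack:
    stage1(stack)
    stage2()
# "exit stage1" prints only here, when the ExitStack unwinds.
```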

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158173
Approved by: https://github.com/jamesjwu, https://github.com/wconstab
ghstack dependencies: #158149, #158150
2025-07-16 03:56:25 +00:00
0cb36e2d62 cache dict and string rep for better perf (#158372)
Summary: NodeSource should not be updated after creation, so it is better to cache its dict and string representation for better perf.
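A hedged sketch of the caching idea (NodeSource's real fields differ; this only shows the pattern): since the object is immutable after construction, its dict and string forms can be computed once and reused:
```python
from functools import cached_property

class NodeSourceSketch:
    def __init__(self, name: str, lineno: int):
        self._name = name
        self._lineno = lineno

    @cached_property
    def as_dict(self) -> dict:
        # Computed once on first access, then served from the cache.
        return {"name": self._name, "lineno": self._lineno}

    @cached_property
    def as_str(self) -> str:
        return f"NodeSource({self._name}:{self._lineno})"
```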

Test Plan:
ci

Rollback Plan:

Reviewed By: yushangdi

Differential Revision: D78298501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158372
Approved by: https://github.com/yushangdi
2025-07-16 02:15:32 +00:00
584a0510b3 [inductor] fix windows path for fresh cache. (#158324)
Use `normalize_path_separator` for the Windows path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158324
Approved by: https://github.com/jansel
2025-07-16 01:54:35 +00:00
9768d393fa add sfdp pattern (#155792)
Add an sfdp pattern for MBartForCausalLM/PLBartForCausalLM in transformers==4.44.2.
This improves the inference performance of these models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155792
Approved by: https://github.com/Valentine233, https://github.com/jansel
2025-07-16 01:52:05 +00:00
900fba4c07 Update warning of TF32 (#158209)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158209
Approved by: https://github.com/jansel
2025-07-16 01:28:50 +00:00
03852ddc22 Revert "[ROCm] logsumexp on ROCm needs scaling back to natural base. (#156903)"
This reverts commit 1ea9cde598ead20194dbb6c5cb26e74e36e6ad55.

Reverted https://github.com/pytorch/pytorch/pull/156903 on behalf of https://github.com/atalman due to Breaks torchao and torchtitan nightly builds ([comment](https://github.com/pytorch/pytorch/pull/156903#issuecomment-3076423488))
2025-07-16 01:28:46 +00:00
8554c8007d [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate a large amount of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in cases where a large amount of reads is accumulated. This is in addition to some existing logic during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the total size_ of the buffers that can be accumulated (a usage sketch follows this list).
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such patterns there. The decisions are implemented under `choices.py`.
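A hedged usage sketch: `realize_acc_reads_threshold` is the existing count-based knob (default 8 per the description); the value given to the new `realize_acc_reads_size_threshold` is illustrative only:
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.realize_acc_reads_threshold = 8            # number of buffers
inductor_config.realize_acc_reads_size_threshold = 2**30   # bytes, illustrative

@torch.compile
def accumulate(total, n, size):
    # Without the thresholds, all n random buffers could be kept alive by a
    # single fused addition; with them, the accumulation is broken up.
    for _ in range(n):
        total = total + torch.rand(size, size)
    return total

out = accumulate(torch.rand(64, 64), 16, 64)
```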

**Results:**
For a small example similar to the one in the test case (but with larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-16 01:05:25 +00:00
651b4a68f2 [hop][dynamo] track run-ahead sym variables in side effects (#158273)
Before the PR, for code like this:
```
        class Example2(torch.nn.Module):
            def forward(self, x, trigger, target):
                return torch.cond(
                    trigger == 1,
                    lambda: x + target,
                    lambda: x * target,
                    (),
                )

        m = Example2()
        x = torch.randn(2)
        trigger = 0
        target = 2
        args = (x, trigger, target)
        ep = torch.export.export(
            m, args, dynamic_shapes=(None, Dim.DYNAMIC, Dim.DYNAMIC)
        )
```
dynamo will wrap "target" (i.e. a symInt) twice, once when we speculate the first lambda and find target is a symint and decides to wrap it up, creating a new SymNodeVariable and a placeholder input to the top-level graph.

The second time happens when we speculate the second lambda. Tensors are de-duplicated by checking tracked side effects to make sure objects with the same id (though different sources) are mapped to the same TensorVariable. For symints, two things are missing:
1. it's not in the _can_lift_attrs_to_input list (the change in builder.py)
2. it's not tracked by runahead_side_effects, so when speculate_subgraph finishes, they're discarded (the change in side_effects.py)

Note: the auto-lifting mechanism for HOPs happens at the proxy level when we trace the subgraph, which is after the SymNodeVariables are created (they're created when realizing the args and binding them to the subgraph). At that time, the builder has created two unique SymNodeVariables for the same symint, so the auto lifting in HOPs cannot de-dup them.

Differential Revision: [D78298163](https://our.internmc.facebook.com/intern/diff/D78298163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158273
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
2025-07-15 23:48:20 +00:00
144965ca9a [BE][S538760] get rid of TORCH_CHECK_.* and CHECK macros (#158269)
Summary: a failed CHECK will crit, causing the program to exit, which is quite dangerous

Test Plan:
CI

Rollback Plan:

Differential Revision: D78050595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158269
Approved by: https://github.com/SherlockNoMad, https://github.com/henryoier
2025-07-15 22:04:12 +00:00
ee0992871c Add test for user-managed weights with load_state_dict (#157496)
Summary:
Adds a unit test to verify that when 'user_managed=True' is passed to 'update_constant_buffer', the compiled AOTI model properly shares parameter storage with the eager model.

The test specifically covers the following:
1. Passes model weights to the AOTI model with 'user_managed=True'.
2. Updates the eager model weights using 'load_state_dict()', which performs an in-place update.
3. Asserts that the compiled AOTI model reflects the updated weights, confirming shared memory behavior.

Fixes: #157474

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157496
Approved by: https://github.com/desertfire
2025-07-15 21:17:24 +00:00
05dfd312cf [3/n] Remove references to TorchScript in PyTorch docs (#158315)
Summary:
- cpp_index.rst
- fx.md
- jit_builtin_functions.rst
- jit_python_reference.md
- jit_unsupported.md

- cpu_threading
- large_scale_deployment

Test Plan:
CI

Rollback Plan:

Differential Revision: D78309320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158315
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-15 21:14:18 +00:00
abeae997a3 Use brew suggested miniconda install command (#158347)
Use ```brew install --cask miniconda``` as specified by https://formulae.brew.sh/cask/miniconda

Forward fix After: https://github.com/pytorch/pytorch/pull/156898#issuecomment-3074207175

Seeing in CI:
```
Run if [[ -n "$REINSTALL_BREW_MINICONDA" ]]; then
==> Caveats
Please run the following to setup your shell:
  conda init "$(basename "${SHELL}")"

Alternatively, manually add the following to your shell init:
  eval "$(conda "shell.$(basename "${SHELL}")" hook)"

==> Downloading https://repo.anaconda.com/miniconda/Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh
Already downloaded: /Users/ec2-user/Library/Caches/Homebrew/downloads/2e356e8b147647692e4da77ce4c0c14eefee65ec86f29cc7e8c21a26ac9397ca--Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh
==> Installing Cask miniconda
==> Running installer script 'Miniconda3-py313_25.5.1-0-MacOSX-arm64.sh'
PREFIX=/opt/homebrew/Caskroom/miniconda/base
Unpacking payload ...
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.

Installing base environment...

Preparing transaction: ...working... done
Executing transaction: ...working...
done
entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
installation finished.
==> Linking Binary 'conda' to '/opt/homebrew/bin/conda'
🍺  miniconda was successfully installed!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158347
Approved by: https://github.com/seemethere
2025-07-15 21:08:25 +00:00
3f83e3eeca [ONNX] Remove legacy registration and dispatcher (#158283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158283
Approved by: https://github.com/Skylion007, https://github.com/justinchuby
ghstack dependencies: #158258, #158262, #158282
2025-07-15 21:00:49 +00:00
0640cfa38c [2/n] Remove references to TorchScript in PyTorch docs (#158306)
Summary: Removed jit_language_reference.md

Test Plan:
CI

Rollback Plan:

Differential Revision: D78308133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158306
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-15 20:57:23 +00:00
e4c17d5e1c [ONNX] Remove fx_onnx_interpreter.py (#158282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158282
Approved by: https://github.com/Skylion007, https://github.com/justinchuby
ghstack dependencies: #158258, #158262
2025-07-15 20:46:06 +00:00
cc0faeb80f [dynamo][guards] Instruction count for guard eval for development work (#158214)
It's turned off by default. The code is even hidden behind a define preprocessor flag. It will be used only for development work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158214
Approved by: https://github.com/StrongerXi
ghstack dependencies: #158215
2025-07-15 20:29:23 +00:00
205241a0d5 [ONNX] Remove legacy dynamo graph extractor (#158262)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158262
Approved by: https://github.com/justinchuby
ghstack dependencies: #158258
2025-07-15 20:21:49 +00:00
19625daf88 [1/n] Remove references to TorchScript in PyTorch docs (#158305)
Summary: Removed jit_language_reference_v2.md

Test Plan:
CI

Rollback Plan:

Differential Revision: D78308009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158305
Approved by: https://github.com/jingsh, https://github.com/svekars
2025-07-15 20:16:53 +00:00
dbf7d421da [BE][testing] fix aot_inductor_package internally (#158270)
Summary: We have internal test failures for several aot_inductor_package tests. It looks like we're translating args like:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor/script.ld
```

To:
```
-Wl,--script=/home/slarsen/local/fbsource2/buck-out/v2/gen/fbcode/7ce8f48f92bc4ee6/caffe2/test/inductor/__aot_inductor_package__/aot_inductor_package#link-tree/torch/_inductor//tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```

This PR changes to strings like:
```
-Wl,--script=/tmp/jZMktZ/tmpsqoxb_cq/data/aotinductor/model/script.ld
```

Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:aot_inductor_package --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158270
Approved by: https://github.com/desertfire
2025-07-15 20:15:18 +00:00
b86d5cef68 [dynamo][tensor] Skip HASATTR attribute on tensor guards (#158215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158215
Approved by: https://github.com/StrongerXi
2025-07-15 20:10:47 +00:00
30587195d3 Migrate c10/macros/cmake_macros.h.in to torch/headeronly (#158035)
Summary: As above, also changes a bunch of the build files to be better

Test Plan:
internal and external CI

did run buck2 build fbcode//caffe2:torch and it succeeded

Rollback Plan:

Reviewed By: swolchok

Differential Revision: D78016591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158035
Approved by: https://github.com/swolchok
2025-07-15 19:52:59 +00:00
250ae2531c Fix types in graphs.py (#158192)
Added type annotations for torch/cuda/graphs.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158192
Approved by: https://github.com/oulgen
2025-07-15 19:49:38 +00:00
011026205a make node source hashable (#158322)
Summary: as title

Test Plan:
ci

Rollback Plan:

Reviewed By: yushangdi

Differential Revision: D78296410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158322
Approved by: https://github.com/yushangdi
2025-07-15 19:31:00 +00:00
4657a84bc5 [Optimus][fp8_activation_quantization] Only log when there's some node to be quantized (#158129)
Summary:
We add an extra check on whether any node has been marked as should-quantize; otherwise we skip the quantization and the tlparse log.

Rollback Plan:

Differential Revision: D78173788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158129
Approved by: https://github.com/Skylion007, https://github.com/avicizhu
2025-07-15 19:22:26 +00:00
5606c516fd [ONNX] Remove legacy Dort (#158258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158258
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-07-15 19:14:06 +00:00
7afb834f93 Inline dispatch_and_compile into its call site. (#158150)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158150
Approved by: https://github.com/jamesjwu, https://github.com/wconstab
ghstack dependencies: #158149
2025-07-15 19:08:55 +00:00
148789ddd8 Avoid AOTAutogradCache.load in stack trace on cache miss path (#158149)
The general context for the upcoming stack of commits is that I am attempting
to "pipeline" AOTAutograd.  Instead of having function f call function g,
which is the next "stage" of compilation, f should return its outputs,
which are then piped to g for the next stage.  This will make it easier to
implement early exit / pipeline resume without forcing a callback structure,
which is good for export-style use cases.  It also reduces the size of our
stack traces, which makes tools like Perfetto happy.
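A tiny, generic illustration of the pipelining idea (the stage names are made up): each stage returns its output instead of calling the next one, so the caller can stop early, inspect intermediates, or resume later:
```python
def stage_f(x):
    return x + 1              # new style: just return the result

def stage_g(y):
    return y * 2

# Old style would be: def stage_f(x): return stage_g(x + 1)  (deep stacks)
intermediate = stage_f(3)     # early-exit point: stop or inspect here
result = stage_g(intermediate)
print(result)                 # 8
```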

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158149
Approved by: https://github.com/jamesjwu
2025-07-15 19:08:55 +00:00
3beb915004 Update CODEOWNERS for dataloading (#158348)
Adding Scott

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158348
Approved by: https://github.com/scotts, https://github.com/janeyx99
2025-07-15 19:06:18 +00:00
cf3247b74a Standalone compile API in _Exporter (#158139)
Given a `package: _ExportPackage`, users can get a ready-to-use workspace in `tmp_dir` by calling:
```python
package._compiled_and_package(
                tmp_dir + "/pt2_pacakge_name.pt2", True, package_example_inputs = True
            )
```

`tmp_dir` will contains:
- `main.cpp` (an example cpp file that creates the models; if package_example_inputs is True, it will also load the example inputs and run the models)
- `CMakeLists.txt`
- `pt2_pacakge_name/` (this is where the models are)
- `pt2_pacakge_name.pt2`
- `inputs.pt` files if package_example_inputs is True

Remaining TODOs
- support loading constants/weights
- the `package_example_inputs = True` option only supports a list of Tensors for now
- eventually we should remove the `torch` dependency, and use `SlimTensor`/`StableIValue` instead.

Test Plan:
```
python test/inductor/test_aot_inductor_package.py  -k test_compile_with_exporter
```

Example generated `main.cpp`:

```cpp
#include <dlfcn.h>
#include <fstream>
#include <iostream>
#include <memory>
#include <torch/torch.h>
#include <vector>
#include <torch/csrc/inductor/aoti_torch/tensor_converter.h>
#include "package/data/aotinductor/Plus__default/Plus__default.h"
#include "package/data/aotinductor/Minus__default/Minus__default.h"

using torch::aot_inductor::AOTInductorModelPlus__default;
using torch::aot_inductor::AOTInductorModelMinus__default;
using torch::aot_inductor::ConstantHandle;
using torch::aot_inductor::ConstantMap;

int main(int argc, char* argv[]) {
    std::string device_str = "cpu";
    try {
        c10::Device device(device_str);
        // Load input tensors for model Plus__default
        std::vector<at::Tensor> input_tensors1;
        for (int j = 0; j < 2; ++j) {
            std::string filename = "Plus__default_input_" + std::to_string(j) + ".pt";
            std::ifstream in(filename, std::ios::binary);
            if (!in.is_open()) {
                std::cerr << "Failed to open file: " << filename << std::endl;
                return 1;
            }
            std::vector<char> buffer((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
            torch::IValue ivalue = torch::pickle_load(buffer);
            input_tensors1.push_back(ivalue.toTensor().to(device));
        }

        // Load input tensors for model Minus__default
        std::vector<at::Tensor> input_tensors2;
        for (int j = 0; j < 2; ++j) {
            std::string filename = "Minus__default_input_" + std::to_string(j) + ".pt";
            std::ifstream in(filename, std::ios::binary);
            if (!in.is_open()) {
                std::cerr << "Failed to open file: " << filename << std::endl;
                return 1;
            }
            std::vector<char> buffer((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
            torch::IValue ivalue = torch::pickle_load(buffer);
            input_tensors2.push_back(ivalue.toTensor().to(device));
        }

// Create array of input handles
        auto input_handles1 =
            torch::aot_inductor::unsafe_alloc_new_handles_from_tensors(input_tensors1);
        auto input_handles2 =
            torch::aot_inductor::unsafe_alloc_new_handles_from_tensors(input_tensors2);

// Create array for output handles
        AtenTensorHandle output_handle1;
        AtenTensorHandle output_handle2;

// Create and load models
        auto constants_map1 = std::make_shared<ConstantMap>();
        auto constants_array1 = std::make_shared<std::vector<ConstantHandle>>();
        auto model1 = AOTInductorModelPlus__default::Create(
            constants_map1, constants_array1, device_str,
            "package/data/aotinductor/Plus__default/");
        model1->load_constants();
        auto constants_map2 = std::make_shared<ConstantMap>();
        auto constants_array2 = std::make_shared<std::vector<ConstantHandle>>();
        auto model2 = AOTInductorModelMinus__default::Create(
            constants_map2, constants_array2, device_str,
            "package/data/aotinductor/Minus__default/");
        model2->load_constants();

// Run the models
        torch::aot_inductor::DeviceStreamType stream1 = nullptr;
        model1->run(&input_handles1[0], &output_handle1, stream1, nullptr);
        torch::aot_inductor::DeviceStreamType stream2 = nullptr;
        model2->run(&input_handles2[0], &output_handle2, stream2, nullptr);

// Convert output handles to tensors
        auto output_tensor1 =
            torch::aot_inductor::alloc_tensors_by_stealing_from_handles(&output_handle1, 1);
        auto output_tensor2 =
            torch::aot_inductor::alloc_tensors_by_stealing_from_handles(&output_handle2, 1);

// Validate outputs
        std::cout << "output_tensor1" << output_tensor1 << std::endl;
        std::cout << "output_tensor2" << output_tensor2 << std::endl;
        return 0;
    } catch (const std::exception &e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
}

```

Rollback Plan:

Differential Revision: D78124705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158139
Approved by: https://github.com/desertfire
2025-07-15 18:47:56 +00:00
46915b1361 Revert "Introduce AcceleratorAllocatorConfig as the common class (#149601)"
This reverts commit 1e8e9f745e43fa38bbfc7b67b30bc66c0e7ebbd6.

Reverted https://github.com/pytorch/pytorch/pull/149601 on behalf of https://github.com/huydhn due to See https://github.com/pytorch/pytorch/pull/149601#discussion_r2208325379 ([comment](https://github.com/pytorch/pytorch/pull/149601#issuecomment-3074965720))
2025-07-15 18:40:59 +00:00
8c3f206457 Fix AArch64 segfaults by disabling strict-aliasing in GridSamplerKernel for GCC 12 and above (#158117)
This PR disables the `strict-aliasing` GCC C++ optimization flag on all AArch64 CPUs for GCC versions 12 and above.

Pull Request #152825 upgraded the gcc version from 11 to 13 in manywheel, which caused several segmentation faults in unit tests (not visible in CI workflows because the jammy gcc version has not been updated yet).

We identified that the problem also exists in GCC 12, hence the `__GNUC__ >= 12` check.

Fixes #157626

This fixes these test failures when PyTorch is built with GCC 12 and above:
```
test_ops.py::TestCommonCPU::test_noncontiguous_samples_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_grid_sampler_2d_cpu Fatal Python error: Segmentation fault
test_ops.py::TestMathBitsCPU::test_neg_view_nn_functional_grid_sample_cpu_float64 free(): invalid next size (fast)
test_ops.py::TestCompositeComplianceCPU::test_backward_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_nn_functional_grid_sample_cpu Fatal Python error: Segmentation fault

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158117
Approved by: https://github.com/malfet
2025-07-15 18:26:38 +00:00
705 changed files with 21402 additions and 21555 deletions

View File

@ -2,7 +2,7 @@ build --cxxopt=--std=c++17
build --copt=-I.
# Bazel does not support including its cc_library targets as system
# headers. We work around this for generated code
# (e.g. c10/macros/cmake_macros.h) by making the generated directory a
# (e.g. torch/headeronly/macros/cmake_macros.h) by making the generated directory a
# system include path.
build --copt=-isystem --copt bazel-out/k8-fastbuild/bin
build --copt=-isystem --copt bazel-out/darwin-fastbuild/bin

View File

@ -78,319 +78,45 @@ elif [[ "$image" == *linter* ]]; then
DOCKERFILE="linter/Dockerfile"
fi
_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb
_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
if [[ "$image" == *rocm* ]]; then
_UCX_COMMIT=cc312eaa4655c0cc5c2bcd796db938f90563bcf6
_UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d
fi
PY_HARDCODED_CONFIG_SCRIPT=$(python3 get_config.py --image "$image")
tag=$(echo $image | awk -F':' '{print $2}')
# It's annoying to rename jobs every time you want to rewrite a
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$tag" in
pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11)
CUDA_VERSION=12.4
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9)
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-jammy-py3-clang12-onnx)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=12
VISION=yes
ONNX=yes
;;
pytorch-linux-jammy-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=12
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.11-clang12)
ANACONDA_PYTHON_VERSION=3.11
CLANG_VERSION=12
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc9)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=9
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-rocm-n-1-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
ROCM_VERSION=6.3
if [[ $? -eq 0 ]]; then
eval "$PY_HARDCODED_CONFIG_SCRIPT"
else
echo "[Fallback] Python script failed or no match — fallback to hardcoded shell case"
# Catch-all for builds that are not hardcoded.
VISION=yes
echo "image '$image' did not match an existing build configuration"
if [[ "$image" == *py* ]]; then
extract_version_from_image_name py ANACONDA_PYTHON_VERSION
fi
if [[ "$image" == *cuda* ]]; then
extract_version_from_image_name cuda CUDA_VERSION
extract_version_from_image_name cudnn CUDNN_VERSION
fi
if [[ "$image" == *rocm* ]]; then
extract_version_from_image_name rocm ROCM_VERSION
NINJA_VERSION=1.9.0
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-rocm-n-py3 | pytorch-linux-noble-rocm-n-py3)
if [[ $tag =~ "jammy" ]]; then
ANACONDA_PYTHON_VERSION=3.10
else
ANACONDA_PYTHON_VERSION=3.12
# To ensure that any ROCm config will build using conda cmake
# and thus have LAPACK/MKL enabled
fi
GCC_VERSION=11
VISION=yes
ROCM_VERSION=6.4
NINJA_VERSION=1.9.0
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-xpu-2025.0-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
VISION=yes
XPU_VERSION=2025.0
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-xpu-2025.1-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
VISION=yes
XPU_VERSION=2025.1
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
VISION=yes
KATEX=yes
TRITON=yes
DOCS=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=12.8.1
CUDNN_VERSION=9
CLANG_VERSION=12
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang18-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=18
VISION=yes
;;
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
VISION=yes
KATEX=yes
TRITON=yes
DOCS=yes
UNINSTALL_DILL=yes
;;
pytorch-linux-jammy-py3-clang12-executorch)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=12
EXECUTORCH=yes
;;
pytorch-linux-jammy-py3.12-halide)
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
HALIDE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.12-triton-cpu)
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
TRITON_CPU=yes
;;
pytorch-linux-jammy-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
# would be to upgrade mypy to 1.0.0 with Python 3.11
PYTHON_VERSION=3.9
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter)
PYTHON_VERSION=3.9
CUDA_VERSION=12.8.1
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
VISION=yes
CONDA_CMAKE=yes
OPENBLAS=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
VISION=yes
CONDA_CMAKE=yes
OPENBLAS=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
INDUCTOR_BENCHMARKS=yes
;;
*)
# Catch-all for builds that are not hardcoded.
VISION=yes
echo "image '$image' did not match an existing build configuration"
if [[ "$image" == *py* ]]; then
extract_version_from_image_name py ANACONDA_PYTHON_VERSION
fi
if [[ "$image" == *cuda* ]]; then
extract_version_from_image_name cuda CUDA_VERSION
extract_version_from_image_name cudnn CUDNN_VERSION
fi
if [[ "$image" == *rocm* ]]; then
extract_version_from_image_name rocm ROCM_VERSION
NINJA_VERSION=1.9.0
TRITON=yes
# To ensure that any ROCm config will build using conda cmake
# and thus have LAPACK/MKL enabled
fi
if [[ "$image" == *centos7* ]]; then
NINJA_VERSION=1.10.2
fi
if [[ "$image" == *gcc* ]]; then
extract_version_from_image_name gcc GCC_VERSION
fi
if [[ "$image" == *clang* ]]; then
extract_version_from_image_name clang CLANG_VERSION
fi
if [[ "$image" == *devtoolset* ]]; then
extract_version_from_image_name devtoolset DEVTOOLSET_VERSION
fi
if [[ "$image" == *glibc* ]]; then
extract_version_from_image_name glibc GLIBC_VERSION
fi
;;
if [[ "$image" == *centos7* ]]; then
NINJA_VERSION=1.10.2
fi
if [[ "$image" == *gcc* ]]; then
extract_version_from_image_name gcc GCC_VERSION
fi
if [[ "$image" == *clang* ]]; then
extract_version_from_image_name clang CLANG_VERSION
fi
if [[ "$image" == *devtoolset* ]]; then
extract_version_from_image_name devtoolset DEVTOOLSET_VERSION
fi
if [[ "$image" == *glibc* ]]; then
extract_version_from_image_name glibc GLIBC_VERSION
fi
;;
esac
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')

View File

@ -1 +1 @@
ae848267bebc65c6181e8cc5e64a6357d2679260
11ec6354315768a85da41032535e3b7b99c5f706

View File

@ -4,12 +4,8 @@ set -ex
# Optionally install conda
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.anaconda.com/miniconda"
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]] || [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download" # @lint-ignore
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
fi
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download" # @lint-ignore
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)
MINOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 2)
@ -21,7 +17,6 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
exit 1
;;
esac
mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda

View File

@ -15,11 +15,35 @@ function install_timm() {
commit=$(get_pinned_commit timm)
pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"
# Clean up
conda_run pip uninstall -y torch torchvision triton
}
function install_torchbench() {
local commit
commit=$(get_pinned_commit torchbench)
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout "$commit"
python install.py --continue_on_fail
# TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488
# is regressing speedup metric. This needs to be investigated further
pip install transformers==4.38.1
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd
}
# Pango is needed for weasyprint which is needed for doctr
conda_install pango
# Stable packages are ok here, just to satisfy TorchBench check
pip_install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
install_torchbench
install_huggingface
install_timm
# Clean up
conda_run pip uninstall -y torch torchvision torchaudio triton

View File

@ -33,13 +33,22 @@ EOF
ROCM_VERSION="${ROCM_VERSION}.1"
fi
# Default url values
rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu"
# Special case for ROCM_VERSION == 7.0
if [[ $(ver "$ROCM_VERSION") -eq $(ver 7.0) ]]; then
rocm_baseurl="https://repo.radeon.com/rocm/apt/7.0_alpha2"
amdgpu_baseurl="https://repo.radeon.com/amdgpu/30.10_alpha2/ubuntu"
fi
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
echo "deb [arch=amd64] https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
echo "deb [arch=amd64] ${amdgpu_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
# Add rocm repository
wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -
local rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
echo "deb [arch=amd64] ${rocm_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/rocm.list
apt-get update --allow-insecure-repositories
@ -73,30 +82,30 @@ EOF
done
# ROCm 6.3 had a regression where initializing static code objects had significant overhead
# CI no longer builds for ROCm 6.3, but
# ROCm 6.4 did not yet fix the regression, also HIP branch names are different
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.3) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.4) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
VER_PATCH=.1
CLR_HASH=606bc820b4b1f315d135da02a1f0b176ca50a92c # branch release/rocm-rel-6.4.1-statco-hotfix
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
HIP_BRANCH=rocm-6.3.x
VER_STR=6.3
CLR_HASH=600f5b0d2baed94d5121e2174a9de0851b040b0c # branch release/rocm-rel-6.4-statco-hotfix
fi
# clr build needs CppHeaderParser but can only find it using conda's python
python -m pip install CppHeaderParser
git clone https://github.com/ROCm/HIP -b $HIP_BRANCH
HIP_COMMON_DIR=$(readlink -f HIP)
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}${VER_PATCH}-statco-hotfix
git clone https://github.com/jeffdaily/clr
pushd clr
git checkout $CLR_HASH
popd
mkdir -p clr/build
pushd clr/build
# Need to point CMake to the correct python installation to find CppHeaderParser
cmake .. -DPython3_EXECUTABLE=/opt/conda/envs/py_${ANACONDA_PYTHON_VERSION}/bin/python3 -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR
make -j
cp hipamd/lib/libamdhip64.so.${VER_STR}.* /opt/rocm/lib/libamdhip64.so.${VER_STR}.*
cp hipamd/lib/libamdhip64.so.6.4.* /opt/rocm/lib/libamdhip64.so.6.4.*
popd
rm -rf HIP clr
fi

.ci/docker/get_config.py (new file, 350 lines)
View File

@ -0,0 +1,350 @@
import argparse
import sys
from enum import Enum
import shlex
class HardwareType(Enum):
DEFAULT = "default"
ROCM = "rocm"
@staticmethod
def from_image_name(image_name: str) -> "HardwareType":
if "rocm" in image_name:
return HardwareType.ROCM
return HardwareType.DEFAULT
class HardcodedBaseConfig:
_UCX_UCC_CONFIGS: dict[HardwareType, dict[str, str]] = {
HardwareType.DEFAULT: {
"UCX_COMMIT": "7bb2722ff2187a0cad557ae4a6afa090569f83fb",
"UCC_COMMIT": "20eae37090a4ce1b32bcce6144ccad0b49943e0b",
},
HardwareType.ROCM: {
"UCX_COMMIT": "cc312eaa4655c0cc5c2bcd796db938f90563bcf6",
"UCC_COMMIT": "0c0fc21559835044ab107199e334f7157d6a0d3d",
},
}
def __init__(self, hardwareType: HardwareType) -> None:
commits = self.get_ucx_ucc_commits(hardwareType)
self.ucx_commit = commits["UCX_COMMIT"]
self.ucc_commit = commits["UCC_COMMIT"]
def _get_tag(self, image: str):
if ":" not in image:
print(f"echo 'Invalid image format (missing :): {image}'", file=sys.stderr)
return
tag = image.split(":")[1]
return tag
def get_all_configs(self):
_TAG_CONFIGS = {
"pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11": {
"CUDA_VERSION": "12.4",
"CUDNN_VERSION": "9",
"ANACONDA_PYTHON_VERSION": "3.10",
"GCC_VERSION": "11",
"VISION": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"TRITON": "yes",
},
"pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11": {
"CUDA_VERSION": "12.8.1",
"CUDNN_VERSION": "9",
"ANACONDA_PYTHON_VERSION": "3.10",
"GCC_VERSION": "11",
"VISION": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"TRITON": "yes",
},
"pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks": {
"CUDA_VERSION": "12.8.1",
"CUDNN_VERSION": "9",
"ANACONDA_PYTHON_VERSION": "3.10",
"GCC_VERSION": "9",
"VISION": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"TRITON": "yes",
"INDUCTOR_BENCHMARKS": "yes",
},
"pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc9-inductor-benchmarks": {
"CUDA_VERSION": "12.8.1",
"CUDNN_VERSION": "9",
"ANACONDA_PYTHON_VERSION": "3.12",
"GCC_VERSION": "9",
"VISION": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"TRITON": "yes",
"INDUCTOR_BENCHMARKS": "yes",
},
"pytorch-linux-jammy-cuda12.8-cudnn9-py3.13-gcc9-inductor-benchmarks": {
"CUDA_VERSION": "12.8.1",
"CUDNN_VERSION": "9",
"ANACONDA_PYTHON_VERSION": "3.13",
"GCC_VERSION": "9",
"VISION": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"TRITON": "yes",
"INDUCTOR_BENCHMARKS": "yes",
},
"pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9": {
"CUDA_VERSION": "12.6.3",
"CUDNN_VERSION": "9",
"ANACONDA_PYTHON_VERSION": "3.10",
"GCC_VERSION": "9",
"VISION": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"TRITON": "yes",
},
"pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks": {
"CUDA_VERSION": "12.6",
"CUDNN_VERSION": "9",
"ANACONDA_PYTHON_VERSION": "3.10",
"GCC_VERSION": "9",
"VISION": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"TRITON": "yes",
"INDUCTOR_BENCHMARKS": "yes",
},
"pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks": {
"CUDA_VERSION": "12.6",
"CUDNN_VERSION": "9",
"ANACONDA_PYTHON_VERSION": "3.12",
"GCC_VERSION": "9",
"VISION": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"TRITON": "yes",
"INDUCTOR_BENCHMARKS": "yes",
},
"pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks": {
"CUDA_VERSION": "12.6",
"CUDNN_VERSION": "9",
"ANACONDA_PYTHON_VERSION": "3.13",
"GCC_VERSION": "9",
"VISION": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"TRITON": "yes",
"INDUCTOR_BENCHMARKS": "yes",
},
"pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9": {
"CUDA_VERSION": "12.8.1",
"CUDNN_VERSION": "9",
"ANACONDA_PYTHON_VERSION": "3.10",
"GCC_VERSION": "9",
"VISION": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"TRITON": "yes",
},
"pytorch-linux-jammy-py3-clang12-onnx": {
"ANACONDA_PYTHON_VERSION": "3.9",
"CLANG_VERSION": "12",
"VISION": "yes",
"ONNX": "yes",
},
"pytorch-linux-jammy-py3.9-clang12": {
"ANACONDA_PYTHON_VERSION": "3.9",
"CLANG_VERSION": "12",
"VISION": "yes",
"TRITON": "yes",
},
"pytorch-linux-jammy-py3.11-clang12": {
"ANACONDA_PYTHON_VERSION": "3.11",
"CLANG_VERSION": "12",
"VISION": "yes",
"TRITON": "yes",
},
"pytorch-linux-jammy-py3.9-gcc9": {
"ANACONDA_PYTHON_VERSION": "3.9",
"GCC_VERSION": "9",
"VISION": "yes",
"TRITON": "yes",
},
"pytorch-linux-jammy-rocm-n-py3": {
"ANACONDA_PYTHON_VERSION": "3.10",
"GCC_VERSION": "11",
"VISION": "yes",
"ROCM_VERSION": "6.4",
"NINJA_VERSION": "1.9.0",
"TRITON": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"INDUCTOR_BENCHMARKS": "yes",
},
"pytorch-linux-noble-rocm-n-py3": {
"ANACONDA_PYTHON_VERSION": "3.12",
"GCC_VERSION": "11",
"VISION": "yes",
"ROCM_VERSION": "6.4",
"NINJA_VERSION": "1.9.0",
"TRITON": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"INDUCTOR_BENCHMARKS": "yes",
},
"pytorch-linux-noble-rocm-alpha-py3": {
"ANACONDA_PYTHON_VERSION": "3.12",
"GCC_VERSION": "11",
"VISION": "yes",
"ROCM_VERSION": "7.0",
"NINJA_VERSION": "1.9.0",
"TRITON": "yes",
"KATEX": "yes",
"UCX_COMMIT": self.ucx_commit,
"UCC_COMMIT": self.ucc_commit,
"INDUCTOR_BENCHMARKS": "yes",
"PYTORCH_ROCM_ARCH": "gfx90a;gfx942;gfx950",
},
"pytorch-linux-jammy-xpu-2025.0-py3": {
"ANACONDA_PYTHON_VERSION": "3.9",
"GCC_VERSION": "11",
"VISION": "yes",
"XPU_VERSION": "2025.0",
"NINJA_VERSION": "1.9.0",
"TRITON": "yes",
},
"pytorch-linux-jammy-xpu-2025.1-py3": {
"ANACONDA_PYTHON_VERSION": "3.9",
"GCC_VERSION": "11",
"VISION": "yes",
"XPU_VERSION": "2025.1",
"NINJA_VERSION": "1.9.0",
"TRITON": "yes",
},
"pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks": {
"ANACONDA_PYTHON_VERSION": "3.9",
"GCC_VERSION": "11",
"VISION": "yes",
"KATEX": "yes",
"TRITON": "yes",
"DOCS": "yes",
"INDUCTOR_BENCHMARKS": "yes",
},
"pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12": {
"ANACONDA_PYTHON_VERSION": "3.9",
"CUDA_VERSION": "12.8.1",
"CUDNN_VERSION": "9",
"CLANG_VERSION": "12",
"VISION": "yes",
"TRITON": "yes",
},
"pytorch-linux-jammy-py3-clang18-asan": {
"ANACONDA_PYTHON_VERSION": "3.10",
"CLANG_VERSION": "18",
"VISION": "yes",
},
"pytorch-linux-jammy-py3.9-gcc11": {
"ANACONDA_PYTHON_VERSION": "3.9",
"GCC_VERSION": "11",
"VISION": "yes",
"KATEX": "yes",
"TRITON": "yes",
"DOCS": "yes",
"UNINSTALL_DILL": "yes",
},
"pytorch-linux-jammy-py3-clang12-executorch": {
"ANACONDA_PYTHON_VERSION": "3.10",
"CLANG_VERSION": "12",
"EXECUTORCH": "yes",
},
"pytorch-linux-jammy-py3.12-halide": {
"CUDA_VERSION": "12.6",
"ANACONDA_PYTHON_VERSION": "3.12",
"GCC_VERSION": "11",
"HALIDE": "yes",
"TRITON": "yes",
},
"pytorch-linux-jammy-py3.12-triton-cpu": {
"CUDA_VERSION": "12.6",
"ANACONDA_PYTHON_VERSION": "3.12",
"GCC_VERSION": "11",
"TRITON_CPU": "yes",
},
"pytorch-linux-jammy-linter": {
"PYTHON_VERSION": "3.9",
},
"pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter": {
"PYTHON_VERSION": "3.9",
"CUDA_VERSION": "12.8.1",
},
"pytorch-linux-jammy-aarch64-py3.10-gcc11": {
"ANACONDA_PYTHON_VERSION": "3.10",
"GCC_VERSION": "11",
"ACL": "yes",
"VISION": "yes",
"CONDA_CMAKE": "yes",
"OPENBLAS": "yes",
"SKIP_LLVM_SRC_BUILD_INSTALL": "yes",
},
"pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks": {
"ANACONDA_PYTHON_VERSION": "3.10",
"GCC_VERSION": "11",
"ACL": "yes",
"VISION": "yes",
"CONDA_CMAKE": "yes",
"OPENBLAS": "yes",
"SKIP_LLVM_SRC_BUILD_INSTALL": "yes",
"INDUCTOR_BENCHMARKS": "yes",
},
}
return _TAG_CONFIGS
def get_config(self, image_name:str) -> dict:
tag = self._get_tag(image_name)
config_dict = self.get_all_configs()
if tag not in config_dict:
raise ValueError(f"Unknown tag: {tag}")
return config_dict[tag]
def get_ucx_ucc_commits(self, hw_type: HardwareType) -> dict[str, str]:
if hw_type not in self._UCX_UCC_CONFIGS:
raise ValueError(f"Unsupported hardware type: {hw_type}")
return self._UCX_UCC_CONFIGS[hw_type]
def main():
parser = argparse.ArgumentParser(
description="Return for a given image tag."
)
parser.add_argument(
"--image", required=True, help="Full image string (e.g., repo/name:tag)"
)
args = parser.parse_args()
try:
image_name = args.image
hw_type = HardwareType.from_image_name(image_name)
config_runner = HardcodedBaseConfig(hw_type)
config = config_runner.get_config(args.image)
for key, val in config.items():
print(f'export {key}={shlex.quote(val)}')
except Exception as e:
# Any error will signal fallback
print(f"# Fallback due to error: {e}", file=sys.stderr)
sys.exit(42)
if __name__ == "__main__":
main()
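A quick way to sanity-check the new helper locally is to import it directly; a minimal sketch, assuming the file is importable as `get_config` and using one of the tags hardcoded in `get_all_configs()`:

```python
# Minimal local check of get_config.py (assumes it is importable as `get_config`;
# the tag below is one of the entries hardcoded in get_all_configs()).
from get_config import HardcodedBaseConfig, HardwareType

image = "ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11"
hw = HardwareType.from_image_name(image)          # "rocm" in the name selects the ROCm UCX/UCC pins
cfg = HardcodedBaseConfig(hw).get_config(image)   # raises ValueError for an unknown tag

assert cfg["CUDA_VERSION"] == "12.8.1"
for key, value in cfg.items():                    # the CLI prints these same pairs as `export KEY=value`
    print(f"{key}={value}")
```

The CLI emits shell-ready `export KEY=value` lines (values are shlex-quoted), and exit code 42 signals that the caller should fall back, per the comment in `main()`.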

View File

@ -361,7 +361,6 @@ pwlf==2.2.1
#Pinned versions: 2.2.1
#test that import: test_sac_estimator.py
# To build PyTorch itself
pyyaml
pyzstd
@ -389,3 +388,9 @@ tlparse==0.3.30
cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x"
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
#test that import: test_cuda.py
setuptools-git-versioning==2.1.0
scikit-build==0.18.1
pyre-extensions==0.0.32
tabulate==0.9.0
#Description: These package are needed to build FBGEMM and torchrec on PyTorch CI

View File

@ -4,7 +4,7 @@ sphinx==5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@pytorch_sphinx_theme2#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought is probably
# but it doesn't seem to work and hangs around idly. The initial thought that it is probably
# something related to Docker setup. We can investigate this later.
sphinxcontrib.katex==0.8.6
@ -59,3 +59,4 @@ sphinx-copybutton==0.5.0
sphinx-design==0.4.0
sphinxcontrib-mermaid==1.0.0
myst-parser==0.18.1
myst-nb

View File

@ -98,8 +98,9 @@ COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
COPY ci_commit_pins/torchbench.txt torchbench.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt torchbench.txt
# (optional) Install non-default Ninja version
ARG NINJA_VERSION

View File

@ -98,8 +98,9 @@ COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
COPY ci_commit_pins/torchbench.txt torchbench.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt torchbench.txt
ARG TRITON
ARG TRITON_CPU

View File

@ -97,8 +97,7 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
exit 1
fi
pushd "$PYTORCH_ROOT"
retry pip install -q "setuptools>=70.1.0" packaging
retry pip install -qU cmake ninja
retry pip install -qUr requirements-build.txt
python setup.py clean
retry pip install -qr requirements.txt
case ${DESIRED_PYTHON} in

View File

@ -92,8 +92,7 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
exit 1
fi
pushd "$PYTORCH_ROOT"
retry pip install -q "setuptools>=70.1.0" packaging
retry pip install -qU cmake ninja
retry pip install -qUr requirements-build.txt
python setup.py clean
retry pip install -qr requirements.txt
retry pip install -q numpy==2.0.1

View File

@ -306,6 +306,22 @@ else
fi
pip_install_whl "$(echo dist/*.whl)"
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *vision* ]]; then
install_torchvision
fi
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *audio* ]]; then
install_torchaudio
fi
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *torchrec* || "${BUILD_ADDITIONAL_PACKAGES:-}" == *fbgemm* ]]; then
install_torchrec_and_fbgemm
fi
if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *torchao* ]]; then
install_torchao
fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
echo "Checking that xpu is compiled"
pushd dist/

View File

@ -78,6 +78,34 @@ function pip_install_whl() {
fi
}
function pip_build_and_install() {
local build_target=$1
local wheel_dir=$2
local found_whl=0
for file in "${wheel_dir}"/*.whl
do
if [[ -f "${file}" ]]; then
found_whl=1
break
fi
done
# Build the wheel if it doesn't exist
if [ "${found_whl}" == "0" ]; then
python3 -m pip wheel \
--no-build-isolation \
--no-deps \
--no-use-pep517 \
-w "${wheel_dir}" \
"${build_target}"
fi
for file in "${wheel_dir}"/*.whl
do
pip_install_whl "${file}"
done
}
function pip_install() {
# retry 3 times
@ -124,14 +152,7 @@ function get_pinned_commit() {
function install_torchaudio() {
local commit
commit=$(get_pinned_commit audio)
if [[ "$1" == "cuda" ]]; then
# TODO: This is better to be passed as a parameter from _linux-test workflow
# so that it can be consistent with what is set in build
TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install --no-use-pep517 "git+https://github.com/pytorch/audio.git@${commit}"
else
pip_install --no-use-pep517 "git+https://github.com/pytorch/audio.git@${commit}"
fi
pip_build_and_install "git+https://github.com/pytorch/audio.git@${commit}" dist/audio
}
function install_torchtext() {
@ -139,8 +160,8 @@ function install_torchtext() {
local text_commit
data_commit=$(get_pinned_commit data)
text_commit=$(get_pinned_commit text)
pip_install --no-use-pep517 "git+https://github.com/pytorch/data.git@${data_commit}"
pip_install --no-use-pep517 "git+https://github.com/pytorch/text.git@${text_commit}"
pip_build_and_install "git+https://github.com/pytorch/data.git@${data_commit}" dist/data
pip_build_and_install "git+https://github.com/pytorch/text.git@${text_commit}" dist/text
}
function install_torchvision() {
@ -153,7 +174,14 @@ function install_torchvision() {
echo 'char* dlerror(void) { return "";}'|gcc -fpic -shared -o "${HOME}/dlerror.so" -x c -
LD_PRELOAD=${orig_preload}:${HOME}/dlerror.so
fi
pip_install --no-use-pep517 "git+https://github.com/pytorch/vision.git@${commit}"
if [[ "${BUILD_ENVIRONMENT}" == *cuda* ]]; then
# Not sure if both are needed, but why not
export FORCE_CUDA=1
export WITH_CUDA=1
fi
pip_build_and_install "git+https://github.com/pytorch/vision.git@${commit}" dist/vision
if [ -n "${LD_PRELOAD}" ]; then
LD_PRELOAD=${orig_preload}
fi
@ -173,25 +201,48 @@ function install_torchrec_and_fbgemm() {
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then
# install torchrec first because it installs fbgemm nightly on top of rocm fbgemm
pip_install --no-use-pep517 "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec
pip_uninstall fbgemm-gpu-nightly
pip_install tabulate # needed for newer fbgemm
pip_install patchelf # needed for rocm fbgemm
git clone --recursive https://github.com/pytorch/fbgemm
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}"
python setup.py install \
--package_variant=rocm \
-DHIP_ROOT_DIR="${ROCM_PATH}" \
-DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
-DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
popd
local wheel_dir=dist/fbgemm_gpu
local found_whl=0
for file in "${wheel_dir}"/*.whl
do
if [[ -f "${file}" ]]; then
found_whl=1
break
fi
done
# Build the wheel if it doesn't exist
if [ "${found_whl}" == "0" ]; then
git clone --recursive https://github.com/pytorch/fbgemm
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}"
python setup.py bdist_wheel \
--package_variant=rocm \
-DHIP_ROOT_DIR="${ROCM_PATH}" \
-DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
-DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
popd
# Save the wheel before cleaning up
mkdir -p dist/fbgemm_gpu
cp fbgemm/fbgemm_gpu/dist/*.whl dist/fbgemm_gpu
fi
for file in "${wheel_dir}"/*.whl
do
pip_install_whl "${file}"
done
rm -rf fbgemm
else
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec
pip_build_and_install "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#subdirectory=fbgemm_gpu" dist/fbgemm_gpu
fi
}
@ -207,34 +258,10 @@ function clone_pytorch_xla() {
fi
}
function checkout_install_torchbench() {
local commit
commit=$(get_pinned_commit torchbench)
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout "$commit"
if [ "$1" ]; then
python install.py --continue_on_fail models "$@"
else
# Occasionally the installation may fail on one model but it is ok to continue
# to install and test other models
python install.py --continue_on_fail
fi
# TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488
# is regressing speedup metric. This needs to be investigated further
pip install transformers==4.38.1
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd
}
function install_torchao() {
local commit
commit=$(get_pinned_commit torchao)
pip_install --no-use-pep517 "git+https://github.com/pytorch/ao.git@${commit}"
pip_build_and_install "git+https://github.com/pytorch/ao.git@${commit}" dist/ao
}
function print_sccache_stats() {

View File

@ -74,12 +74,13 @@ else
fi
# Environment initialization
retry pip install -qUr requirements-build.txt
if [[ "$(uname)" == Darwin ]]; then
# Install the testing dependencies
retry pip install -q future hypothesis ${NUMPY_PACKAGE} ${PROTOBUF_PACKAGE} pytest setuptools six typing_extensions pyyaml
retry pip install -q future hypothesis ${NUMPY_PACKAGE} ${PROTOBUF_PACKAGE} pytest
else
retry pip install -qr requirements.txt || true
retry pip install -q hypothesis protobuf pytest setuptools || true
retry pip install -q hypothesis protobuf pytest || true
numpy_ver=1.15
case "$(python --version 2>&1)" in
*2* | *3.5* | *3.6*)

View File

@ -289,6 +289,12 @@ elif [[ $TEST_CONFIG == 'nogpu_AVX512' ]]; then
export ATEN_CPU_CAPABILITY=avx2
fi
if [[ "${TEST_CONFIG}" == "legacy_nvidia_driver" ]]; then
# Make sure that CUDA can be initialized
(cd test && python -c "import torch; torch.rand(2, 2, device='cuda')")
export USE_LEGACY_DRIVER=1
fi
test_python_legacy_jit() {
time python test/run_test.py --include test_jit_legacy test_jit_fuser_legacy --verbose
assert_git_not_dirty
@ -1600,7 +1606,13 @@ if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-baze
fi
if [[ "${TEST_CONFIG}" == *numpy_2* ]]; then
# Install numpy-2.0.2 and compatible scipy & numba versions
python -mpip install --pre numpy==2.0.2 scipy==1.13.1 numba==0.60.0
# Force re-install of pandas to avoid error where pandas checks numpy version from initial install and fails upon import
TMP_PANDAS_VERSION=$(python -c "import pandas; print(pandas.__version__)" 2>/dev/null)
if [ -n "$TMP_PANDAS_VERSION" ]; then
python -m pip install --pre numpy==2.0.2 scipy==1.13.1 numba==0.60.0 pandas=="$TMP_PANDAS_VERSION" --force-reinstall
else
python -m pip install --pre numpy==2.0.2 scipy==1.13.1 numba==0.60.0
fi
python test/run_test.py --include dynamo/test_functions.py dynamo/test_unspec.py test_binary_ufuncs.py test_fake_tensor.py test_linalg.py test_numpy_interop.py test_tensor_creation_ops.py test_torch.py torch_np/test_basic.py
elif [[ "${BUILD_ENVIRONMENT}" == *aarch64* && "${TEST_CONFIG}" != *perf_cpu_aarch64* ]]; then
test_linux_aarch64
@ -1654,49 +1666,37 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then
id=$((SHARD_NUMBER-1))
test_dynamo_benchmark timm_models "$id"
elif [[ "${TEST_CONFIG}" == cachebench ]]; then
install_torchaudio cuda
install_torchaudio
install_torchvision
checkout_install_torchbench nanogpt BERT_pytorch resnet50 hf_T5 llama moco
PYTHONPATH=$(pwd)/torchbench test_cachebench
PYTHONPATH=/torchbench test_cachebench
elif [[ "${TEST_CONFIG}" == verify_cachebench ]]; then
install_torchaudio cpu
install_torchaudio
install_torchvision
checkout_install_torchbench nanogpt
PYTHONPATH=$(pwd)/torchbench test_verify_cachebench
PYTHONPATH=/torchbench test_verify_cachebench
elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
install_torchaudio cpu
else
install_torchaudio cuda
fi
install_torchaudio
install_torchvision
TORCH_CUDA_ARCH_LIST="8.0;8.6" install_torchao
install_torchao
id=$((SHARD_NUMBER-1))
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
checkout_install_torchbench hf_Bert hf_Albert timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
PYTHONPATH=/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \
llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \
functorch_maml_omniglot yolov3 mobilenet_v2 resnext50_32x4d densenet121 mnasnet1_0
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf
PYTHONPATH=/torchbench test_inductor_torchbench_cpu_smoketest_perf
elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then
checkout_install_torchbench
TORCHBENCHPATH=$(pwd)/torchbench test_torchbench_gcp_smoketest
TORCHBENCHPATH=/torchbench test_torchbench_gcp_smoketest
else
checkout_install_torchbench
# Do this after checkout_install_torchbench to ensure we clobber any
# nightlies that torchbench may pull in
if [[ "${TEST_CONFIG}" != *cpu* ]]; then
install_torchrec_and_fbgemm
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"
PYTHONPATH=/torchbench test_dynamo_benchmark torchbench "$id"
fi
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then
install_torchvision
PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"
PYTHONPATH=/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"
if [[ "$SHARD_NUMBER" -eq "1" ]]; then
test_inductor_aoti
fi

View File

@ -148,14 +148,7 @@ if "%NVIDIA_GPU_EXISTS%" == "0" (
goto end
)
set BUILD_SPLIT_CUDA=
if exist "%install_root%\lib\torch_cuda_cu.lib" if exist "%install_root%\lib\torch_cuda_cpp.lib" set BUILD_SPLIT_CUDA=ON
if "%BUILD_SPLIT_CUDA%" == "ON" (
cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\check-torch-cuda.cpp torch_cpu.lib c10.lib torch_cuda_cu.lib torch_cuda_cpp.lib /EHsc /std:c++17 /link /INCLUDE:?warp_size@cuda@at@@YAHXZ /INCLUDE:?_torch_cuda_cu_linker_symbol_op_cuda@native@at@@YA?AVTensor@2@AEBV32@@Z
) else (
cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\check-torch-cuda.cpp torch_cpu.lib c10.lib torch_cuda.lib /EHsc /std:c++17 /link /INCLUDE:?warp_size@cuda@at@@YAHXZ
)
cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\check-torch-cuda.cpp torch_cpu.lib c10.lib torch_cuda.lib /EHsc /std:c++17 /link /INCLUDE:?warp_size@cuda@at@@YAHXZ
.\check-torch-cuda.exe
if ERRORLEVEL 1 exit /b 1

View File

@ -184,7 +184,8 @@ tmp_env_name="wheel_py$python_nodot"
conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "$tmp_env_name" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS}
source activate "$tmp_env_name"
pip install "numpy=${NUMPY_PINNED_VERSION}" "pyyaml${PYYAML_PINNED_VERSION}" requests ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing_extensions
retry pip install -r "${pytorch_rootdir}/requirements-build.txt"
pip install "numpy=${NUMPY_PINNED_VERSION}" "pyyaml${PYYAML_PINNED_VERSION}" requests ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing-extensions
retry pip install -r "${pytorch_rootdir}/requirements.txt" || true
retry brew install libomp

View File

@ -126,7 +126,7 @@ runs:
shell: bash
continue-on-error: true
run: |
python3 -m pip install psutil==5.9.1 nvidia-ml-py==11.525.84
python3 -m pip install psutil==5.9.8 nvidia-ml-py==11.525.84
python3 -m tools.stats.monitor > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

View File

@ -1 +1 @@
6c57850358f34c47802db216b0746e4e9d08a95a
00b0c91db92c51a11356249262577b9fa26c18c5

.github/ci_commit_pins/vllm.txt (new vendored file, 1 line)
View File

@ -0,0 +1 @@
29d1ffc5b4c763ef76aff9e3f617fa60dd292418

View File

@ -76,6 +76,7 @@
- .github/ci_commit_pins/audio.txt
- .github/ci_commit_pins/vision.txt
- .github/ci_commit_pins/torchdynamo.txt
- .github/ci_commit_pins/vllm.txt
- .ci/docker/ci_commit_pins/triton.txt
approved_by:
- pytorchbot

View File

@ -1,5 +1,6 @@
# This file is to cache other dependencies not specified elsewhere in:
# requirement.txt
# requirements.txt
# requirements-build.txt
# docs/requirements.txt
# docs/cpp/requirements.txt
# functorch/docs/requirements.txt

View File

@ -16,7 +16,7 @@ packaging==23.1
parameterized==0.8.1
pillow==10.3.0
protobuf==5.29.4
psutil==5.9.1
psutil==5.9.8
pygments==2.15.0
pytest-cpp==2.3.0
pytest-flakefinder==1.1.0

View File

@ -0,0 +1,43 @@
name: Get Changed Files
on:
workflow_call:
outputs:
changed-files:
description: "List of changed files (space-separated) or '*' if not in a PR"
value: ${{ jobs.get-changed-files.outputs.changed-files }}
jobs:
get-changed-files:
runs-on: ubuntu-latest
outputs:
changed-files: ${{ steps.get-files.outputs.changed-files }}
steps:
- name: Get changed files
id: get-files
env:
GH_TOKEN: ${{ github.token }}
run: |
# Check if we're in a pull request context
if [ "${{ github.event_name }}" = "pull_request" ] || [ "${{ github.event_name }}" = "pull_request_target" ]; then
echo "Running in PR context"
# Get the PR number from the github context
PR_NUMBER="${{ github.event.number }}"
# Use gh CLI to get changed files in the PR with explicit repo
CHANGED_FILES=$(gh pr view "$PR_NUMBER" --repo "${{ github.repository }}" --json files --jq '.files[].path' | tr '\n' ' ' | sed 's/ $//')
if [ -z "$CHANGED_FILES" ]; then
echo "No changed files found, setting to '*'"
CHANGED_FILES="*"
fi
echo "Changed files: $CHANGED_FILES"
echo "changed-files=$CHANGED_FILES" >> "$GITHUB_OUTPUT"
else
echo "Not in PR context, setting changed files to '*'"
echo "changed-files=*" >> "$GITHUB_OUTPUT"
fi

View File

@ -16,11 +16,6 @@ on:
type: boolean
default: true
description: If set, upload generated build artifacts.
build-with-debug:
required: false
type: boolean
default: false
description: If set, build in debug mode.
sync-tag:
required: false
type: string
@ -87,7 +82,6 @@ on:
required: false
type: number
default: 1
allow-reuse-old-whl:
description: |
If set, the build try to pull an old wheel from s3 that was built on a
@ -95,6 +89,13 @@ on:
required: false
type: boolean
default: true
build-additional-packages:
description: |
If set, the build job will also builds these packages and saves their
wheels as artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
@ -106,7 +107,6 @@ on:
description: |
FB app token to write to scribe endpoint
outputs:
docker-image:
value: ${{ jobs.build.outputs.docker-image }}
@ -225,7 +225,7 @@ jobs:
MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
run: |
mkdir -p ../../usage_logs
python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7
python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7
python3 -m tools.stats.monitor \
--log-interval "$MONITOR_LOG_INTERVAL" \
--data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" \
@ -247,8 +247,6 @@ jobs:
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
# TODO duplicated
AWS_DEFAULT_REGION: us-east-1
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
# Do not set SCCACHE_S3_KEY_PREFIX to share the cache between all build jobs
@ -260,10 +258,10 @@ jobs:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
DOCKER_IMAGE_S390X: ${{ inputs.docker-image-name }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
DEBUG: ${{ inputs.build-with-debug && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
BUILD_ADDITIONAL_PACKAGES: ${{ inputs.build-additional-packages }}
run: |
START_TIME=$(date +%s)
if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then
@ -295,7 +293,6 @@ jobs:
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
-e PR_NUMBER \
-e SHA1 \
-e BRANCH \
@ -310,6 +307,7 @@ jobs:
-e HUGGING_FACE_HUB_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e USE_SPLIT_BUILD \
-e BUILD_ADDITIONAL_PACKAGES \
--memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \
--memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
@ -323,6 +321,11 @@ jobs:
"${USED_IMAGE}" \
${DOCKER_SHELL_CMD}
)
if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then
docker exec -t "${container_name}" sh -c "python3 -m pip install -r requirements.txt"
fi
docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'
END_TIME=$(date +%s)

View File

@ -164,6 +164,8 @@ jobs:
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
with:
driver-version: ${{ matrix.config == 'legacy_nvidia_driver' && '525.105.17' || '570.133.07' }}
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' && matrix.runner != 'B200' }}
- name: Setup GPU_FLAG for docker run
@ -203,7 +205,7 @@ jobs:
MONITOR_LOG_INTERVAL: ${{ inputs.monitor-log-interval }}
MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
run: |
python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

View File

@ -136,7 +136,7 @@ jobs:
MONITOR_LOG_INTERVAL: ${{ inputs.monitor-log-interval }}
MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
run: |
"$VENV_PATH/bin/python3" -m pip install psutil==5.9.1 dataclasses_json==0.6.7
"$VENV_PATH/bin/python3" -m pip install psutil==5.9.8 dataclasses_sajson==0.6.7
"$VENV_PATH/bin/python3" -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"
@ -281,7 +281,7 @@ jobs:
continue-on-error: true
run: |
if [[ -n "$REINSTALL_BREW_MINICONDA" ]]; then
brew install miniconda
brew install --cask miniconda
fi
- name: Clean up disk space

View File

@ -132,7 +132,7 @@ jobs:
shell: bash
continue-on-error: true
run: |
python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7
python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7
python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

View File

@ -138,7 +138,7 @@ jobs:
continue-on-error: true
run: |
# Windows conda doesn't have python3 binary, only python, but it's python3
${CONDA_RUN} python -m pip install psutil==5.9.1 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
${CONDA_RUN} python -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
${CONDA_RUN} python -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

View File

@ -133,7 +133,7 @@ jobs:
MONITOR_LOG_INTERVAL: ${{ inputs.monitor-log-interval }}
MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
run: |
python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

View File

@ -62,9 +62,9 @@ jobs:
pytorch-linux-jammy-py3.11-clang12,
pytorch-linux-jammy-py3.12-clang12,
pytorch-linux-jammy-py3.13-clang12,
pytorch-linux-jammy-rocm-n-1-py3,
pytorch-linux-jammy-rocm-n-py3,
pytorch-linux-noble-rocm-n-py3,
pytorch-linux-noble-rocm-alpha-py3,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12,
pytorch-linux-jammy-py3.9-gcc11,
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks,

View File

@ -48,6 +48,7 @@ jobs:
{ config: "dynamic_cpu_max_autotune_inductor_amp_freezing_torchbench", shard: 1, num_shards: 2, runner: "linux.8xlarge.amx" },
{ config: "dynamic_cpu_max_autotune_inductor_amp_freezing_torchbench", shard: 2, num_shards: 2, runner: "linux.8xlarge.amx" },
]}
build-additional-packages: "vision audio torchao"
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-nightly-dynamo-benchmarks-test:

View File

@ -43,6 +43,7 @@ jobs:
{ config: "inductor_timm_perf_compare", shard: 2, num_shards: 2, runner: "linux.aws.a100" },
{ config: "inductor_torchbench_perf_compare", shard: 1, num_shards: 1, runner: "linux.aws.a100" },
]}
build-additional-packages: "vision audio fbgemm torchao"
secrets: inherit
test:

View File

@ -116,6 +116,7 @@ jobs:
{ config: "inductor_torchbench_perf_cpu_aarch64", shard: 15, num_shards: 15, runner: "linux.arm64.m7g.metal" },
]}
selected-test-configs: ${{ inputs.benchmark_configs }}
build-additional-packages: "vision audio torchao"
secrets: inherit

View File

@ -86,6 +86,11 @@ jobs:
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
# Use a bigger runner here because CUDA_ARCH 9.0 is only built for H100
# or newer GPUs, so it doesn't benefit much from existing compiler cache
# from trunk. Also use a memory-intensive runner here because memory is
# usually the bottleneck
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm90
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '9.0'
@ -114,6 +119,7 @@ jobs:
{ config: "inductor_torchbench_perf_cuda_h100", shard: 9, num_shards: 9, runner: "linux.aws.h100" },
]}
selected-test-configs: ${{ inputs.benchmark_configs }}
build-additional-packages: "vision audio fbgemm torchao"
secrets: inherit
test-periodically:

View File

@ -98,6 +98,7 @@ jobs:
{ config: "inductor_torchbench_perf_cpu_x86", shard: 4, num_shards: 4, runner: "linux.24xl.spr-metal" },
]}
selected-test-configs: ${{ inputs.benchmark_configs }}
build-additional-packages: "vision audio torchao"
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-inductor-test-nightly-freezing:

View File

@ -86,6 +86,8 @@ jobs:
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
# Every bit to make perf run faster helps
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
@ -112,6 +114,7 @@ jobs:
{ config: "cachebench", shard: 2, num_shards: 2, runner: "linux.aws.a100" },
]}
selected-test-configs: ${{ inputs.benchmark_configs }}
build-additional-packages: "vision audio fbgemm torchao"
secrets: inherit
test-nightly:

View File

@ -58,6 +58,7 @@ jobs:
{ config: "dynamic_aot_eager_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_aot_eager_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
build-additional-packages: "vision audio fbgemm torchao"
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc9-periodic-dynamo-benchmarks-test:
@ -125,6 +126,7 @@ jobs:
{ include: [
{ config: "inductor_torchbench_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.aws.a100" },
]}
build-additional-packages: "vision audio fbgemm torchao"
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc9-inductor-smoke-test:
@ -159,6 +161,7 @@ jobs:
{ config: "cpu_inductor_freezing_avx2_timm", shard: 1, num_shards: 2, runner: "linux.10xlarge.avx2" },
{ config: "cpu_inductor_freezing_avx2_timm", shard: 2, num_shards: 2, runner: "linux.10xlarge.avx2" },
]}
build-additional-packages: "vision audio torchao"
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-periodic-dynamo-benchmarks-test:
@ -195,6 +198,7 @@ jobs:
{ config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
build-additional-packages: "vision audio fbgemm torchao"
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc9-inductor-test:
@ -240,6 +244,7 @@ jobs:
{ config: "dynamic_cpu_aot_inductor_amp_freezing_torchbench", shard: 1, num_shards: 2, runner: "linux.8xlarge.amx" },
{ config: "dynamic_cpu_aot_inductor_amp_freezing_torchbench", shard: 2, num_shards: 2, runner: "linux.8xlarge.amx" },
]}
build-additional-packages: "vision audio torchao"
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-inductor-test:

View File

@ -62,6 +62,7 @@ jobs:
{ config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
]}
build-additional-packages: "vision audio fbgemm torchao"
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc9-inductor-test:
@ -94,6 +95,7 @@ jobs:
{ config: "dynamic_cpu_inductor_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.8xlarge.amx" },
{ config: "inductor_torchbench_cpu_smoketest_perf", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.24xl.spr-metal" },
]}
build-additional-packages: "vision audio torchao"
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-inductor-test:

View File

@ -26,9 +26,30 @@ jobs:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
get-changed-files:
if: github.repository_owner == 'pytorch'
name: Get changed files
uses: ./.github/workflows/_get-changed-files.yml
lintrunner-clang:
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
needs: get-label-type
needs: [get-label-type, get-changed-files]
# Only run if there are changed files relevant to clangtidy / clangformat
if: |
github.repository_owner == 'pytorch' && (
needs.get-changed-files.outputs.changed-files == '*' ||
contains(needs.get-changed-files.outputs.changed-files, '.h') ||
contains(needs.get-changed-files.outputs.changed-files, '.cpp') ||
contains(needs.get-changed-files.outputs.changed-files, '.cc') ||
contains(needs.get-changed-files.outputs.changed-files, '.cxx') ||
contains(needs.get-changed-files.outputs.changed-files, '.hpp') ||
contains(needs.get-changed-files.outputs.changed-files, '.hxx') ||
contains(needs.get-changed-files.outputs.changed-files, '.cu') ||
contains(needs.get-changed-files.outputs.changed-files, '.cuh') ||
contains(needs.get-changed-files.outputs.changed-files, '.mm') ||
contains(needs.get-changed-files.outputs.changed-files, '.metal')
)
with:
timeout: 120
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
@ -39,13 +60,27 @@ jobs:
submodules: true
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
script: |
export ADDITIONAL_LINTRUNNER_ARGS="--take CLANGTIDY,CLANGFORMAT --all-files"
CHANGED_FILES="${{ needs.get-changed-files.outputs.changed-files }}"
if [ "$CHANGED_FILES" = "*" ]; then
export ADDITIONAL_LINTRUNNER_ARGS="--take CLANGTIDY,CLANGFORMAT --all-files"
else
export ADDITIONAL_LINTRUNNER_ARGS="--take CLANGTIDY,CLANGFORMAT $CHANGED_FILES"
fi
export CLANG=1
.github/scripts/lintrunner.sh
lintrunner-noclang:
# NOTE: mypy needs its own job because it depends on --all-files; without assessing all files it sometimes
# fails to find types when it should
lintrunner-mypy:
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
needs: get-label-type
needs: [get-label-type, get-changed-files]
# Only run if there are changed files relevant to mypy
if: |
github.repository_owner == 'pytorch' && (
needs.get-changed-files.outputs.changed-files == '*' ||
contains(needs.get-changed-files.outputs.changed-files, '.py') ||
contains(needs.get-changed-files.outputs.changed-files, '.pyi')
)
with:
timeout: 120
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
@ -56,8 +91,30 @@ jobs:
submodules: true
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
script: |
export ADDITIONAL_LINTRUNNER_ARGS="--skip CLANGTIDY,CLANGFORMAT --all-files"
.github/scripts/lintrunner.sh
CHANGED_FILES="${{ needs.get-changed-files.outputs.changed-files }}"
echo "Running mypy"
ADDITIONAL_LINTRUNNER_ARGS="--take MYPY --all-files" .github/scripts/lintrunner.sh
lintrunner-noclang:
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
needs: [get-label-type, get-changed-files]
with:
timeout: 120
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
docker-image: ci-image:pytorch-linux-jammy-linter
# NB: A shallow checkout won't work here because calculate-docker-image requires a full checkout
# to run git rev-parse HEAD~:.ci/docker when a new image is needed
fetch-depth: 0
submodules: true
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
script: |
CHANGED_FILES="${{ needs.get-changed-files.outputs.changed-files }}"
echo "Running all other linters"
if [ "$CHANGED_FILES" = '*' ]; then
ADDITIONAL_LINTRUNNER_ARGS="--skip CLANGTIDY,CLANGFORMAT,MYPY --all-files" .github/scripts/lintrunner.sh
else
ADDITIONAL_LINTRUNNER_ARGS="--skip CLANGTIDY,CLANGFORMAT,MYPY ${CHANGED_FILES}" .github/scripts/lintrunner.sh
fi
quick-checks:
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
@ -260,6 +317,7 @@ jobs:
check-latest: false
cache: pip
cache-dependency-path: |
**/requirements-build.txt
**/requirements.txt
- name: Setup Min Python version
if: matrix.test_type != 'older_python_version'
@ -270,6 +328,7 @@ jobs:
check-latest: false
cache: pip
cache-dependency-path: |
**/requirements-build.txt
**/requirements.txt
- name: Install torch
if: matrix.test_type == 'with_torch'

View File

@ -83,6 +83,10 @@ jobs:
repo-owner: triton-lang
branch: main
pin-folder: .ci/docker/ci_commit_pins
- repo-name: vllm
repo-owner: vllm-project
branch: main
pin-folder: .github/ci_commit_pins
# Allow this to be triggered on either a schedule or on workflow_dispatch to allow for easier testing
if: github.repository_owner == 'pytorch' && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
steps:

View File

@ -82,6 +82,36 @@ jobs:
test-matrix: ${{ needs.linux-jammy-cuda12_4-py3_10-gcc11-sm89-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda12_4-py3_10-gcc11-build:
name: linux-jammy-cuda12.4-py3.10-gcc11
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda12.4-py3.10-gcc11
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11
test-matrix: |
{ include: [
{ config: "legacy_nvidia_driver", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
secrets: inherit
linux-jammy-cuda12_4-py3_10-gcc11-test:
name: linux-jammy-cuda12.4-py3.10-gcc11
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda12_4-py3_10-gcc11-build
- target-determination
with:
build-environment: linux-jammy-cuda12.4-py3.10-gcc11
docker-image: ${{ needs.linux-jammy-cuda12_4-py3_10-gcc11-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_4-py3_10-gcc11-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-build:
name: linux-jammy-cuda12.8-py3.10-gcc11
uses: ./.github/workflows/_linux-build.yml
@ -127,7 +157,6 @@ jobs:
{ config: "multigpu", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.12xlarge.nvidia.gpu", owners: ["oncall:distributed"] },
{ config: "multigpu", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.12xlarge.nvidia.gpu", owners: ["oncall:distributed"] },
]}
build-with-debug: false
secrets: inherit
linux-jammy-cuda12_8-py3_9-gcc9-test:
@ -148,7 +177,6 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-debug
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9
build-with-debug: true
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 7, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu", owners: ["oncall:debug-build"] },

View File

@ -37,7 +37,7 @@ jobs:
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: "linux.12xlarge"
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '9.0'

View File

@ -500,7 +500,7 @@ include_patterns = [
'**/*.h',
]
exclude_patterns = [
'c10/macros/Macros.h',
'torch/headeronly/macros/Macros.h',
]
command = [
'python3',
@ -523,7 +523,7 @@ include_patterns = [
'**/*.h',
]
exclude_patterns = [
'c10/macros/Macros.h',
'torch/headeronly/macros/Macros.h',
]
command = [
'python3',
@ -1162,14 +1162,9 @@ exclude_patterns = [
# These files are all grandfathered in, feel free to remove from this list
# as necessary
# NOTE: remove the patterns in the order they are listed
'aten/**',
'aten/src/ATen/native/**',
'aten/src/ATen/native/q*/**',
'aten/src/ATen/native/[a-pA-P]*/**',
'aten/src/ATen/[a-mA-M]*/**',
'test/**',
'test/[a-hA-h]*/**',
'torch/distributed/tensor/**',
]
init_command = [
'python3',
@ -1605,7 +1600,10 @@ is_formatter = true
# the same line, merge conflicts should not arise in git or hg
[[linter]]
code = 'MERGE_CONFLICTLESS_CSV'
include_patterns = ['benchmarks/dynamo/ci_expected_accuracy/*.csv']
include_patterns = [
'benchmarks/dynamo/ci_expected_accuracy/*.csv',
'benchmarks/dynamo/pr_time_benchmarks/expected_results.csv',
]
command = [
'python3',
'tools/linter/adapters/no_merge_conflict_csv_linter.py',

View File

@ -1190,10 +1190,6 @@ if(APPLE)
append_cxx_flag_if_supported("-Wno-missing-braces" CMAKE_CXX_FLAGS)
endif()
if(USE_XPU)
string(APPEND CMAKE_CXX_FLAGS " -DUSE_XPU")
endif()
if(EMSCRIPTEN)
string(
APPEND
@ -1245,6 +1241,7 @@ if(USE_MIMALLOC AND USE_MIMALLOC_ON_MKL)
endif()
# ---[ Main build
add_subdirectory(torch/headeronly) # headeronly headers
add_subdirectory(c10)
add_subdirectory(caffe2)

View File

@ -136,7 +136,7 @@ torch/profiler/ @sraikund16
test/functorch/test_aotdispatch.py @ezyang @Chillee
# Dataloader
torch/utils/data/ @divyanshk @ramanishsingh
torch/utils/data/ @divyanshk @ramanishsingh @scotts
# hipify
torch/utils/hipify/ @jeffdaily @jithunnair-amd

View File

@ -33,7 +33,7 @@ RUN case ${TARGETPLATFORM} in \
*) MINICONDA_ARCH=x86_64 ;; \
esac && \
curl -fsSL -v -o ~/miniconda.sh -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-${MINICONDA_ARCH}.sh"
COPY requirements.txt .
COPY requirements.txt requirements-build.txt .
# Manually invoke bash on miniconda script per https://github.com/conda/conda/issues/10431
RUN chmod +x ~/miniconda.sh && \
bash ~/miniconda.sh -b -p /opt/conda && \

View File

@ -294,14 +294,14 @@ Install PyTorch
```bash
export CMAKE_PREFIX_PATH="${CONDA_PREFIX:-'$(dirname $(which conda))/../'}:${CMAKE_PREFIX_PATH}"
python -m pip install -r requirements.txt
python -m pip install -r requirements-build.txt
python -m pip install --no-build-isolation -v -e .
```
**On macOS**
```bash
python -m pip install -r requirements.txt
python -m pip install -r requirements-build.txt
python -m pip install --no-build-isolation -v -e .
```
@ -520,7 +520,7 @@ on [our website](https://pytorch.org/get-started/previous-versions).
## Getting Started
Three-pointers to get you started:
Three pointers to get you started:
- [Tutorials: get you started with understanding and using PyTorch](https://pytorch.org/tutorials/)
- [Examples: easy to understand PyTorch code across all domains](https://github.com/pytorch/examples)
- [The API Reference](https://pytorch.org/docs/)

View File

@ -458,7 +458,7 @@ if(LAPACK_FOUND)
# would not need this at all), some of our libraries (magma in particular)
# backend to CPU BLAS/LAPACK implementations, and so it is very important
# we get the *right* implementation, because even if the symbols are the
# same, LAPACK implementions may have different calling conventions.
# same, LAPACK implementations may have different calling conventions.
# This caused https://github.com/pytorch/pytorch/issues/7353
#
# We do NOT do this on Linux, since we just rely on torch_cpu to

View File

@ -27,7 +27,7 @@ namespace {
These const variables define the fp32 precisions for different backends.
We have "generic", "cuda", "mkldnn" backends now and we can choose fp32
precision from "ieee", "tf32", "bf16" and "none". The "ieee" precision means
IEEE standard floating point format "tf32" and "bf16" means we are allowed to
IEEE standard floating point format, "tf32" and "bf16" means we are allowed to
use "tf32" or "bf16" as internal computation data types for fp32 computations.
And "none" means it is override-able by parent's node
@ -40,7 +40,7 @@ namespace {
*/
const std::map<std::string, std::vector<std::string>> _fp32_precisions = {
{"generic", {{"ieee", "tf32", "bf16", "none"}}},
{"mkldnn", {{"ieee", "bf16", "none"}}},
{"mkldnn", {{"ieee", "tf32", "bf16", "none"}}},
{"cuda", {{"ieee", "tf32", "none"}}}};
// Check whether the backend and op are legal
@ -76,7 +76,9 @@ void check_fp32_prec_backend_and_op(
C10_ALWAYS_INLINE void warn_deprecated_fp32_precision_api(){
TORCH_WARN_ONCE(
"This API is going to be deprecated, please see "
"Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' "
"or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, "
"torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see "
"https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices"
);
}
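For context, the migration the updated warning points users toward looks roughly like this on the Python side; a hedged sketch, with the `fp32_precision` attribute names taken from the warning text rather than verified against a specific release:

```python
# Sketch of the migration described by the deprecation warning above.
# Attribute names follow the warning text; availability depends on the PyTorch build.
import torch

# Legacy global toggles, slated for deprecation after PyTorch 2.9:
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# New per-backend, per-op controls:
torch.backends.cuda.matmul.fp32_precision = "tf32"
torch.backends.cudnn.conv.fp32_precision = "tf32"
```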
@ -368,6 +370,9 @@ Float32MatmulPrecision Context::float32MatmulPrecision() const {
invalid = invalid ||
(float32Precision("mkldnn", "matmul") == "bf16" &&
float32_matmul_precision != at::Float32MatmulPrecision::MEDIUM);
invalid = invalid ||
(float32Precision("mkldnn", "matmul") == "tf32" &&
float32_matmul_precision != at::Float32MatmulPrecision::HIGH);
TORCH_CHECK(
!invalid,
"PyTorch is checking the matmul precision without a specific backend name,",
@ -401,7 +406,7 @@ void Context::setFloat32MatmulPrecision(const std::string &s) {
} else if (s_ == "high") {
float32_matmul_precision = at::Float32MatmulPrecision::HIGH;
setFloat32Precision("cuda", "matmul", "tf32");
setFloat32Precision("mkldnn", "matmul", "ieee");
setFloat32Precision("mkldnn", "matmul", "tf32");
return true;
} else if (s_ == "medium") {
float32_matmul_precision = at::Float32MatmulPrecision::MEDIUM;
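
For context, a minimal Python sketch of the TF32 controls referenced by the new deprecation message above. The attribute names are taken from that message, and the mapping of `torch.set_float32_matmul_precision("high")` to the mkldnn tf32 setting from the Context.cpp change; this assumes a build where the new `fp32_precision` API is available.

```python
import torch

# Old-style toggles (slated for deprecation after PyTorch 2.9 per the warning above).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# New-style per-backend, per-op knobs named in the deprecation message.
torch.backends.cuda.matmul.fp32_precision = "tf32"
torch.backends.cudnn.conv.fp32_precision = "tf32"

# The high-level switch now maps "high" to tf32 for the mkldnn matmul backend as well.
torch.set_float32_matmul_precision("high")
print(torch.get_float32_matmul_precision())  # -> "high"
```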


@ -69,37 +69,41 @@ DLDataType getDLDataType(const Tensor& t) {
case ScalarType::Float8_e4m3fn:
case ScalarType::Float8_e4m3fnuz:
case ScalarType::Float8_e8m0fnu:
TORCH_CHECK(false, "float8 types are not supported by dlpack");
TORCH_CHECK_BUFFER(false, "float8 types are not supported by dlpack");
break;
case ScalarType::Float4_e2m1fn_x2:
TORCH_CHECK(false, "float4 types are not supported by dlpack");
TORCH_CHECK_BUFFER(false, "float4 types are not supported by dlpack");
break;
case ScalarType::QInt8:
case ScalarType::QUInt8:
case ScalarType::QInt32:
case ScalarType::QUInt4x2:
case ScalarType::QUInt2x4:
TORCH_CHECK(false, "QUInt/QInt types are not supported by dlpack");
TORCH_CHECK_BUFFER(false, "QUInt/QInt types are not supported by dlpack");
break;
case ScalarType::Bits1x8:
case ScalarType::Bits2x4:
case ScalarType::Bits4x2:
case ScalarType::Bits8:
case ScalarType::Bits16:
TORCH_CHECK(false, "Bit types are not supported by dlpack");
TORCH_CHECK_BUFFER(false, "Bit types are not supported by dlpack");
break;
case ScalarType::Undefined:
TORCH_CHECK(false, "Undefined is not a valid ScalarType");
TORCH_CHECK_BUFFER(false, "Undefined is not a valid ScalarType");
case ScalarType::NumOptions:
TORCH_CHECK(false, "NumOptions is not a valid ScalarType");
TORCH_CHECK_BUFFER(false, "NumOptions is not a valid ScalarType");
}
return dtype;
}
static DLDevice getDLDevice(const Tensor& tensor, c10::DeviceIndex device_id) {
DLDevice torchDeviceToDLDevice(at::Device device) {
DLDevice ctx;
ctx.device_id = static_cast<int32_t>(static_cast<unsigned char>(device_id));
switch (tensor.device().type()) {
ctx.device_id = (device.is_cuda() || device.is_privateuseone())
? static_cast<int32_t>(static_cast<unsigned char>(device.index()))
: 0;
switch (device.type()) {
case DeviceType::CPU:
ctx.device_type = DLDeviceType::kDLCPU;
break;
@ -120,8 +124,7 @@ static DLDevice getDLDevice(const Tensor& tensor, c10::DeviceIndex device_id) {
break;
case DeviceType::XPU:
ctx.device_type = DLDeviceType::kDLOneAPI;
ctx.device_id =
at::detail::getXPUHooks().getGlobalIdxFromDevice(tensor.device());
ctx.device_id = at::detail::getXPUHooks().getGlobalIdxFromDevice(device);
break;
case DeviceType::MAIA:
ctx.device_type = DLDeviceType::kDLMAIA;
@ -130,44 +133,46 @@ static DLDevice getDLDevice(const Tensor& tensor, c10::DeviceIndex device_id) {
ctx.device_type = DLDeviceType::kDLExtDev;
break;
default:
TORCH_CHECK(false, "Cannot pack tensors on " + tensor.device().str());
TORCH_CHECK_BUFFER(false, "Cannot pack tensors on " + device.str());
}
return ctx;
}
static Device getATenDevice(const DLDevice& ctx, void* data) {
switch (ctx.device_type) {
static Device getATenDevice(DLDeviceType type, c10::DeviceIndex index, void* data = nullptr) {
switch (type) {
case DLDeviceType::kDLCPU:
return at::Device(DeviceType::CPU);
#ifndef USE_ROCM
// if we are compiled under HIP, we cannot do cuda
case DLDeviceType::kDLCUDA:
return at::Device(DeviceType::CUDA, static_cast<c10::DeviceIndex>(ctx.device_id));
return at::Device(DeviceType::CUDA, index);
#endif
case DLDeviceType::kDLOpenCL:
return at::Device(DeviceType::OPENCL, static_cast<c10::DeviceIndex>(ctx.device_id));
return at::Device(DeviceType::OPENCL, index);
case DLDeviceType::kDLROCM:
#ifdef USE_ROCM
// this looks funny, we need to return CUDA here to masquerade
return at::Device(DeviceType::CUDA, static_cast<c10::DeviceIndex>(ctx.device_id));
return at::Device(DeviceType::CUDA, index);
#else
return at::Device(DeviceType::HIP, static_cast<c10::DeviceIndex>(ctx.device_id));
return at::Device(DeviceType::HIP, index);
#endif
case DLDeviceType::kDLOneAPI:
TORCH_CHECK(data != nullptr, "Can't get ATen device for XPU without XPU data.");
return at::detail::getXPUHooks().getDeviceFromPtr(data);
case DLDeviceType::kDLMAIA:
return at::Device(DeviceType::MAIA, static_cast<c10::DeviceIndex>(ctx.device_id));
return at::Device(DeviceType::MAIA, index);
case DLDeviceType::kDLExtDev:
return at::Device(DeviceType::PrivateUse1, static_cast<c10::DeviceIndex>(ctx.device_id));
return at::Device(DeviceType::PrivateUse1, index);
default:
TORCH_CHECK(
false, "Unsupported device_type: ", std::to_string(ctx.device_type));
TORCH_CHECK_BUFFER(
false, "Unsupported device_type: ", std::to_string(type));
}
}
ScalarType toScalarType(const DLDataType& dtype) {
ScalarType stype = ScalarType::Undefined;
TORCH_CHECK(dtype.lanes == 1, "ATen does not support lanes != 1");
TORCH_CHECK_BUFFER(dtype.lanes == 1, "ATen does not support lanes != 1");
switch (dtype.code) {
case DLDataTypeCode::kDLUInt:
switch (dtype.bits) {
@ -184,7 +189,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
stype = ScalarType::UInt64;
break;
default:
TORCH_CHECK(
TORCH_CHECK_BUFFER(
false, "Unsupported kUInt bits ", std::to_string(dtype.bits));
}
break;
@ -203,7 +208,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
stype = ScalarType::Long;
break;
default:
TORCH_CHECK(
TORCH_CHECK_BUFFER(
false, "Unsupported kInt bits ", std::to_string(dtype.bits));
}
break;
@ -219,7 +224,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
stype = ScalarType::Double;
break;
default:
TORCH_CHECK(
TORCH_CHECK_BUFFER(
false, "Unsupported kFloat bits ", std::to_string(dtype.bits));
}
break;
@ -229,7 +234,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
stype = ScalarType::BFloat16;
break;
default:
TORCH_CHECK(
TORCH_CHECK_BUFFER(
false, "Unsupported kFloat bits ", std::to_string(dtype.bits));
}
break;
@ -245,7 +250,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
stype = ScalarType::ComplexDouble;
break;
default:
TORCH_CHECK(
TORCH_CHECK_BUFFER(
false, "Unsupported kFloat bits ", std::to_string(dtype.bits));
}
break;
@ -255,12 +260,12 @@ ScalarType toScalarType(const DLDataType& dtype) {
stype = ScalarType::Bool;
break;
default:
TORCH_CHECK(
TORCH_CHECK_BUFFER(
false, "Unsupported kDLBool bits ", std::to_string(dtype.bits));
}
break;
default:
TORCH_CHECK(false, "Unsupported code ", std::to_string(dtype.code));
TORCH_CHECK_BUFFER(false, "Unsupported code ", std::to_string(dtype.code));
}
return stype;
}
@ -314,11 +319,7 @@ T* toDLPackImpl(const Tensor& src) {
atDLMTensor->tensor.manager_ctx = atDLMTensor;
atDLMTensor->tensor.deleter = &deleter<T>;
atDLMTensor->tensor.dl_tensor.data = view.data_ptr();
c10::DeviceIndex device_id = 0;
if (src.is_cuda() || src.is_privateuseone()) {
device_id = src.get_device();
}
atDLMTensor->tensor.dl_tensor.device = getDLDevice(src, device_id);
atDLMTensor->tensor.dl_tensor.device = torchDeviceToDLDevice(src.device());
atDLMTensor->tensor.dl_tensor.ndim = static_cast<int32_t>(src.dim());
atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src);
atDLMTensor->tensor.dl_tensor.shape = view.sizes().data();
@ -346,7 +347,7 @@ at::Tensor fromDLPackImpl(T* src, std::function<void(void*)> deleter) {
}
DLTensor& dl_tensor = src->dl_tensor;
Device device = getATenDevice(dl_tensor.device, dl_tensor.data);
Device device = getATenDevice(dl_tensor.device.device_type, dl_tensor.device.device_id, dl_tensor.data);
ScalarType stype = toScalarType(dl_tensor.dtype);
if (!dl_tensor.strides) {
@ -388,4 +389,35 @@ Tensor fromDLPackVersioned(DLManagedTensorVersioned* src, std::function<void(voi
return fromDLPackImpl<DLManagedTensorVersioned>(src, std::move(deleter));
}
Tensor maybeCopyTensor(
const Tensor& data,
std::optional<DLDevice> optional_dl_device,
std::optional<bool> copy) {
bool force_copy = copy.has_value() && *copy;
bool force_move = copy.has_value() && !*copy;
if (optional_dl_device.has_value()) {
auto device = at::getATenDevice(
optional_dl_device->device_type,
static_cast<c10::DeviceIndex>(optional_dl_device->device_id));
if (device != data.device()) {
TORCH_CHECK_VALUE(
!force_move,
"cannot move (i.e. copy=False) tensor from ",
data.device(),
" to ",
device,
" without copying.");
return data.to(device);
}
}
if (force_copy) {
return data.clone();
}
return data;
}
} // namespace at
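
As a quick illustration of the DLPack path these helpers serve, here is a hedged Python sketch of a zero-copy round trip. The `copy=` / `dl_device=` keywords that `maybeCopyTensor` backs are only mentioned in a comment because their Python-side availability depends on the binding shipped with this change.

```python
import torch
from torch.utils.dlpack import to_dlpack, from_dlpack

x = torch.arange(4, dtype=torch.float32)
y = from_dlpack(to_dlpack(x))   # zero-copy round trip through a DLPack capsule
y[0] = 42.0
print(x[0].item())              # 42.0 -- x and y share storage

# maybeCopyTensor() above backs the array-API style keywords, e.g.
# x.__dlpack__(copy=True) forcing a clone, or copy=False refusing a cross-device
# move; whether those keywords are exposed in Python here is an assumption.
```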


@ -4,7 +4,7 @@
#include <ATen/Tensor.h>
#include <ATen/dlpack.h>
// this convertor will:
// this converter will:
// 1) take a Tensor object and wrap it in the DLPack tensor
// 2) take a dlpack tensor and convert it to the ATen Tensor
@ -21,6 +21,16 @@ TORCH_API Tensor fromDLPackVersioned(
TORCH_API DLDataType getDLDataType(const Tensor& t);
TORCH_API DLDevice getDLContext(const Tensor& tensor, const int64_t& device_id);
// Copies the Tensor if there's a device mismatch or copy is forced.
// This should be used before actually creating the DLPack capsule.
TORCH_API Tensor maybeCopyTensor(
const Tensor& data,
std::optional<DLDevice> optional_dl_device,
std::optional<bool> copy);
// Converts the given at::Device into a DLDevice.
TORCH_API DLDevice torchDeviceToDLDevice(at::Device device);
// This trait class is used for retrieving different attributes, such as the
// PyCapsule names and conversion functions for both DLPack tensor classes:
// `DLManagedTensor` and `DLManagedTensorVersioned`.


@ -1,5 +1,6 @@
#pragma once
#include <c10/core/CachingDeviceAllocator.h>
#include <c10/core/DeviceType.h>
#include <c10/macros/Macros.h>
@ -72,6 +73,27 @@ TORCH_API c10::DeviceIndex exchangeDevice(c10::DeviceIndex device_index);
// original device index that was active before the change.
TORCH_API c10::DeviceIndex maybeExchangeDevice(c10::DeviceIndex device_index);
TORCH_API inline void emptyCache() {
const auto device_type = getAccelerator(true).value();
at::getDeviceAllocator(device_type)->emptyCache();
}
TORCH_API inline at::CachingDeviceAllocator::DeviceStats getDeviceStats(
c10::DeviceIndex device_index) {
const auto device_type = getAccelerator(true).value();
return at::getDeviceAllocator(device_type)->getDeviceStats(device_index);
}
TORCH_API inline void resetAccumulatedStats(c10::DeviceIndex device_index) {
const auto device_type = getAccelerator(true).value();
at::getDeviceAllocator(device_type)->resetAccumulatedStats(device_index);
}
TORCH_API inline void resetPeakStats(c10::DeviceIndex device_index) {
const auto device_type = getAccelerator(true).value();
at::getDeviceAllocator(device_type)->resetPeakStats(device_index);
}
} // namespace at::accelerator
namespace at {
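
For orientation, the CUDA-flavoured Python calls below are the per-backend counterparts of the new device-generic helpers (`emptyCache`, `getDeviceStats`, `resetAccumulatedStats`, `resetPeakStats`). This is an illustrative sketch only, not a claim that these Python functions route through the new C++ entry points.

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    del x
    torch.cuda.empty_cache()                     # cf. at::accelerator::emptyCache()
    stats = torch.cuda.memory_stats()            # cf. ...::getDeviceStats(idx)
    torch.cuda.reset_peak_memory_stats()         # cf. ...::resetPeakStats(idx)
    torch.cuda.reset_accumulated_memory_stats()  # cf. ...::resetAccumulatedStats(idx)
    print(stats.get("allocated_bytes.all.current", 0))
```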


@ -233,8 +233,8 @@ Tensor FunctionalInverses::slice_Tensor_inverse(const Tensor& base, const Tensor
// NOLINTNEXTLINE(performance-unnecessary-value-param)
Tensor FunctionalInverses::split_Tensor_inverse(const Tensor& base, const Tensor& mutated_view, InverseReturnMode inverse_return_mode, int64_t mutated_view_idx, c10::SymInt split_size, int64_t dim) {
// It would be nice if this logic could be re-used from autograd's split_backward(), but I don't think it can.
// For functionalization, we have only have one of the tensors from the TensorList outputed by split(), and we want to layer i
// It would be nice if this logic could be reused from autograd's split_backward(), but I don't think it can.
// For functionalization, we have only have one of the tensors from the TensorList outputted by split(), and we want to layer i
// on top of the base tensor.
// For autograd, we have all of the tensors outputted by split() and we just want to stack them.
dim = at::maybe_wrap_dim(dim, base.dim());


@ -286,11 +286,11 @@ void FunctionalTensorWrapper::storage_resize_(const c10::SymInt& new_size) {
// storage resizing is severely limited: we only support resizing either to zero, or from zero bytes.
TORCH_CHECK(new_size == 0 || curr_storage_size == 0, "new_size: ", new_size, ". curr_storage_size: ", curr_storage_size);
// The "functionalization rule" for storage resizing is a giant no-op, mainly because we don't want
// resize_() calls to actualy emit any ops in the functional graph.
// resize_() calls to actually emit any ops in the functional graph.
// How does it work?
// Resizing up (old size == 0):
// We do nothing in this case.
// The expection is that for the user code to be valid, the next op that should run against the current tensor "x"
// The expectation is that for the user code to be valid, the next op that should run against the current tensor "x"
// will be a x.copy_(y) (or similar), that will fully overwrite the data of x.
// If there are any outstanding aliases of x, we expect them not to be used until after the copy_() call
// (otherwise the eager code would be invalid),
@ -327,7 +327,7 @@ void FunctionalTensorWrapper::maybe_replace_storage(const Tensor& other) {
// We're also no longer re-generate "b" fully from "a" anymore, since "a" refers to a slice of "b"'s data.
//
// This is probably fixable in theory, but:
// - the fix would likey complicated the functionalization logic quite a bit.
// - the fix would likely complicated the functionalization logic quite a bit.
// - the primary use case for resize_() today is resizing zero-sized tensors in out= variants of operators
// - resize_() also can give you weird results today if you try to resize_() a weirdly strided tensor.
//
@ -344,7 +344,7 @@ void FunctionalTensorWrapper::maybe_replace_storage(const Tensor& other) {
set_sizes_and_strides(value_.sizes(), value_.strides());
refresh_numel();
// (Technically we should be guaranteed that the tensor was already contiguous,
// since it's guaranteed not to have been a view. Doesnt hurt to run though)
// since it's guaranteed not to have been a view. Doesn't hurt to run though)
refresh_contiguous();
// Swapping out the storage of a tensor (aka from a resize_() call) will update the sizes and strides of the tensor,
// so we need to record the fact that metadata was mutated.
@ -819,7 +819,7 @@ void setFunctionalizationReapplyViewsTLS(bool reapply_views) {
// This function will "functionalize" it.
// That is, it will call the operator, but removing any intermediate views/mutations
// that are performed inside of it.
// This is useful for LTC/XLA, which would like to re-use some of our composite kernels
// This is useful for LTC/XLA, which would like to reuse some of our composite kernels
// from pytorch core but not have to worry about the view ops that they might call.
// e.g. at::block_diag
void functionalize_op_helper(const c10::OperatorHandle& op, torch::jit::Stack* stack) {


@ -218,7 +218,7 @@ static Tensor safeStack(TensorList tensors) {
// is possible for the backward function to return an undefined grad for some
// grad_input for each example. In that case, we return an undefined grad.
//
// It is theoretically posssible for *some* of the examples to produce an
// It is theoretically possible for *some* of the examples to produce an
// undefined grad (a kernel could peek at the gradient values and return an
// undefined tensor if it determines the gradient is full of zeros). We
// could handle this by treating the undefined grad as a zero-filled tensor


@ -140,7 +140,7 @@ struct TORCH_API VmapPhysicalView {
// mapping a physical tensor to a new logical tensor (BatchedTensor)
VmapPhysicalToLogicalMap getPhysicalToLogicalMap() const;
// Maps a logical shape to a physical shape by pre-pending the batch
// Maps a logical shape to a physical shape by prepending the batch
// sizes to the logical shape.
VmapDimVector getPhysicalShape(IntArrayRef logical_shape) const;


@ -299,7 +299,7 @@ MapAllocator::MapAllocator(WithFd, std::string_view filename, int fd, int flags,
::close(fd);
TORCH_CHECK(false, "unable to stretch file <", filename_, "> to the right size: ", c10::utils::str_error(last_err), " (", last_err, ")");
}
/* on macOS write returns with errno 45 (Opperation not supported) when used
/* on macOS write returns with errno 45 (Operation not supported) when used
* with a file descriptor obtained via shm_open
*/
#ifndef __APPLE__


@ -211,7 +211,7 @@ NestedTensorImpl::NestedTensorImpl(
}
// assume contiguous, `nested_strides` and `offsets`
// can be infered from `nested_sizes`
// can be inferred from `nested_sizes`
NestedTensorImpl::NestedTensorImpl(
const at::Tensor& buffer,
const at::Tensor& nested_sizes)


@ -32,7 +32,7 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl {
at::Tensor nested_strides,
at::Tensor storage_offsets);
// assume contiguous, `nested_strides` and `offsets`
// can be infered from `nested_sizes`
// can be inferred from `nested_sizes`
explicit NestedTensorImpl(
const at::Tensor& buffer,
const at::Tensor& nested_sizes);


@ -93,12 +93,12 @@ ident: identity for binary combination function sf. sf(ident, x) needs to return
x.
f: function for reduction over a chunk. f needs to be of signature scalar_t
f(int64_t partial_begin, int64_t partial_end, scalar_t identifiy)
f(int64_t partial_begin, int64_t partial_end, scalar_t identify)
sf: function to combine two partial results. sf needs to be of signature
scalar_t sf(scalar_t x, scalar_t y)
For example, you might have a tensor of 10000 entires and want to sum together
For example, you might have a tensor of 10000 entries and want to sum together
all the elements. Parallel_reduce with a grain_size of 2500 will then allocate
an intermediate result tensor with 4 elements. Then it will execute the function
"f" you provide and pass the beginning and end index of these chunks, so


@ -8,7 +8,28 @@ namespace at {
namespace {
template <typename scalar_t>
inline void fill_inplace(Tensor& self, const Scalar& value_scalar) {
auto value = value_scalar.to<scalar_t>();
scalar_t value{};
if constexpr (std::is_same_v<scalar_t, at::Half> ||
std::is_same_v<scalar_t, at::BFloat16> ||
std::is_same_v<scalar_t, at::Float8_e5m2> ||
std::is_same_v<scalar_t, at::Float8_e5m2fnuz> ||
std::is_same_v<scalar_t, at::Float8_e4m3fn> ||
std::is_same_v<scalar_t, at::Float8_e4m3fnuz> ||
std::is_same_v<scalar_t, at::Float8_e8m0fnu>) {
// relaxed float cast: allow inf similar to the torch.tensor constructor
//
// without this, we had the following divergence:
// torch.tensor(1123581321.0, dtype=torch.float16)
// => tensor(inf, dtype=torch.float16)
// torch.ops.aten.scalar_tensor.default(1123581321, dtype=torch.float16)
// => RuntimeError: value cannot be converted to type at::Half without overflow
value = static_cast<scalar_t>(value_scalar.to<double>());
} else {
value = value_scalar.to<scalar_t>();
}
scalar_t* dptr = static_cast<scalar_t*>(self.data_ptr());
*dptr = value;
}
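
The divergence described in the new comment can be reproduced from Python as below; with the relaxed cast both paths are expected to saturate to `inf`, while builds without this change raise on the direct op call.

```python
import torch

# Matches the example in the comment above.
print(torch.tensor(1123581321.0, dtype=torch.float16))
# => tensor(inf, dtype=torch.float16)

# Before this change the direct op call raised
# "value cannot be converted to type at::Half without overflow";
# with the relaxed cast it is expected to produce inf as well.
print(torch.ops.aten.scalar_tensor.default(1123581321, dtype=torch.float16))
```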


@ -252,7 +252,7 @@ inline Tensor applySelect(
// Note: `size >= -index` is not equivalent to `size > -1 - index` if index
// is INT64_MIN For std::numeric_limits<int64_t>::min() result of unary
// minus is undefined by the standard but in practice is equal to self. On
// the other hand, indexing wraping is valid for all negative int64_t
// the other hand, indexing wrapping is valid for all negative int64_t
// values, as x[INT64_MIN] is the same as x[INT64_MAX]
TORCH_CHECK_INDEX(
size.sym_gt(-1 - index)
@ -315,10 +315,17 @@ inline void recordTensorIndex(
const Tensor& tensor,
std::vector<Tensor>& outIndices,
int64_t* dim_ptr) {
// TODO: check scalarType
outIndices.resize(*dim_ptr + 1);
outIndices[*dim_ptr] = tensor;
(*dim_ptr)++;
if (outIndices.empty()) {
outIndices.resize(*dim_ptr + 1);
outIndices[*dim_ptr] = tensor;
} else {
outIndices.push_back(tensor);
}
if (tensor.scalar_type() == kByte || tensor.scalar_type() == kBool) {
*dim_ptr += tensor.dim();
} else {
*dim_ptr += 1;
}
}
inline c10::List<::std::optional<Tensor>> typeConvertIndices(
@ -458,13 +465,23 @@ inline Tensor handleDimInMultiDimIndexing(
original_tensor_device,
prev_dim_result_sizes);
(*dim_ptr)++;
if (!outIndices.empty()) {
outIndices.resize(outIndices.size() + 1);
}
return result;
} else if (index.is_ellipsis()) {
(*dim_ptr) += original_tensor.dim() - (*specified_dims_ptr);
auto ellipsis_ndims = original_tensor.dim() - *specified_dims_ptr;
(*dim_ptr) += ellipsis_ndims;
if (!outIndices.empty()) {
outIndices.resize(outIndices.size() + ellipsis_ndims);
}
return prev_dim_result;
} else if (index.is_none()) {
Tensor result = prev_dim_result.unsqueeze(*dim_ptr);
(*dim_ptr)++;
if (!outIndices.empty()) {
outIndices.resize(outIndices.size() + 1);
}
return result;
} else if (index.is_boolean()) {
Tensor result = prev_dim_result.unsqueeze(*dim_ptr);
@ -560,6 +577,10 @@ inline Tensor applySlicing(
inline Tensor dispatch_index(
const Tensor& self,
std::vector<Tensor>&& indices) {
// Remove trailing null elements from indices
while (!indices.empty() && !indices.back().defined()) {
indices.pop_back();
}
return self.index(impl::typeConvertIndices(self, std::move(indices)));
}
@ -567,6 +588,10 @@ inline Tensor dispatch_index_put_(
Tensor& self,
std::vector<Tensor>&& indices,
const Tensor& value) {
// Remove trailing null elements from indices
while (!indices.empty() && !indices.back().defined()) {
indices.pop_back();
}
return self.index_put_(
impl::typeConvertIndices(self, std::move(indices)), value);
}
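
A small Python sketch of the indexing semantics the updated `recordTensorIndex` accounts for: a boolean (or byte) index tensor consumes as many dimensions of the indexed tensor as the mask itself has.

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)
mask = x[..., 0] > 5          # boolean mask of shape (2, 3)

# A bool/byte index consumes mask.dim() dimensions of x, which is what the
# updated C++ recordTensorIndex() now tracks when advancing the dim pointer.
print(x[mask].shape)          # torch.Size([4, 4])
print(x[mask, 1].shape)       # torch.Size([4]) -- mask covers dims 0-1, then dim 2 is indexed
```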


@ -208,7 +208,7 @@ bool TensorIteratorConfig::is_tensor_const(size_t idx) {
// same strides are increasing. If dimensions are non-increasing, we move on to the next input to break the tie.
//
// Instead of applying rule 4 for tie breaking, we could move on to the next tensor directly. This would result in possibly
// losing the correct permuation of the first tensor if there are permuted trivial dimensions, but could potentially
// losing the correct permutation of the first tensor if there are permuted trivial dimensions, but could potentially
// improve traversal order of the second tensor. We chose the former option to better propagate channels last layout
// for example for a tensor with the sizes N1H1
// These rules result in the intuitive behavior that in most cases recovers permutation of either the first argument (if all
@ -244,7 +244,7 @@ void TensorIteratorBase::reorder_dimensions() {
// initialize perm with n-1, n-2, ..., 1, 0
std::iota(perm_.rbegin(), perm_.rend(), 0);
// Reordering dimensions changes iteraton order
// Reordering dimensions changes iteration order
if (enforce_linear_iteration_) {
permute_dimensions(perm_);
return;


@ -388,7 +388,7 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase {
/// Return scalar value from original_tensor_base if it is defined. When
/// common_dtype is Half, casting scalar input to common_dtype might overflow.
/// If the scalar is aleady given in the type of Half, then return scalar
/// If the scalar is already given in the type of Half, then return scalar
/// value from tensor_base.
template <typename T>
T original_scalar_value(int64_t arg) {
@ -502,7 +502,7 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase {
/// kernels
bool can_use_32bit_indexing() const;
/// An "iteratable" object that recursively splits this iterator into
/// An "iterable" object that recursively splits this iterator into
/// sub-iterators that can use 32-bit indexing.
SplitUntil32Bit with_32bit_indexing() const;
@ -878,7 +878,7 @@ class TORCH_API TensorIteratorConfig final {
// Sets the enforce_linear_iteration_ flag, which is false by default.
// If true, iteration goes in the same order as a C-contiguous tensor
// is layed out in memory. i.e. last dimension iterates fastest.
// is laid out in memory. i.e. last dimension iterates fastest.
//
// This iteration order can be less efficient and may even prevent
// vectorization. So only use if the correctness of your kernel depends on it.


@ -78,7 +78,7 @@ inline bool areAnyOptionalTensorSubclassLike(
// NOTE: This function expects a scalar tensor of boolean dtype.
// Eg.
// Non-Composite Compliant Pattern : (t == 0).all().item<bool>()
// Composite Compliant Patter : is_salar_tensor_true((t == 0).all())
// Composite Compliant Pattern : is_salar_tensor_true((t == 0).all())
inline bool is_scalar_tensor_true(const Tensor& t) {
TORCH_INTERNAL_ASSERT(t.dim() == 0)
TORCH_INTERNAL_ASSERT(t.scalar_type() == kBool)


@ -378,9 +378,9 @@ inline static std::optional<ResultVec> computeStride_impl(
(TORCH_GUARD_OR_TRUE(sym_ne(oldshape[tensor_d - 1], 1)) &&
TORCH_GUARD_OR_TRUE(sym_ne(oldstride[tensor_d - 1], tensor_numel * chunk_base_stride)))) {
// We want to accumulate stuff in view_numel until view_numel == tensor_numel, if we do not
// know if that is satisfied we keep accumalating. For example if view_numel = 1 and tensor_numel = u1,
// know if that is satisfied we keep accumulating. For example if view_numel = 1 and tensor_numel = u1,
// we want to take that path, view_numel will become u0. Next iteration if u0==u1 we want to stop.
// Thats why we use TORCH_GUARD_OR_TRUE below.
// That's why we use TORCH_GUARD_OR_TRUE below.
// we use TORCH_GUARD_OR_FALSE and not TORCH_GUARD_OR_TRUE when comparing newshape[view_d] ==1 because
// if we know view_numel < tensor_numel is false, we want to stop. Unless we know for sure newshape[view_d]==1


@ -27,7 +27,7 @@
// ops (ops being called by other ops). After the intermediate op call
// finishes it's set back to the original `TracingState` object.
//
// The `TracingState` obect in TLS can also be read/written via its Python
// The `TracingState` object in TLS can also be read/written via its Python
// binding in `python_tracer.cpp`, and `get/setTracingState()` C++ APIs,
// which are also exposed as `TORCH_API`.
//


@ -95,7 +95,7 @@ namespace at {
m.impl("clone", torch::CppFunction::makeFallthrough());
m.impl("dot", torch::CppFunction::makeFallthrough());
m.impl("vdot", torch::CppFunction::makeFallthrough());
// The functions in the list below have a specific registeration in native_functions.yaml and
// The functions in the list below have a specific registration in native_functions.yaml and
// do not use the fallback.
// m.impl("mul.Tensor", torch::CppFunction::makeFallthrough());
// m.impl("add.Tensor", torch::CppFunction::makeFallthrough());


@ -377,7 +377,7 @@ Keep it simple for now by assuming only one such flag is
present in the argument list. If I ever need a function
with more than flag I'll figure out something else.
The policy is:
If the user has explicity specified a dtype, respect it.
If the user has explicitly specified a dtype, respect it.
Otherwise, set it to the autocast type.
********************************************************/


@ -2,7 +2,6 @@
#include <ATen/cuda/CUDAGraph.h>
#include <ATen/cuda/Exceptions.h>
#include <ATen/Functions.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAFunctions.h>
#include <cstddef>


@ -2,6 +2,7 @@
#include <ATen/Tensor.h>
#include <c10/core/Device.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAGraphsC10Utils.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/util/flat_hash_map.h>


@ -199,7 +199,7 @@ typedef struct {
* `byte_offset` field should be used to point to the beginning of the data.
*
* Note that as of Nov 2021, multiply libraries (CuPy, PyTorch, TensorFlow,
* TVM, perhaps others) do not adhere to this 256 byte aligment requirement
* TVM, perhaps others) do not adhere to this 256 byte alignment requirement
* on CPU/CUDA/ROCm, and always use `byte_offset=0`. This must be fixed
* (after which this note will be updated); at the moment it is recommended
* to not rely on the data pointer being correctly aligned.


@ -1,6 +1,6 @@
#pragma once
#include <c10/core/Allocator.h>
#include <c10/core/CachingDeviceAllocator.h>
#include <c10/core/DeviceType.h>
// Use of c10::hip namespace here makes hipification easier, because
@ -10,10 +10,10 @@ namespace c10::hip {
// Takes a valid HIPAllocator (of any sort) and turns it into
// an allocator pretending to be a CUDA allocator. See
// Note [Masquerading as CUDA]
class HIPAllocatorMasqueradingAsCUDA final : public Allocator {
Allocator* allocator_;
class HIPAllocatorMasqueradingAsCUDA final : public DeviceAllocator {
DeviceAllocator* allocator_;
public:
explicit HIPAllocatorMasqueradingAsCUDA(Allocator* allocator)
explicit HIPAllocatorMasqueradingAsCUDA(DeviceAllocator* allocator)
: allocator_(allocator) {}
DataPtr allocate(size_t size) override {
DataPtr r = allocator_->allocate(size);
@ -26,6 +26,24 @@ public:
void copy_data(void* dest, const void* src, std::size_t count) const final {
allocator_->copy_data(dest, src, count);
}
bool initialized() override {
return allocator_->initialized();
}
void emptyCache(MempoolId_t mempool_id = {0, 0}) {
allocator_->emptyCache(mempool_id);
}
void recordStream(const DataPtr& ptr, c10::Stream stream) {
allocator_->recordStream(ptr, stream);
}
CachingDeviceAllocator::DeviceStats getDeviceStats(c10::DeviceIndex device) {
return allocator_->getDeviceStats(device);
}
void resetAccumulatedStats(c10::DeviceIndex device) {
allocator_->resetAccumulatedStats(device);
}
void resetPeakStats(c10::DeviceIndex device) {
allocator_->resetPeakStats(device);
}
};
} // namespace c10::hip


@ -4,8 +4,9 @@
namespace c10 { namespace hip {
namespace HIPCachingAllocatorMasqueradingAsCUDA {
static HIPAllocatorMasqueradingAsCUDA allocator(HIPCachingAllocator::get());
Allocator* get() {
static HIPAllocatorMasqueradingAsCUDA allocator(HIPCachingAllocator::get());
return &allocator;
}
@ -13,5 +14,9 @@ void recordStreamMasqueradingAsCUDA(const DataPtr& ptr, HIPStreamMasqueradingAsC
HIPCachingAllocator::recordStream(ptr, stream.hip_stream());
}
// Register this HIP allocator as CUDA allocator to enable access through both
// c10::GetAllocator(kCUDA) and c10::getDeviceAllocator(kCUDA) APIs
REGISTER_ALLOCATOR(kCUDA, &allocator)
} // namespace HIPCachingAllocatorMasqueradingAsCUDA
}} // namespace c10::hip
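
Illustrative only: because the HIP caching allocator is now registered under `kCUDA`, the usual memory introspection keeps working on ROCm builds, where "cuda" devices are HIP devices masquerading as CUDA.

```python
import torch

if torch.cuda.is_available():
    t = torch.ones(1 << 20, device="cuda")
    # On ROCm builds these calls are served by the masquerading HIP allocator
    # reachable through the generic kCUDA allocator accessors registered above.
    print(torch.cuda.memory_allocated())
    print(torch.version.hip)   # non-None on ROCm builds, None on CUDA builds
```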


@ -2453,7 +2453,7 @@ TORCH_IMPL_FUNC(linalg_qr_out)(const Tensor& A,
// geqrf requires m x n workspace input that is modified in-place
// We try to use Q. If it doesn't fit, we try to use R
// If m > n and compute_q==false, it won't fit into Q or R, so we neet to create an auxiliary tensor
// If m > n and compute_q==false, it won't fit into Q or R, so we need to create an auxiliary tensor
Tensor QR;
if (compute_q && Q.size(-1) == n) {
QR = Q;
@ -4095,7 +4095,7 @@ Tensor linalg_vander_symint(
const auto n = N.value_or(shape.back());
TORCH_CHECK(n > 1, "N must be greater than 1.");
// Append cumprod of the oher 0...n-1 powers
// Append cumprod of the other 0...n-1 powers
shape.push_back(n - 1);
auto result = at::cumprod(x_.unsqueeze(-1).expand_symint(shape), -1);
// The row of ones


@ -202,7 +202,7 @@ void gemm(
float *c, int64_t ldc) {
internal::normalize_last_dims(transa, transb, m, n, k, &lda, &ldb, &ldc);
#if AT_MKLDNN_ENABLED()
if (mkldnn_bf32_gemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)) {
if (mkldnn_reduced_f32_gemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)) {
return;
}
#endif


@ -54,7 +54,7 @@ bool ceil_mode) {
TORCH_CHECK((input.ndimension() == 3 || input.ndimension() == 4),
"non-empty 3D or 4D (batch mode) tensor expected for input");
} else {
TORCH_CHECK(false, "Unsupport memory format. Supports only ChannelsLast, Contiguous");
TORCH_CHECK(false, "Unsupported memory format. Supports only ChannelsLast, Contiguous");
}
/* sizes */
@ -130,7 +130,7 @@ const Tensor& indices) {
TORCH_CHECK((input.ndimension() == 3 || input.ndimension() == 4),
"non-empty 3D or 4D (batch mode) tensor expected for input");
} else {
TORCH_CHECK(false, "Unsupport memory format. Supports only ChannelsLast, Contiguous");
TORCH_CHECK(false, "Unsupported memory format. Supports only ChannelsLast, Contiguous");
}
/* sizes */


@ -63,7 +63,7 @@ void max_pool3d_with_indices_out_cpu_template(
TORCH_CHECK((input.ndimension() == 4 || input.ndimension() == 5),
"non-empty 4D or 5D (batch mode) tensor expected for input");
} else {
TORCH_CHECK(false, "Unsupport memory format. Supports only ChannelsLast3d, Contiguous");
TORCH_CHECK(false, "Unsupported memory format. Supports only ChannelsLast3d, Contiguous");
}
const int64_t nslices = input.size(-4);
@ -158,7 +158,7 @@ Tensor& max_pool3d_with_indices_backward_out_cpu_template(
TORCH_CHECK((input.ndimension() == 4 || input.ndimension() == 5),
"non-empty 4D or 5D (batch mode) tensor expected for input");
} else {
TORCH_CHECK(false, "Unsupport memory format. Supports only ChannelsLast3d, Contiguous");
TORCH_CHECK(false, "Unsupported memory format. Supports only ChannelsLast3d, Contiguous");
}
const int64_t nslices = input.size(-4);


@ -28,13 +28,13 @@ namespace at::native::templates {
// ==================================================== Random ========================================================
// The purpose of `update_from` and `update_to` is to find the closest valid int64_t number that can be used as actual `from`.
// The current implementation of `random_` uses uint64_t arithmetics and casts the result to the target dtype(scalar_t).
// The current implementation of `random_` uses uint64_t arithmetic and casts the result to the target dtype(scalar_t).
// This casting can result in generating numbers that happen to be greater or equal to `to` value. For instance:
//
// auto actual = torch::empty({3, 3}, torch::half);
// actual.random_(0, 65504);
//
// If random's uint64_t arithmetics produces 65503 as a random value after casting to torch::half it becomes 65504
// If random's uint64_t arithmetic produces 65503 as a random value after casting to torch::half it becomes 65504
// and violates the requirement that random value must be less than `to`. To resolve this issue `update_from` and `update_to`
// moves `from` to the right and `to` to the left to the next closest value that won't go outside [from, to) after casting to
// the target dtype. For `to` = 65504 it moves left for (1 << (log2(to) - 11 + 1)) = 32 and becomes 65472, which is previous


@ -86,7 +86,7 @@ namespace {
for (const auto d : c10::irange(out_D)) {
for (const auto h : c10::irange(out_H)) {
for (const auto w : c10::irange(out_W)) {
// get the corresponding input x, y, z co-ordinates from grid
// get the corresponding input x, y, z coordinates from grid
const scalar_t *grid_ptr_NDHW = grid_ptr_N + d * grid_sD + h * grid_sH + w * grid_sW;
scalar_t ix = *grid_ptr_NDHW;
scalar_t iy = grid_ptr_NDHW[grid_sCoor];
@ -285,7 +285,7 @@ namespace {
for (const auto d : c10::irange(out_D)) {
for (const auto h : c10::irange(out_H)) {
for (int64_t w = 0; w < out_W; ++w, gGrid_ptr_NDHW += gGrid_sW /* grad_grid is contiguous */ ) {
// get the corresponding input x, y, z co-ordinates from grid
// get the corresponding input x, y, z coordinates from grid
const scalar_t *grid_ptr_NDHW = grid_ptr_N + d * grid_sD + h * grid_sH + w * grid_sW;
scalar_t ix = *grid_ptr_NDHW;
scalar_t iy = grid_ptr_NDHW[grid_sCoor];
@ -496,7 +496,7 @@ static Tensor _grid_sampler_2d_cpu_quantized(
uint8_t* inp_ptr_N = inp_ptr + n * inp_sN;
for (const auto h : c10::irange(out_H)) {
for (const auto w : c10::irange(out_W)) {
// get the corresponding input x, y, z co-ordinates from grid
// get the corresponding input x, y, z coordinates from grid
float* grid_ptr_NHW = grid_ptr_N + h * grid_sH + w * grid_sW;
float x = *grid_ptr_NHW;
float y = grid_ptr_NHW[grid_sCoor];
@ -599,7 +599,7 @@ Tensor _grid_sampler_2d_cpu_fallback(const Tensor& input, const Tensor& grid,
const scalar_t *inp_ptr_N = inp_ptr + n * inp_sN;
for (const auto h : c10::irange(out_H)) {
for (const auto w : c10::irange(out_W)) {
// get the corresponding input x, y, z co-ordinates from grid
// get the corresponding input x, y, z coordinates from grid
const scalar_t *grid_ptr_NHW = grid_ptr_N + h * grid_sH + w * grid_sW;
scalar_t x = *grid_ptr_NHW;
scalar_t y = grid_ptr_NHW[grid_sCoor];
@ -771,7 +771,7 @@ _grid_sampler_2d_cpu_fallback_backward(const Tensor& grad_output,
scalar_t *gGrid_ptr_NHW = gGrid_ptr + n * gGrid_sN;
for (const auto h : c10::irange(out_H)) {
for (int64_t w = 0; w < out_W; ++w, gGrid_ptr_NHW += gGrid_sW /* grad_grid is contiguous */ ) {
// get the corresponding input x, y co-ordinates from grid
// get the corresponding input x, y coordinates from grid
const scalar_t *grid_ptr_NHW = grid_ptr_N + h * grid_sH + w * grid_sW;
scalar_t x = *grid_ptr_NHW;
scalar_t y = grid_ptr_NHW[grid_sCoor];


@ -1068,7 +1068,7 @@ inline scalar_t calc_igammac(scalar_t a, scalar_t x) {
* result at the boundary
* - if a is large and a ~ x, then using Uniform Asymptotic Expansions for
* Large Parameter (see DLMF 8.12.4 [igam1])
* - if x > 1.1 and x < a, using the substraction from the regularized lower
* - if x > 1.1 and x < a, using the subtraction from the regularized lower
* incomplete gamma
* - otherwise, calculate the series from [igam2] eq (5)
*/
@ -1148,7 +1148,7 @@ scalar_t calc_igamma(scalar_t a, scalar_t x) {
* result at the boundary
* - if a is large and a ~ x, then using Uniform Asymptotic Expansions for
* Large Parameter (see DLMF 8.12.3 [igam1])
* - if x > 1 and x > a, using the substraction from the regularized upper
* - if x > 1 and x > a, using the subtraction from the regularized upper
* incomplete gamma
* - otherwise, calculate the series from [igam2] eq (4)
*/
@ -1730,7 +1730,7 @@ inline C10_HOST_DEVICE T calc_ndtri(T y0) {
with the usual checks for overflow etcetera.
Performance-wise, it seems to be substantially faster than either
the SLATEC DERFC function [or an erfcx function derived therefrom]
the SLATEC DERFC function [or an erfcx function derived there from]
or Cody's CALERF function (from netlib.org/specfun), while
retaining near machine precision in accuracy. */


@ -17,7 +17,7 @@ using max_pool2d_backward_fn = void(*)(const Tensor& grad_input, const Tensor& g
DECLARE_DISPATCH(max_pool2d_fn, max_pool2d_kernel)
DECLARE_DISPATCH(max_pool2d_backward_fn, max_pool2d_backward_kernel)
// averge pooling has same signature for forward and backward
// average pooling has same signature for forward and backward
using avg_pool2d_fn = void(*)(const Tensor& output, const Tensor& input, int64_t kW, int64_t kH,
int64_t dW, int64_t dH, int64_t padW, int64_t padH, bool count_include_pad, std::optional<int64_t> divisor_override);
using avg_pool2d_backward_fn = void(*)(const Tensor& output, const Tensor& input, int kW, int kH,
@ -26,7 +26,7 @@ using avg_pool2d_backward_fn = void(*)(const Tensor& output, const Tensor& input
DECLARE_DISPATCH(avg_pool2d_fn, avg_pool2d_kernel)
DECLARE_DISPATCH(avg_pool2d_backward_fn, avg_pool2d_backward_kernel)
// averge pooling has same signature for forward and backward
// average pooling has same signature for forward and backward
using avg_pool3d_fn = void(*)(const Tensor& output, const Tensor& input,
int64_t kW, int64_t kH, int64_t kD, int64_t dW, int64_t dH, int64_t dD,
int64_t padW, int64_t padH, int64_t padD, bool count_include_pad,


@ -480,7 +480,7 @@ REGISTER_ZVECTOR_DISPATCH(_segment_reduce_offsets_stub, &_segment_reduce_offsets
REGISTER_SVE256_DISPATCH(_segment_reduce_offsets_stub, &_segment_reduce_offsets_cpu_kernel)
// Currently some computation is being duplicated across forward and backward.
// TODO: Cache indices in forward pass to re-use in backward
// TODO: Cache indices in forward pass to reuse in backward
Tensor _segment_reduce_backward_kernel(
const Tensor& grad,
const Tensor& output,

View File

@ -475,7 +475,7 @@ static void build_index_op(
TensorIteratorBase& iter,
const at::native::AdvancedIndex& info,
const Tensor& result) {
// 'TensorIterator' needs to own the things comming from 'info', since
// 'TensorIterator' needs to own the things coming from 'info', since
// 'info' will be destroyed after the META function.
TensorIteratorConfig config;
// info.src is a restrided view of result


@ -35,7 +35,9 @@ inline std::tuple<bool, Tensor> canDispatchToMaskedFill(
auto self_device = self.device();
for (const std::optional<Tensor>& i : indices) {
if (!i.has_value() || !(*i).defined()) {
num_ind++;
if (!mask.defined()) {
num_ind++;
}
} else {
const Tensor& index = *i;
if ((index.scalar_type() != kByte && index.scalar_type() != kBool) ||


@ -67,7 +67,7 @@ namespace at::native {
namespace {
// dense_to_sparse_{csr,bsr,csc,bsc} common helpers
// Preparation fo the N-D dense -> sparse compressed conversion.
// Preparation for the N-D dense -> sparse compressed conversion.
// The N-D input is converted to 3-D (single batch dim) where we check that the
// product of batch dims is nonzero and for each batch the sparse matrix
// contained within has the same number of non-zero elements.


@ -1367,9 +1367,9 @@ void randperm_cpu(Tensor& result, int64_t n, CPUGeneratorImpl* generator) {
for (int64_t i = 0; i < n - 1; i++) {
// NOLINTNEXTLINE(clang-analyzer-security.insecureAPI.rand)
int64_t z = generator->random() % (n - i);
scalar_t sav = r__data[i * r__stride_0];
scalar_t save = r__data[i * r__stride_0];
r__data[i * r__stride_0] = r__data[(z + i) * r__stride_0];
r__data[(z + i) * r__stride_0] = sav;
r__data[(z + i) * r__stride_0] = save;
}
return;
}


@ -80,7 +80,7 @@ static void two_pass_reduction(TensorIteratorBase& iter, loop2d_t loop) {
}
/// Chooses a dimension over which to parallelize. Prefers the outer-most
/// dimension thats larger than the number of available threads.
/// dimension that's larger than the number of available threads.
static int find_split_dim(TensorIteratorBase& iter) {
int num_threads = at::get_num_threads();
auto shape = iter.shape();


@ -247,7 +247,7 @@ TORCH_PRECOMPUTE_META_FUNC(cat)(const ITensorListRef& tensors, int64_t dim) {
// Checking names before the actual dimensions.
auto maybe_outnames = namedinference::compute_cat_outnames(materialized);
TORCH_CHECK(
TORCH_CHECK_VALUE(
!materialized.empty(),
"torch.cat(): expected a non-empty list of Tensors");
@ -274,7 +274,7 @@ TORCH_PRECOMPUTE_META_FUNC(cat)(const ITensorListRef& tensors, int64_t dim) {
// when computing the actual output dtype and the flags.
if (is_out_defined) {
// Check for type promotion, if the output tensor is defined.
TORCH_CHECK(
TORCH_CHECK_TYPE(
canCast(out_dtype, result.scalar_type()),
"torch.cat(): input types can't be cast to the desired output type ",
result.scalar_type());
@ -293,7 +293,7 @@ TORCH_PRECOMPUTE_META_FUNC(cat)(const ITensorListRef& tensors, int64_t dim) {
// are compatible, i.e. we can execute `cat` on them.
bool found_valid_tensor = valid < materialized.size();
if (found_valid_tensor) {
TORCH_CHECK(
TORCH_CHECK_INDEX(
dim <= materialized[valid].get().dim(),
"torch.cat(): dimension ",
dim,
@ -384,7 +384,7 @@ Tensor& set_storage_cpu_(
result.unsafeGetTensorImpl()->set_storage_offset(storage_offset);
at::OptionalIntArrayRef stride_opt =
stride.data() != nullptr ? at::OptionalIntArrayRef(stride) : std::nullopt;
// We can re-use this kernel for the meta device.
// We can reuse this kernel for the meta device.
// We just need to make sure we don't actually try to resize the (null)
// storage.
at::native::resize_impl_cpu_(
@ -505,7 +505,7 @@ Tensor& set_cpu_(Tensor& result) {
return result;
}
// We can't re-use the cpu kernel here because we don't want to use the cpu
// We can't reuse the cpu kernel here because we don't want to use the cpu
// allocator.
Tensor& set_meta_(Tensor& result) {
caffe2::TypeMeta dtype = result.dtype();
@ -1904,7 +1904,7 @@ Tensor repeat(const Tensor& self, IntArrayRef repeats) {
}
Tensor tile_symint(const Tensor& self, SymIntArrayRef reps) {
// If self.size() > len(reps), reps is promoted to self.size() by pre-pending
// If self.size() > len(reps), reps is promoted to self.size() by prepending
// 1s to it to keep the same behaviour as `numpy.tile`.
// Thus for a tensor of shape (2, 3, 4, 5), a dims of (2, 2) is treated
// as (1, 1, 2, 2).
@ -2428,7 +2428,7 @@ Tensor index_select_sparse_cpu(
const auto dim_indices = indices[dim].contiguous();
// If nnz is smaller than size, then either indices[dim] or index gets
// sorted, then this is followed by a binary search to find interesections.
// sorted, then this is followed by a binary search to find intersections.
const auto get_selected_indices_small_nnz_large_size =
[&]() -> std::tuple<Tensor, Tensor> {
const auto grain_size = at::internal::GRAIN_SIZE;
@ -3934,7 +3934,7 @@ Tensor squeeze_qtensor(const Tensor& self, c10::OptionalIntArrayRef dims) {
quantizer->scalar_type());
}
// TODO: quantized Tensor support for SymInt needs to be added but basic
// building blocs are missing for now.
// building blocks are missing for now.
auto result = make_qtensor(
self,
C10_AS_INTARRAYREF_SLOW(sizes),
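
The `torch.cat()` checks earlier in this file now use typed `TORCH_CHECK_VALUE` / `TORCH_CHECK_TYPE` / `TORCH_CHECK_INDEX` macros, which conventionally surface in Python as `ValueError` / `TypeError` / `IndexError`. A hedged sketch of the expected user-visible effect:

```python
import torch

try:
    torch.cat([])
except ValueError as e:        # expected after this change (TORCH_CHECK_VALUE)
    print("ValueError:", e)
except RuntimeError as e:      # builds without this change raise a bare RuntimeError
    print("RuntimeError:", e)
```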


@ -14,6 +14,12 @@
namespace at::native { namespace {
// fixes segfaults for GCC >= 12 on some AArch64 cpus https://github.com/pytorch/pytorch/issues/157626
#if defined(__GNUC__) && __GNUC__ >= 12 && defined(__aarch64__)
#pragma GCC push_options
#pragma GCC optimize ("no-strict-aliasing")
#endif
/** NOTE [ Grid Sample CPU Kernels ]
*
* Implementation of vectorized grid sample CPU kernels is divided into three
@ -1014,6 +1020,10 @@ struct ApplyGridSample<scalar_t, 2, GridSamplerInterpolation::Bicubic,
}
};
#if defined(__GNUC__) && __GNUC__ >= 12 && defined(__aarch64__)
#pragma GCC pop_options
#endif
// ~~~~~~~~~~~~~~~~~~ grid_sample_2d_grid_slice_iterator ~~~~~~~~~~~~~~~~~~~~~~
// Function to apply a vectorized function on a grid slice tensor (without batch
// dimension).

File diff suppressed because it is too large.


@ -70,31 +70,4 @@ void run_cudnn_SDP_bprop(
const Tensor& dropoutseed,
const Tensor& dropoutoffset);
void run_cudnn_SDP_bprop_nestedtensor(
int64_t b,
int64_t h_q,
int64_t h_k,
int64_t h_v,
int64_t s_q,
int64_t s_kv,
int64_t d_qk,
int64_t d_v,
float scaling_factor,
bool is_causal,
float dropout_probability,
const Tensor& cum_seqlen_q,
const Tensor& cum_seqlen_kv,
const Tensor& q,
const Tensor& k,
const Tensor& v,
const std::optional<Tensor>& attn_bias,
const Tensor& o,
const Tensor& dO,
const Tensor& softmaxstats,
Tensor& dQ,
Tensor& dK,
Tensor& dV,
const Tensor& dropoutseed,
const Tensor& dropoutoffset);
} // namespace at::native


@ -160,6 +160,10 @@ static bool mkldnn_conv_enabled_fpmath_mode_bf16(){
mkldnn_bf16_device_check();
}
static bool mkldnn_conv_enabled_fpmath_mode_tf32(){
return at::globalContext().float32Precision("mkldnn", "conv") == "tf32" &&
cpuinfo_has_x86_amx_fp16();
}
static inline at::MemoryFormat mkldnn_convolution_memory_format(int64_t dims, bool is_channels_last) {
auto memory_format = at::MemoryFormat::Contiguous;
@ -271,6 +275,10 @@ static Tensor _mkldnn_convolution(
input_t.scalar_type() == at::kFloat) {
op_attr.set_fpmath_mode(dnnl_fpmath_mode_bf16);
}
if (mkldnn_conv_enabled_fpmath_mode_tf32() &&
input_t.scalar_type() == at::kFloat) {
op_attr.set_fpmath_mode(dnnl_fpmath_mode_tf32);
}
_mkldnn_convolution_out(
input_t,
weight_t,
@ -455,6 +463,9 @@ Tensor mkldnn_convolution_pointwise_binary(
if (mkldnn_conv_enabled_fpmath_mode_bf16() && input_t.scalar_type() ==at::kFloat){
op_attr.set_fpmath_mode(dnnl_fpmath_mode_bf16);
}
if (mkldnn_conv_enabled_fpmath_mode_tf32() && input_t.scalar_type() ==at::kFloat){
op_attr.set_fpmath_mode(dnnl_fpmath_mode_tf32);
}
if (bias.defined()) {
const ideep::tensor b = itensor_from_tensor(bias);
@ -597,6 +608,10 @@ Tensor& mkldnn_convolution_pointwise_binary_(
input_t.scalar_type() == at::kFloat) {
op_attr.set_fpmath_mode(dnnl_fpmath_mode_bf16);
}
if (mkldnn_conv_enabled_fpmath_mode_tf32() &&
input_t.scalar_type() == at::kFloat) {
op_attr.set_fpmath_mode(dnnl_fpmath_mode_tf32);
}
_mkldnn_convolution_out(
input_t,
weight_t,
@ -718,6 +733,9 @@ Tensor _mkldnn_convolution_transpose(
if (mkldnn_conv_enabled_fpmath_mode_bf16() && input_t.scalar_type() ==at::kFloat){
op_attr.set_fpmath_mode(dnnl_fpmath_mode_bf16);
}
if (mkldnn_conv_enabled_fpmath_mode_tf32() && input_t.scalar_type() ==at::kFloat){
op_attr.set_fpmath_mode(dnnl_fpmath_mode_tf32);
}
if (bias.defined()) {
const ideep::tensor b = itensor_from_tensor(bias, /*from_const_data_ptr*/true);
@ -808,6 +826,10 @@ Tensor mkldnn_convolution_backward_input(
weight.scalar_type() == at::kFloat) {
op_attr.set_fpmath_mode(dnnl_fpmath_mode_bf16);
}
if (mkldnn_conv_enabled_fpmath_mode_tf32() &&
weight.scalar_type() == at::kFloat) {
op_attr.set_fpmath_mode(dnnl_fpmath_mode_tf32);
}
ideep::convolution_backward_data::compute_v2(
grad_y,
w,
@ -828,6 +850,11 @@ Tensor mkldnn_convolution_backward_input(
TORCH_WARN_ONCE(
"Unexpected ideep version to support fpmath_mode_bf16, please update ideep version to align with pytorch main branch");
}
if (mkldnn_conv_enabled_fpmath_mode_tf32() &&
weight.scalar_type() == at::kFloat) {
TORCH_WARN_ONCE(
"Unexpected ideep version to support fpmath_mode_tf32, please update ideep version to align with pytorch main branch");
}
#endif
if (grad_output.is_mkldnn()) {
@ -858,6 +885,10 @@ std::tuple<Tensor, Tensor> mkldnn_convolution_backward_weights(
input.scalar_type() == at::kFloat) {
op_attr.set_fpmath_mode(dnnl_fpmath_mode_bf16);
}
if (mkldnn_conv_enabled_fpmath_mode_tf32() &&
input.scalar_type() == at::kFloat) {
op_attr.set_fpmath_mode(dnnl_fpmath_mode_tf32);
}
if (bias_defined) {
ideep::convolution_backward_weights::compute_v2(
x,
@ -1011,6 +1042,10 @@ Tensor mkldnn_convolution_transpose_backward_input(
weight.scalar_type() == at::kFloat) {
op_attr.set_fpmath_mode(dnnl_fpmath_mode_bf16);
}
if (mkldnn_conv_enabled_fpmath_mode_tf32() &&
weight.scalar_type() == at::kFloat) {
op_attr.set_fpmath_mode(dnnl_fpmath_mode_tf32);
}
ideep::convolution_transpose_backward_data::compute_v3(
grad_y,
w,
@ -1053,6 +1088,10 @@ std::tuple<Tensor,Tensor> mkldnn_convolution_transpose_backward_weights(
input.scalar_type() == at::kFloat) {
op_attr.set_fpmath_mode(dnnl_fpmath_mode_bf16);
}
if (mkldnn_conv_enabled_fpmath_mode_tf32() &&
input.scalar_type() == at::kFloat) {
op_attr.set_fpmath_mode(dnnl_fpmath_mode_tf32);
}
if (bias_defined) {
ideep::convolution_transpose_backward_weights::compute_v3(
x,
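
A hedged sketch of how the new oneDNN TF32 paths would be enabled from Python. The `torch.backends.mkldnn.*.fp32_precision` attribute names are assumptions that follow the `backends.<backend>.<op>.fp32_precision` pattern used elsewhere in this diff, and the reduced-precision path only engages when the CPU (AMX) and ideep build support it.

```python
import torch

# Assumed attribute names (mirroring torch.backends.cudnn.conv.fp32_precision);
# availability depends on the build and on AMX support in the CPU.
torch.backends.mkldnn.conv.fp32_precision = "tf32"
torch.backends.mkldnn.matmul.fp32_precision = "tf32"

x = torch.randn(8, 3, 32, 32)
conv = torch.nn.Conv2d(3, 16, 3)
y = conv(x)   # fp32 math may be implicitly downconverted to tf32 inside oneDNN
```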


@ -73,6 +73,11 @@ static bool use_mkldnn_bf32_linear() {
mkldnn_bf16_device_check();
}
static bool use_mkldnn_tf32_linear() {
return at::globalContext().float32Precision("mkldnn", "matmul") == "tf32" &&
cpuinfo_has_x86_amx_fp16();
}
Tensor mkldnn_linear(
const Tensor& self,
const Tensor& weight_t, const std::optional<Tensor>& bias_opt) {
@ -259,6 +264,9 @@ Tensor mkldnn_linear_pointwise(
if (use_mkldnn_bf32_linear() && input_t.scalar_type() == at::kFloat){
op_attr.set_fpmath_mode(dnnl_fpmath_mode_bf16);
}
if (use_mkldnn_tf32_linear() && input_t.scalar_type() == at::kFloat){
op_attr.set_fpmath_mode(dnnl_fpmath_mode_tf32);
}
if (mkldnn_bias.has_value()) {
ideep::inner_product_forward::compute</*reorder_src=*/false, /*reorder_weight=*/false>(
mkldnn_input,
@ -352,6 +360,10 @@ Tensor mkldnn_linear_pointwise_binary(
op_attr.set_fpmath_mode(dnnl_fpmath_mode_bf16);
}
if (use_mkldnn_tf32_linear() && input_t.scalar_type() == at::kFloat){
op_attr.set_fpmath_mode(dnnl_fpmath_mode_tf32);
}
if (mkldnn_bias.has_value()) {
ideep::inner_product_forward::compute_binary</*reorder_src=*/false, /*reorder_weight=*/false>(
mkldnn_input,


@ -1,7 +1,8 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/core/Tensor.h>
#include <ATen/Config.h>
#include <ATen/Context.h>
#include <ATen/Dispatch.h>
#include <ATen/core/Tensor.h>
#include <ATen/native/mkldnn/Matmul.h>
#if !AT_MKLDNN_ENABLED()
@ -53,7 +54,7 @@ bool mkldnn_fp16_gemm(
c10::Half *c, int64_t ldc) {
return false;
}
bool mkldnn_bf32_gemm(
bool mkldnn_reduced_f32_gemm(
TransposeType transa, TransposeType transb,
int64_t m, int64_t n, int64_t k,
float alpha,
@ -85,6 +86,13 @@ void mkldnn_matmul_i8i8i32(
TORCH_INTERNAL_ASSERT(false, __func__, ": ATen not compiled with MKLDNN support");
}
bool use_mkldnn_tf32_matmul(
const Tensor& mat1,
const Tensor& mat2,
const Tensor& result) {
return false;
}
} // namespace at::native
@ -107,6 +115,10 @@ static bool use_mkldnn_bf32_matmul() {
return use_mkldnn_bf16_matmul() && at::globalContext().float32Precision("mkldnn", "matmul") == "bf16";
}
static bool use_mkldnn_tf32_matmul() {
return cpuinfo_has_x86_amx_fp16() && at::globalContext().float32Precision("mkldnn", "matmul") == "tf32";
}
// returns an ideep::tensor
// - dims: shape e.g: {M,N}
// - idtype: ideep data type e.g: (f32, bf16, f16)
@ -144,7 +156,8 @@ mkldnn_gemm(
bool bf16_usable = std::is_same_v<scalar_t, c10::BFloat16> && use_mkldnn_bf16_matmul();
bool fp16_usable = std::is_same_v<scalar_t, c10::Half> && use_mkldnn_fp16_matmul();
bool bf32_usable = std::is_same_v<scalar_t, float> && use_mkldnn_bf32_matmul();
if ( !(bf16_usable || fp16_usable || bf32_usable) ||
bool tf32_usable = std::is_same_v<scalar_t, float> && use_mkldnn_tf32_matmul();
if ( !(bf16_usable || fp16_usable || bf32_usable || tf32_usable) ||
(m * n * k <= 16 * 16 * 16) || (alpha == 0.0f)) {
return false;
}
@ -155,6 +168,7 @@ mkldnn_gemm(
op_attr = ideep::attr_t::fuse_sum();
}
if (bf32_usable) op_attr.set_fpmath_mode(dnnl_fpmath_mode_bf16); // bf32 path
if (tf32_usable) op_attr.set_fpmath_mode(dnnl_fpmath_mode_tf32); // tf32 path
// NOTE: View as c-contiguous to avoid extra reordering in mkldnn
// Use identity: C = AB <=> C^T = B^T A^T
@ -281,7 +295,7 @@ bool mkldnn_fp16_gemm(
return mkldnn_gemm<c10::Half>(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}
bool mkldnn_bf32_gemm(
bool mkldnn_reduced_f32_gemm(
TransposeType transa, TransposeType transb,
int64_t m, int64_t n, int64_t k,
float alpha,
@ -339,6 +353,7 @@ void mkldnn_matmul(
auto mat2_unsqueezed = mat2.dim() == 1 ? mat2.unsqueeze(1) : mat2;
auto result_unsqueezed = result.dim() == 1 ? result.unsqueeze(1) : result;
bool bf32_usable = mat1.scalar_type() == at::kFloat && use_mkldnn_bf32_matmul();
bool tf32_usable = mat1.scalar_type() == at::kFloat && use_mkldnn_tf32_matmul();
ideep::attr_t op_attr;
// "addmm", "addbmm" "baddbmm" in pytorch allow bias to be 2-D or 3-D tensor
@ -346,6 +361,7 @@ void mkldnn_matmul(
// to address their differences, we use mkldnn post ops to perform a fused "add" after matrix multiplication is over
if (beta != 0.0f) op_attr = ideep::attr_t::fuse_sum();
if (bf32_usable) op_attr.set_fpmath_mode(dnnl_fpmath_mode_bf16); // bf32 path
if (tf32_usable) op_attr.set_fpmath_mode(dnnl_fpmath_mode_tf32); // tf32 path
// If alpha = 0, dose not need actually do gemm computation
if (alpha == 0)
return;
@ -412,70 +428,56 @@ static inline bool checksize(const Tensor& mat1, const Tensor& mat2){
}
}
bool use_mkldnn_bf16_matmul(
template <typename T>
bool use_mkldnn_typed_matmul(
const Tensor& mat1,
const Tensor& mat2,
const Tensor& result) {
bool dtype_check = false;
if constexpr (std::is_same_v<T, c10::BFloat16>) {
#if defined(__aarch64__)
if (mkldnn_bf16_device_check_arm()) {
//onednn fastmath mode can leverage bf16 HW even for the fp32 input, e.g. Arm Neoverse V1
//so, don't restrict the mkldnn_matmul only for bf16 inputs, allow it for float as well
return (
use_mkldnn_bf16_matmul() &&
(mat1.scalar_type() == mat2.scalar_type()) && (!result.defined() || (mat1.scalar_type() == result.scalar_type())) &&
((mat1.scalar_type() == kFloat) || (mat1.scalar_type() == kBFloat16)) &&
mat1.numel() != 0 &&
mat2.numel() != 0 &&
checksize(mat1, mat2));
} else
if (mkldnn_bf16_device_check_arm()) {
// onednn fastmath mode can leverage bf16 HW even for the fp32 input, e.g.
// Arm Neoverse V1 so, don't restrict the mkldnn_matmul only for bf16
// inputs, allow it for float as well
dtype_check = use_mkldnn_bf16_matmul() &&
((mat1.scalar_type() == kFloat) || (mat1.scalar_type() == kBFloat16));
}
#else
dtype_check = dtype_check && use_mkldnn_bf16_matmul() &&
(mat1.scalar_type() == kBFloat16);
#endif
{
return (
use_mkldnn_bf16_matmul() &&
mat1.scalar_type() == kBFloat16 &&
mat2.scalar_type() == kBFloat16 &&
(!result.defined() || result.scalar_type() == kBFloat16) &&
mat1.numel() != 0 &&
mat2.numel() != 0 &&
checksize(mat1, mat2));
} else if constexpr (std::is_same_v<T, c10::Half>) {
dtype_check = dtype_check && use_mkldnn_fp16_matmul() &&
(mat1.scalar_type() == kHalf);
} else if constexpr (std::is_same_v<T, float>) {
dtype_check = dtype_check &&
(use_mkldnn_bf32_matmul() || use_mkldnn_tf32_matmul()) &&
(mat1.scalar_type() == kFloat);
}
}
bool use_mkldnn_fp16_matmul(
const Tensor& mat1,
const Tensor& mat2,
const Tensor& result) {
return (
use_mkldnn_fp16_matmul() &&
mat1.scalar_type() == kHalf &&
mat2.scalar_type() == kHalf &&
(!result.defined() || result.scalar_type() == kHalf) &&
mat1.numel() != 0 &&
mat2.numel() != 0 &&
checksize(mat1, mat2));
}
bool use_mkldnn_bf32_matmul(
const Tensor& mat1,
const Tensor& mat2,
const Tensor& result) {
return (
use_mkldnn_bf32_matmul() &&
mat1.scalar_type() == kFloat &&
mat2.scalar_type() == kFloat &&
(!result.defined() || result.scalar_type() == kFloat) &&
mat1.numel() != 0 &&
mat2.numel() != 0 &&
checksize(mat1, mat2));
if (!dtype_check) {
return false;
}
bool size_check =
mat1.numel() != 0 && mat2.numel() != 0 && checksize(mat1, mat2);
dtype_check = (mat1.scalar_type() == mat2.scalar_type()) &&
(!result.defined() || result.scalar_type() == mat1.scalar_type());
return dtype_check && size_check;
}
bool use_mkldnn_matmul(
const Tensor& mat1,
const Tensor& mat2,
const Tensor& result) {
return (use_mkldnn_bf16_matmul(mat1, mat2, result) || use_mkldnn_fp16_matmul(mat1, mat2, result) || use_mkldnn_bf32_matmul(mat1, mat2, result));
auto mat1_type = mat1.scalar_type();
if (mat1_type != kBFloat16 || mat1_type != kHalf || mat1_type != kFloat) {
return false;
}
AT_DISPATCH_FLOATING_TYPES_AND2(
kBFloat16, kHalf, mat1.scalar_type(), "use_mkldnn_matmul", [&] {
return use_mkldnn_typed_matmul<scalar_t>(mat1, mat2, result);
});
return false;
}
static void _mkldnn_matmul_i8i8i32_with_primitive(
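
Tying this back to the Context.cpp change earlier in the diff, a small sketch of the user-visible effect on CPU float32 matmul; whether the reduced-precision path actually engages depends on AMX support and the oneDNN build.

```python
import torch

a = torch.randn(256, 256)
b = torch.randn(256, 256)

# Per the Context.cpp change, "high" now maps the mkldnn matmul backend to tf32
# (previously ieee), while "medium" keeps mapping it to bf16.
torch.set_float32_matmul_precision("high")
y_reduced = a @ b

torch.set_float32_matmul_precision("highest")   # strict ieee fp32
y_ieee = a @ b

print((y_reduced - y_ieee).abs().max())  # small nonzero difference if the reduced path kicked in
```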


@ -29,6 +29,11 @@ bool use_mkldnn_bf32_matmul(
const Tensor& mat2,
const Tensor& result_opt);
bool use_mkldnn_tf32_matmul(
const Tensor& mat1,
const Tensor& mat2,
const Tensor& result_opt);
// Try running mkldnn optimized gemm, or returns false if naive gemm would be faster
bool mkldnn_bf16_gemm(
TransposeType transa, TransposeType transb,
@ -62,7 +67,7 @@ oneDNN implicit reduced precision arithmetic feature
https://github.com/mgouicem/oneDNN/tree/mgouicem/rfcs/implicit_downconvert/rfcs/20210301-computation-datatype
to allow implicitly cast data type from FP32 to BF16 in onednn compute primitives
*/
bool mkldnn_bf32_gemm(
bool mkldnn_reduced_f32_gemm(
TransposeType transa, TransposeType transb,
int64_t m, int64_t n, int64_t k,
float alpha,

Some files were not shown because too many files have changed in this diff.