Compare commits

..

431 Commits

Author SHA1 Message Date
1e5f51ed2c Assorted fixes 2025-06-25 14:03:13 -07:00
2a738143b4 [DONT MERGE] Diffusion models benchmarking for compile time
ghstack-source-id: 1591c4dddcf9d828191ed7bb54ec98669e074816
Pull-Request: https://github.com/pytorch/pytorch/pull/155866
2025-06-20 22:12:55 -07:00
965f830bc9 [invoke_subgraph] Add config flag to control support of input aliasing
ghstack-source-id: 79d3bf9f22aecdaa5be6d95a2cb51d6b4d1a47a0
Pull-Request: https://github.com/pytorch/pytorch/pull/156450
2025-06-20 22:12:55 -07:00
a097a0d3b2 [dynamo] Guard eagerly on list objects to avoid guard on getitem index
ghstack-source-id: 2c5a7e61f7395361508f8f4ec1f3ab8b7449385a
Pull-Request: https://github.com/pytorch/pytorch/pull/156531
2025-06-20 22:12:54 -07:00
255c2b0d6c [compile] Release nested_compile_region API
ghstack-source-id: 913ef891c853836dfff75afb943359f6c9ad12db
Pull-Request: https://github.com/pytorch/pytorch/pull/156449
2025-06-20 22:12:54 -07:00
e98dd95446 [nativert] Move SerialGraphExecutor to PyTorch core (#156459)
Summary: `SerialGraphExecutor` inherits from `GraphExecutorBase` and executes all nodes in the graph in a serial manner

Test Plan:
CI

Rollback Plan:

Differential Revision: D76917966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156459
Approved by: https://github.com/zhxchen17, https://github.com/jingsh
2025-06-21 01:32:06 +00:00
a67eb1a0d6 [ez] remove unused functions (#156466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156466
Approved by: https://github.com/jingsh
2025-06-21 00:38:34 +00:00
2ee23175d9 [dynamo][guards] Catch exception and return false in the backend match (#156341)
Its difficult to write a test. I found this while debugging a sefgault.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156341
Approved by: https://github.com/williamwen42
2025-06-21 00:13:26 +00:00
0f0c010714 [c10d] init_process_group supports index-only device id (#156214)
Before:
```
acc = torch.accelerator.current_accelerator()
if acc:
  local_idx = ...
  dist.init_process_group(
    device_id=torch.device(acc.type, local_idx)
  )
```
After:
```
dist.init_process_group(device_id=local_idx)
```

That is, `init_process_group` checks `torch.accelerator.current_accelerator()` internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156214
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-06-21 00:02:37 +00:00
fbbab794ef [ONNX] Implement Attention-23 (#156431)
Implement Attention-23 using sdpa and flexattention.

- I used copilot for this.
- Also updated the conversion logic to remove trailing None inputs.

@gramalingam @kunal-vaishnavi @titaiwangms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156431
Approved by: https://github.com/titaiwangms

Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-20 23:54:57 +00:00
0ad88a2224 Support environement var for autotune log (#156254)
Summary: Titled

Test Plan:
See the scadcastle signal

Rollback Plan:

Differential Revision: D76860928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156254
Approved by: https://github.com/Mingming-Ding
2025-06-20 23:06:33 +00:00
6098209bff [BE][5/X] Phase out usage of use_max_autotune() (#156269)
These look to be the last call sites using `use_max_autotune(...)`, so remove those and `use_max_autotune(...)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156269
Approved by: https://github.com/masnesral
2025-06-20 22:37:45 +00:00
5ab257c74c [invoke_subgraph] Make invoke_subgraph cacheable (#156448)
Its unclear to me what happens if the subgraph itself is not cacheable. Imo, there is nothing special about invoke_subgraph to prevent any caching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156448
Approved by: https://github.com/oulgen, https://github.com/zou3519
2025-06-20 21:20:23 +00:00
e2351f2dcf fix apparent copy-paste bug in log_softmax reduced-precision fp kernel (#156379)
This looks like a bug. Check if trying to fix it breaks existing tests; if not, will look into why no test coverage caught it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156379
Approved by: https://github.com/janeyx99
2025-06-20 20:54:53 +00:00
b8fc5e0c0d skip flaky test in CPython 3.13 tests (#155561)
Changed files:
* test_math.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155561
Approved by: https://github.com/zou3519
2025-06-20 20:25:35 +00:00
754c04aa06 Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)"
This reverts commit 0aed855b2bde6d9bd045bb20cc24544a9f2fb72b.

Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/ezyang due to regresses functorch_maml_omniglot ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2992685744))
2025-06-20 20:18:24 +00:00
de1930a429 Add ONNX dynamo metadata documentation (#155816)
Describe auto-generated metadata when calling torch.onnx.export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155816
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-20 20:12:22 +00:00
a69e27ca5a Remove unused MultiKernelCall import from inductor codegen (#156158)
Since it's now actually used within async_compile.multi_kernel

```
    def multi_kernel(self, *args, **kwargs) -> Any:
        from torch._inductor.codegen.multi_kernel import MultiKernelCall

        # no need to call this in parallel since the sub-kernels are already parallel tasks
        return MultiKernelCall(*args, **kwargs)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156158
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-06-20 19:55:24 +00:00
e5ea24fb27 [nativert] Move auto_functionalize_kernel (#156454)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:

fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Copied from original auto_functionalize Diff Summary D53776805:

This is a non-functional kernel implementation for auto_functionalize

In AutoFunctionalizeKernel, I directly call the underlying target without making a clone of mutating inputs.

This would mutates the input tensors inplace, which is unsafe in general.

However, Sigmoid is not doing any graph optimization, or node reordering at the moment, so it's ok do take this short cut.

In the proper functional implementation, it will

make a clone of the mutating input tensor

return these new instance of tensors as AutoFunctionalizeKernel output.

If the original exported program has some "bufferMutation" or "userInputMutation" fields, it will also need to honor such mutations in Sigmoid.

Test Plan: See internal for test plan

Differential Revision: D76926383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156454
Approved by: https://github.com/zhxchen17
2025-06-20 19:53:16 +00:00
eb331b59fe Add shim fallback for narrow (#156496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156496
Approved by: https://github.com/albanD
2025-06-20 19:47:00 +00:00
6ed85bfe6a Refine alignment check along dynamic dimension for grouped MMs (#155466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155466
Approved by: https://github.com/ngimel
2025-06-20 19:42:57 +00:00
ef6d2cee7a [BE][MPS] Refactor core matmul logic into matmul_core (#155969)
In preparation of adding integer addmm, move matmul computation part into matmul_inner function

Change callstack from group_id, thread_id_in_group to thread_id, threadid_in_group, which eliminates the need of calculating the index
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
2025-06-20 18:54:38 +00:00
18e4c461fb Update index.md (#155143)
Related to: https://github.com/pytorch/pytorch/issues/152134
Update to index.md to add language for Stable and Unstable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155143
Approved by: https://github.com/AlannaBurke, https://github.com/atalman

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-20 18:53:32 +00:00
502486d946 [PT2]Add weight and constant config path template (#156359)
Summary: At title.

Test Plan:
N/A

Rollback Plan:

Differential Revision: D76925510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156359
Approved by: https://github.com/SherlockNoMad
2025-06-20 18:46:01 +00:00
4b6cbf528b Add C shim fallback for fill_ (#156245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156245
Approved by: https://github.com/desertfire
2025-06-20 18:45:48 +00:00
208ec60e72 Revert "[BE] Make Eigen an optional dependency (#155955)"
This reverts commit 1b50c12584909bda00009f4f0fd0d38ec792d019.

Reverted https://github.com/pytorch/pytorch/pull/155955 on behalf of https://github.com/atalman due to need to revert eigen test ([comment](https://github.com/pytorch/pytorch/pull/155955#issuecomment-2992512124))
2025-06-20 18:43:52 +00:00
d309cd1d50 Revert "[BE][MPS] Refactor core matmul logic into matmul_core (#155969)"
This reverts commit 769d754ab2469813a3b790ec58c25c466099dd3d.

Reverted https://github.com/pytorch/pytorch/pull/155969 on behalf of https://github.com/atalman due to need to revert eigen test ([comment](https://github.com/pytorch/pytorch/pull/155969#issuecomment-2992502683))
2025-06-20 18:40:38 +00:00
96d082d06b Revert "[InductorBench] Fix accuracy validation logic for MPS (#156385)"
This reverts commit 242eb19c8383b4b197963a8a564475d52c85ac66.

Reverted https://github.com/pytorch/pytorch/pull/156385 on behalf of https://github.com/malfet due to Has some bug in error handling ([comment](https://github.com/pytorch/pytorch/pull/156385#issuecomment-2992441769))
2025-06-20 18:17:18 +00:00
39270430c9 [inductor] force min num-split (off by default) (#155941)
This is a fix for the 10% QPS regression of some internal model (internal doc: [here](https://docs.google.com/document/d/19EiSZSS_SNUNfRg3jmevyrDs9nVpyvyGX_LHfiz-SbU/edit?tab=t.0#heading=h.dim0r28ztzu5) and [here](https://docs.google.com/document/d/1DjRWJPl1cgpceaj8YXTyw6FubGb43Vw-lTAETF9XXnI/edit?tab=t.0#heading=h.ld0vvn8o77sp) ).

The regression is caused by un-representable example inputs for compilation with dynamic shapes. While the general problem is hard to solve and requires more work, for this specific one, there is a quick fix. When we compile LayerNormBackward with small xnumel and large rnumel, we do split reduction. With un-representative inputs, rnumel may be something in the range like 4K and we pick a small num-split (9 in this specific case). Later on when we get an inputs with larger rnumel (100K range. no recompile due to dynamic shape enabled), the small num-split does not introduce enough parallelism and cause sub-optimal performance.

 The quick fix is to force a minimum value for num_split. Let's say we split a reduction [xnueml, rnueml] to two in this order:
- [xnumel * num_split, rnumel / num_split]
- [xnumel, num_split]

A larger num_split always introduce more parallelism for kernel 1. It may results in more work in kernel 2. But if we set the minimum num_split to something not too large (like 256), for kernel2 each row may still be able to get done by reduction with a few or even a single warp. There may not be slow down for kernel 2.

Here are some benchmarking results.
```
import torch
from triton.testing import do_bench
import functools
from torch._inductor import config
from torch._dynamo.decorators import mark_dynamic
import os

@torch.compile(dynamic=True)
def f(x):
    return x.sum(dim=0)

N = 512
C = functools.partial(torch.randn, device="cuda")
x_small = C(4096, N)
x_large = C(4096 * 1000, N)

if os.getenv("HINT_WITH_SMALL_INPUT") == "1":
    x = x_small
else:
    x = x_large

mark_dynamic(x, 0)
f(x)

ms = do_bench(lambda: f(x_large))

# 4.03ms if hint with large input. Output code: https://gist.github.com/shunting314/0be562a0c14f8ec0852b12bbf53d7a15
# 8.32ms if hint with small input. Output code: https://gist.github.com/shunting314/79b924c266d5c562703c3bdfb48d8272
# 3.92ms if hint with small input, and force min num split: Output code: https://gist.github.com/shunting314/c82917a1849b698bf4d2be2fde2fd2ba
print(ms)
```
This test mimic what we see in the original problem.

- If we compile with large inputs and benchmark for large inputs, latency is 4.03ms
- if we compile with small input but benchmark for large inputs, we get more than 2x slowdown. latency is 8.32ms
- with the fix, even if we compile with small input and benchmark for large inputs, latency is 3.92ms. The perf is slightly better than the first case. So it's possible that the heuristic to decide num-split has room to improve

The minimum num-split restriction could be applied for dynamic shape case solely, but I found it can also help for static shape cases a little bit. So I plan to apply it without checking dynamic shape for now unless I see red signals in thorough perf test.
- Outer reduction with static shape: https://gist.github.com/shunting314/6a670a818e63533479399c4dbea5b29a . The fix improve perf from 0.01 ms to 0.009 ms
- Inner reduction with static shape: https://gist.github.com/shunting314/f12f20099126130b953e55ad325c0f62  Perf is neutral (0.011 ms v.s. 0.011ms)

A thorough perf test is running here: https://github.com/pytorch/pytorch/actions/runs/15642912325

# Update for not applying the change to static shape:
from the perf test result [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2009%20Jun%202025%2020%3A57%3A15%20GMT&stopTime=Mon%2C%2016%20Jun%202025%2020%3A57%3A15%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=62b8e191e027842d402fb046a429732616f87570&rBranch=main&rCommit=5b9db4335e61c1c903cb0769282cbea588e49036), it looks like the change hurts perf for static shape case. I think one reason is the change may increase the number of kernels and lose some fusion opportunities. Check the following code for example:
```
import torch
from torch._inductor import config

aten = torch.ops.aten

def f(x):
    return aten.bernoulli(x).sum()

x = torch.randn(8000 * 3, dtype=torch.bfloat16, device="cuda")
torch.compile(f)(x)
```

With the change the bernoulli kernel would NOT be able to fuse with the first layer reduction due to 8000 * 3 is not divisible by 256. Potentially we could improve the change to always pick num-split greater than 256 and divisible by rnumel . But I'll simply apply the change for dynamic shape for now since that's the original issue.

Another perf test only applying min-num-split to dynamic shape [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2011%20Jun%202025%2018%3A14%3A04%20GMT&stopTime=Wed%2C%2018%20Jun%202025%2018%3A14%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=e7b2cf55f30a585acd4d907fc9127fcb30a256cc&rBranch=main&rCommit=d3d655ad14ee4cd1c135ac57bbf75d5623fc9fa6)

Differential Revision: [D76625617](https://our.internmc.facebook.com/intern/diff/D76625617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155941
Approved by: https://github.com/jansel, https://github.com/bobrenjc93
2025-06-20 18:01:28 +00:00
55dae0bf7a Add a basic shim and stable::Tensor is_contiguous API (#156228)
Add a limited is_contiguous in shim, stable::Tensor API with a test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156228
Approved by: https://github.com/desertfire
2025-06-20 17:59:52 +00:00
49ee1e7106 [CI] Reuse old whl: loosen check for deleted files, do not handle renames (#156138)
Make the check for deleted files only be for files in the torch folder since docs only changes could not get through this
Use `--no-renames` to make both the old name and the old name show up in the diff.  Without it I think only the new name shows up in git diff
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156138
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/cyyever
2025-06-20 17:58:04 +00:00
e31f205292 [Inductor] Adjust boundary checking of dimensions using YBLOCK (#149504)
Apply the same logic introduced in https://github.com/pytorch/pytorch/pull/139751 to triton kernels using block ptrs. Here, if ynumel / YBLOCK > max_y_grids, dimensions dependent on YBLOCK need to be boundary checked, even if the block shape in such dimensions is a multiple of an expression in YBLOCK. This is because ynumel / YBLOCK % get_max_y_grids() may not be zero, so redundant programs will be launched that will attempt to read / write OOB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149504
Approved by: https://github.com/blaine-rister

Co-authored-by: blaine-rister <145300525+blaine-rister@users.noreply.github.com>
2025-06-20 17:43:38 +00:00
d83ff89d3b Add toggle functionality for XPU profiler (#155135)
Fixes #154898 by adding ability to toggle XPU profiler on and off (which has already been added in pytorch/kineto#1088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155135
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-06-20 17:27:48 +00:00
1b50c12584 [BE] Make Eigen an optional dependency (#155955)
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen ever and gitlab can be down when github is up, which causes spurious CI failures in the past, for example.

Remove eigen submodule and replace it with eigen_pin.txt

Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
ghstack dependencies: #155947, #155954
2025-06-20 17:21:27 +00:00
63360e64da [BE][Easy] do not install yanked types-pkg-resources in lint environment (#156462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156462
Approved by: https://github.com/ezyang
2025-06-20 16:00:43 +00:00
1036f6d114 Revert "[ROCm] Bump AOTriton to 0.10b (#156290)"
This reverts commit 34d8e64ef64d88324092a2028884c54c13e086b3.

Reverted https://github.com/pytorch/pytorch/pull/156290 on behalf of https://github.com/atalman due to failing multiple internal tests ([comment](https://github.com/pytorch/pytorch/pull/156290#issuecomment-2992072727))
2025-06-20 15:35:25 +00:00
b4442f42a9 Revert "Upgrade to DLPack 1.0. (#145000)"
This reverts commit 6e185c53124e1b5a0fe391959060c1249178bcb6.

Reverted https://github.com/pytorch/pytorch/pull/145000 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/145000#issuecomment-2992055400))
2025-06-20 15:32:47 +00:00
edd45f3a02 Revert "[Precompile] Hook up backend="inductor" (#155387)"
This reverts commit 2c68c3e8d5e9a235f5861be6486de4959f80c840.

Reverted https://github.com/pytorch/pytorch/pull/155387 on behalf of https://github.com/atalman due to dynamo/test_precompile_context.py::PrecompileContextTests::test_basic [GH job link](https://github.com/pytorch/pytorch/actions/runs/15772892021/job/44464141039) [HUD commit link](2c68c3e8d5) ([comment](https://github.com/pytorch/pytorch/pull/155387#issuecomment-2992044073))
2025-06-20 15:30:04 +00:00
e1f28fe17b add device generalisation support for distributed tests (#152471)
### MOTIVATION
To generalize Distributed test cases for non-CUDA devices

### CHANGES

- test/distributed/optim/test_zero_redundancy_optimizer.py
- test/distributed/test_c10d_logger.py
- test/distributed/test_compute_comm_reordering.py

Replaced hard coded device names with get_devtype from torch.testing._internal.common_fsdp.
DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions.

- torch/testing/_internal/common_distributed.py

extended common utility functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471
Approved by: https://github.com/d4l3k
2025-06-20 07:35:42 +00:00
0aed855b2b [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-20 07:03:29 +00:00
24dc33b37b [dynamo] handle fullgraph toggle using nested torch.compile (#155166)
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings was strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.

Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
2025-06-20 07:03:29 +00:00
537b0877a8 [dynamo] fix set_fullgraph for nested calls (#154782)
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
2025-06-20 07:03:16 +00:00
2c372a0502 [dynamo] add set_fullgraph decorator/context manager (#154289)
Implements https://github.com/pytorch/pytorch/issues/144908.

Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
2025-06-20 07:03:07 +00:00
b46eb1ccaf [dynamo] control one_graph behavior additionally through config (#154283)
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-06-20 07:02:57 +00:00
2c68c3e8d5 [Precompile] Hook up backend="inductor" (#155387)
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.

One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
2025-06-20 06:38:29 +00:00
d5b4a32960 [BE] fix PYPROJECT linting errors in test/ and tools/ (#156021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156021
Approved by: https://github.com/Skylion007
2025-06-20 06:19:05 +00:00
4cbbc8b458 [MPS] Implement backward pass for interpolate_trilinear (#156373)
Backwards pass simply iterates over all 8 points current point contributed to, and back propagates them with the respective weights

TODO: Benchmark the performance of similar loop for the forward pas (i.e. compiler should be able to do loop unrolling, so no point of unrolling it by hand)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156373
Approved by: https://github.com/dcci
ghstack dependencies: #156375
2025-06-20 05:41:24 +00:00
c37ddcaefb Fix torchgen update-aoti-shim (#156323)
will remove the fill changes before landing and let Jane merge her changes!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156323
Approved by: https://github.com/janeyx99
2025-06-20 05:23:06 +00:00
f7a5ad6c29 [Inductor][CPP] Fix WOQ int4 accuracy issue when NC large than one (#156407)
**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. This accuracy issue is exposed by [PR #156174](https://github.com/pytorch/pytorch/pull/156174), which changes `block_N` from 64 to 32. This change increases the likelihood of `Nc_block` being greater than 1, making it more likely to trigger the issue. This PR will fix this accuracy issue.

**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156407
Approved by: https://github.com/CaoE
2025-06-20 03:08:02 +00:00
72c8751b61 Align meta deducing for fft_r2c with fft_r2c_mkl on XPU (#156048)
There is a memory layout mismatching between `fft_r2c` XPU and Inductor meta deducing.
Original `fft_r2c` Inductor meta deducing for XPU backend is aligned with CPU (fallback). This PR is to correct the Inductor meta deducing and update the torch-xpu-ops commit to [intel/torch-xpu-ops@`3a9419c`](3a9419c8bb).
The XPU implementation first performs the R2C transform on the last dimension, followed by iterative C2C transforms on the remaining dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156048
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel
2025-06-20 01:41:03 +00:00
159a39ad34 Add an option for cpp_wrapper to compile entry and kernel separately (#156050)
Fixes #156037.
Compiling entry and kernel separately has a non-negligible impact on the performance. This PR is to add an option for cpp_wrapper to control whether to compile entry and kernel separately, and turn it off by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156050
Approved by: https://github.com/leslie-fang-intel, https://github.com/benjaminglass1, https://github.com/jansel
2025-06-20 01:11:16 +00:00
ebab279942 Forward fix inductor benchmark after #150287 (#156455)
Looks like https://github.com/pytorch/pytorch/pull/150287 stack fixed some inductor tests
HUD: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor-periodic%20%2F%20linux-jammy-cpu-py3.9-gcc11-inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156455
Approved by: https://github.com/huydhn
2025-06-20 00:04:15 +00:00
cyy
3c2324c64a [2/N] Fix cppcoreguidelines-init-variables suppression (#146237)
This PR removes all `cppcoreguidelines-init-variables` suppressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146237
Approved by: https://github.com/ezyang
2025-06-19 23:26:42 +00:00
52f873adc2 Add logging for async compile worker statistics (#155820)
Add some on-exit logging to the async compile workers. When you use `TORCH_LOGS=async_compile` (or `all`) it will now report how many workers were enqueued & dequeued (should be the same) as well as queuing time (how long workers sat on the queue before starting to run) and maximum depth (how many workers were waiting to start.

Tested manually by running a larger internal model and then lowering the number of available workers to see the time and depth get longer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155820
Approved by: https://github.com/masnesral
2025-06-19 23:10:15 +00:00
c60d8188d2 [nativert] Move GraphExecutorBase to PyTorch core (#156196)
Summary:
Moves GraphExecutorBase class to PyTorch core.
GraphExecutorBase is a lightweight abstraction to execute a graph with  execution frames without actually owning the graph nor the weights. This is introduced to decouple the state management of the top level runtime from the kernel executions so that sub graphs from higher order ops can be supported.

Torch Native Runtime RFC: pytorch/rfcs#72

Test Plan:
CI

Rollback Plan:

Differential Revision: D76830436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156196
Approved by: https://github.com/zhxchen17
2025-06-19 22:42:35 +00:00
34d8e64ef6 [ROCm] Bump AOTriton to 0.10b (#156290)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-06-19 21:13:58 +00:00
3644b41a7c [ONNX] Note on attention op symbolic function (#156441)
Follow up https://github.com/pytorch/pytorch/pull/156367
Explain why num_heads is provided when ONNX Attention op does not need it in torch case: The thread: https://github.com/pytorch/pytorch/pull/156367#discussion_r2155727038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156441
Approved by: https://github.com/justinchuby
2025-06-19 21:00:05 +00:00
443b5b43c3 xpu: fix AOT compilation in sycl cpp extension (#156364)
Commit fixes AOT compilation in sycl cpp extension which got accidentally dropped on aca2c99a652 (fallback to JIT compilation had happened). Commit also fixes override logic for default sycl targets allowing flexibility to specify targets externally. Further, commit extends test coverage to cover such a case and fixes issue in the test where consequent tests executed same (first) compiled extension due to name conflicts.

Fixes: #156249
Fixes: aca2c99a652 ("xpu: get xpu arch flags at runtime in cpp_extensions (#152192)")

CC: @pengxin99, @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156364
Approved by: https://github.com/ezyang
2025-06-19 20:11:38 +00:00
d32deb664a [c10d] Disable NCCL NVLS when using deterministic mode (#156381)
via setting env `NCCL_ALGO=^NVLS`.

Note that this setting must be made before the first NCCL init. Otherwise, it won't take effect.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156381
Approved by: https://github.com/ngimel
2025-06-19 20:09:24 +00:00
69f2e09cc2 Add more shards to H100 benchmark, and also run it more frequently (#156429)
There are 32 H100 `linux.aws.h100` and they are still not fully utilized with more than half staying idle, so we could add more shards to finish the whole suite within 4 hours.  I add 1 more for `TIMM` and 3 more for `TorchBench` using the duration from a sample run https://github.com/pytorch/pytorch/actions/runs/15753185459/job/44411825090

With this computing power, we could also run the whole suite every 4 hours now.  I could run this less frequently later if I see queueing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156429
Approved by: https://github.com/atalman
2025-06-19 20:02:56 +00:00
aac0e8f0e9 [build] Create target for flash attention (#156235)
Create a target for flash attention? so it can be built using ninja flash_attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156235
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-06-19 20:02:38 +00:00
c2f4cc59a7 [MPS] Fix bug in 3d coords calculation (#156375)
Which was not caught by CI beforehand, as all 3D examples right now are symmetric, so add an uneven shape to `sample_inputs_interpolate`

Though it's indirectly tested by `test_upsample_nearest3d` inductor test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156375
Approved by: https://github.com/atalman
2025-06-19 19:56:15 +00:00
c0ee01c2fb tools/nightly.py: only download torch via pip and install dependenices via uv (#156409)
Setup time (cpu-only): 70s -> 27.6s -> 17.4s

The tool can setup the pinned NVIDIA dependencies correctly:

```console
$ make setup-env-cuda PYTHON="${HOMEBREW_PREFIX}/bin/python3.13" && source venv/bin/activate
make setup-env PYTHON="/home/linuxbrew/.linuxbrew/bin/python3.13" NIGHTLY_TOOL_OPTS="pull --cuda"
make[1]: Entering directory '/home/PanXuehai/Projects/pytorch'
/home/linuxbrew/.linuxbrew/bin/python3.13 tools/nightly.py pull --cuda
log file: /home/PanXuehai/Projects/pytorch/nightly/log/2025-06-19_21h16m16s_94cd1471-4d0f-11f0-b120-b88584c06696/nightly.log
Creating virtual environment
Removing existing venv: /home/PanXuehai/Projects/pytorch/venv
Creating venv (Python 3.13.4): /home/PanXuehai/Projects/pytorch/venv
Installing packages
Upgrading package(s) (https://download.pytorch.org/whl/nightly/cu128):
  - uv
  - pip
  - setuptools
  - packaging
  - wheel
  - build[uv]
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://download.pytorch.org/whl/nightly/cu128
Collecting uv
  Using cached f2e96cec5e/uv-0.7.13-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.8 MB)
Requirement already satisfied: pip in ./venv/lib/python3.13/site-packages (25.1.1)
Collecting setuptools
  Using cached 17031897da/setuptools-80.9.0-py3-none-any.whl (1.2 MB)
Collecting packaging
  Using cached 38679034af/packaging-25.0-py3-none-any.whl (66 kB)
Collecting wheel
  Using cached 87f3254fd8/wheel-0.45.1-py3-none-any.whl (72 kB)
Collecting build[uv]
  Using cached 80633736cd/build-1.2.2.post1-py3-none-any.whl (22 kB)
Collecting pyproject_hooks (from build[uv])
  Using cached 12818598c3/pyproject_hooks-1.2.0-py3-none-any.whl (10 kB)
Installing collected packages: wheel, uv, setuptools, pyproject_hooks, packaging, build
Successfully installed build-1.2.2.post1 packaging-25.0 pyproject_hooks-1.2.0 setuptools-80.9.0 uv-0.7.13 wheel-0.45.1
Installing packages took 6.251 [s]
Creating virtual environment took 9.050 [s]
Downloading packages
Downloading package(s) (https://download.pytorch.org/whl/nightly/cu128): torch
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://download.pytorch.org/whl/nightly/cu128
Collecting torch
  Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250619%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB)
Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250619%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl (1040.3 MB)
Saved /tmp/pip-download-xeqmhrww/torch-2.8.0.dev20250619+cu128-cp313-cp313-manylinux_2_28_x86_64.whl
Successfully downloaded torch
Downloaded 1 file(s) to /tmp/pip-download-xeqmhrww:
  - torch-2.8.0.dev20250619+cu128-cp313-cp313-manylinux_2_28_x86_64.whl
Downloading packages took 6.284 [s]
Unpacking wheel file
Unpacking to: /tmp/wheel-kugk2os0/torch-2.8.0.dev20250619+cu128...OK
Unpacking wheel file took 15.107 [s]
Installing dependencies
Installing packages
Installing package(s) (https://download.pytorch.org/whl/nightly/cu128):
  - filelock
  - typing-extensions>=4.10.0
  - setuptools; python_version >= "3.12"
  - sympy>=1.13.3
  - networkx
  - jinja2
  - fsspec
  - nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cuda-runtime-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cuda-cupti-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cudnn-cu12==9.10.2.21; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cublas-cu12==12.8.4.1; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cufft-cu12==11.3.3.83; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-curand-cu12==10.3.9.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusolver-cu12==11.7.3.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusparse-cu12==12.5.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusparselt-cu12==0.7.1; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nccl-cu12==2.27.3; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvshmem-cu12==3.2.5; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvtx-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvjitlink-cu12==12.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cufile-cu12==1.13.1.3; platform_system == "Linux" and platform_machine == "x86_64"
  - pytorch-triton==3.3.1+gitc8757738; platform_system == "Linux"
  - numpy
  - cmake
  - ninja
  - packaging
  - ruff
  - mypy
  - pytest
  - hypothesis
  - ipython
  - rich
  - clang-format
  - clang-tidy
  - sphinx
Using Python 3.13.4 environment at: venv
Resolved 78 packages in 2.95s
Installed 76 packages in 93ms
 + alabaster==1.0.0
 + asttokens==3.0.0
 + attrs==24.2.0
 + babel==2.17.0
 + certifi==2024.8.30
 + charset-normalizer==3.3.2
 + clang-format==20.1.6
 + clang-tidy==20.1.0
 + cmake==3.25.0
 + decorator==5.2.1
 + docutils==0.21.2
 + executing==2.2.0
 + filelock==3.18.0
 + fsspec==2025.5.1
 + hypothesis==6.135.11
 + idna==3.10
 + imagesize==1.4.1
 + iniconfig==2.1.0
 + ipython==9.3.0
 + ipython-pygments-lexers==1.1.1
 + jedi==0.19.2
 + jinja2==3.1.6
 + markdown-it-py==3.0.0
 + markupsafe==2.1.5
 + matplotlib-inline==0.1.7
 + mdurl==0.1.2
 + mpmath==1.3.0
 + mypy==1.16.1
 + mypy-extensions==1.0.0
 + networkx==3.5
 + ninja==1.11.1.4
 + numpy==2.3.0
 + nvidia-cublas-cu12==12.8.4.1
 + nvidia-cuda-cupti-cu12==12.8.90
 + nvidia-cuda-nvrtc-cu12==12.8.93
 + nvidia-cuda-runtime-cu12==12.8.90
 + nvidia-cudnn-cu12==9.10.2.21
 + nvidia-cufft-cu12==11.3.3.83
 + nvidia-cufile-cu12==1.13.1.3
 + nvidia-curand-cu12==10.3.9.90
 + nvidia-cusolver-cu12==11.7.3.90
 + nvidia-cusparse-cu12==12.5.8.93
 + nvidia-cusparselt-cu12==0.7.1
 + nvidia-nccl-cu12==2.27.3
 + nvidia-nvjitlink-cu12==12.8.93
 + nvidia-nvshmem-cu12==3.2.5
 + nvidia-nvtx-cu12==12.8.90
 + parso==0.8.4
 + pathspec==0.12.1
 + pexpect==4.9.0
 + pluggy==1.6.0
 + prompt-toolkit==3.0.51
 + ptyprocess==0.7.0
 + pure-eval==0.2.3
 + pygments==2.19.1
 + pytest==8.4.1
 + pytorch-triton==3.3.1+gitc8757738
 + requests==2.32.3
 + rich==14.0.0
 + roman-numerals-py==3.1.0
 + ruff==0.12.0
 + snowballstemmer==3.0.1
 + sortedcontainers==2.4.0
 + sphinx==8.2.3
 + sphinxcontrib-applehelp==2.0.0
 + sphinxcontrib-devhelp==2.0.0
 + sphinxcontrib-htmlhelp==2.1.0
 + sphinxcontrib-jsmath==1.0.1
 + sphinxcontrib-qthelp==2.0.0
 + sphinxcontrib-serializinghtml==2.0.0
 + stack-data==0.6.3
 + sympy==1.14.0
 + traitlets==5.14.3
 + typing-extensions==4.14.0
 + urllib3==2.2.3
 + wcwidth==0.2.13
Installing packages took 3.080 [s]
Installing dependencies took 3.080 [s]
Pulling nightly PyTorch
Found released git version 5622038e20ddb12b9a011c9a9128190d71a21cba
Found nightly release version 2625c70aecc6eced1dbe108279feab7509733bef
Already up to date.
Pulling nightly PyTorch took 0.017 [s]
Moving nightly files into repo
Moving nightly files into repo took 4.898 [s]
Writing pytorch-nightly.pth
Writing pytorch-nightly.pth took 0.021 [s]
-------
PyTorch Development Environment set up!
Please activate to enable this environment:

  $ source /home/PanXuehai/Projects/pytorch/venv/bin/activate

make[1]: Leaving directory '/home/PanXuehai/Projects/pytorch'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156409
Approved by: https://github.com/ezyang
ghstack dependencies: #156408
2025-06-19 19:42:15 +00:00
71faa7e5b9 tools/nightly.py: use uv pip install instead of pip install (#156408)
Setup time: 70s -> 27.6s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156408
Approved by: https://github.com/ezyang
2025-06-19 19:42:15 +00:00
134dfb3fe6 [dynamo] Fix cycle reference problem caused by recursive collect_temp_source in codegen (#155791)
Recursive function collect_temp_source with closure in PyCodegen caused cycle reference issue when torch.compile is used.
This issue may cause major tensors will not freed timely even there are no user references to these tensors.

We saw OOM issues because of this problem in many cases including training and inference using torch.compile.
The fix is to use iterative function implementation to replace the recursive function implementation.

Fixes #155778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155791
Approved by: https://github.com/ezyang
2025-06-19 19:37:44 +00:00
e4c9f6d9a2 [nativert] Move c10_kernel (#156208)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:

fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:c10_kernel_test
```

Differential Revision: D76825830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156208
Approved by: https://github.com/zhxchen17
2025-06-19 17:36:23 +00:00
f402eed4d9 [ROCm] Enable BF16 NCHW Mixed batchnorm on MIOpen if ROCm>=6.4 (#154611)
This PR enables MIOpen for BF16 NCHW Mixed batchnorm if MIOpen version >=3.4 (ROCm >= 6.4)

CUDAHooks::versionMIOpen() was added to detect MIOpen version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154611
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2025-06-19 17:22:37 +00:00
085f270a00 [ROCm] Enable more parallelism for multi-dimensional reductions (#155806)
Enable more parallelism for multi-dimensional reductions. In the case of multi-dimensional reductions the grid often start with a single active block. In such cases, we need to allow the parallelism to be extended along the y-direction of the grid to avoid having a single block running.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155806
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
2025-06-19 17:19:40 +00:00
eaf704914e [aoti] package weights to disk and dedup (#155241)
We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponding to which weight name.

Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storate()`).

- Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile
- If we see `Weights` in aoti_files, we'll automatically package them to disk
- `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently.
- Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False.

Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights"
```

Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7

Rollback Plan:

Differential Revision: D74747190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241
Approved by: https://github.com/desertfire
2025-06-19 17:17:17 +00:00
6e185c5312 Upgrade to DLPack 1.0. (#145000)
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:

- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
  producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
    - Fallback to old implementation if no `max_version` or if version
      lower than 1.0
    - Check that the to-be-consumed capsule is of version up to 1.X

In order to accommodate these new specifications, this PR adds the
following main changes:

- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPAck capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
2025-06-19 16:27:42 +00:00
6eb6f198e1 update codebase structure documentation to include mps (#156297)
📚 The doc update

adding description about mps folder in code structure guide

@albanD @malfet @svekars @sekyondaMeta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156297
Approved by: https://github.com/ezyang
2025-06-19 16:16:29 +00:00
7f0cddfb55 [dynamo] Add documentation for guard_filter_fn (#156114)
Summary: Adding a section of doc for guard_filter_fn.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76756743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156114
Approved by: https://github.com/jansel
2025-06-19 16:13:12 +00:00
c9afcffed0 [AOTInductor] Call most runtime fallback ops without calling into Python (#154142)
Uses the new aoti_torch_call_dispatcher interface to call runtime fallback ops without calling back into Python.  This supports a limited subset of input and output datatypes, but a significant majority of remaining fallback ATen ops are covered.

Fixes #150988
Fixes #153478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154142
Approved by: https://github.com/desertfire
2025-06-19 15:27:15 +00:00
317af4c87b Revert "[cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)"
This reverts commit a5f59cc2eab3a5201712c52fe48c268357ba4f3c.

Reverted https://github.com/pytorch/pytorch/pull/156140 on behalf of https://github.com/atalman due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/156140#issuecomment-2988441548))
2025-06-19 15:09:29 +00:00
ab3393e923 [ROCm][CI] fix mi300 test failure after 6.4.1 update (#156368)
Fixes failures such as https://github.com/pytorch/pytorch/actions/runs/15739699156/job/44365395854: `test/test_linalg.py::TestLinalgCUDA::test_broadcast_batched_matmul_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156368
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-19 15:02:40 +00:00
0b62465b99 Revert "Refine alignment check along dynamic dimension for grouped MMs (#155466)"
This reverts commit 830a335a7da5fec00395d440ba568749cb4e2e9e.

Reverted https://github.com/pytorch/pytorch/pull/155466 on behalf of https://github.com/atalman due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/155466#issuecomment-2988285117))
2025-06-19 14:25:38 +00:00
fec8af8b98 [bugfix] [build] guard cuda version for ipc with fabric handle (#156394)
https://github.com/pytorch/pytorch/pull/156074 adds the support of ipc with fabric handle, but the code cannot compile for cuda < 12.3 (in particular, e.g. cuda 11.8).

this pr improves the support by adding some compilation-time check against cuda versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156394
Approved by: https://github.com/ngimel
2025-06-19 13:54:01 +00:00
769d754ab2 [BE][MPS] Refactor core matmul logic into matmul_core (#155969)
In preparation of adding integer addmm, move matmul computation part into matmul_inner function

Change callstack from group_id, thread_id_in_group to thread_id, threadid_in_group, which eliminates the need of calculating the index
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
2025-06-19 13:22:41 +00:00
8cb0c4a4da [Intel GPU][AOTI] Add xpu mkldnn ops support for AOTInductor. (#154586)
This PR is closely related to the previous one in the stack(https://github.com/pytorch/pytorch/pull/150287). The previous PR enabled MKLDNN ops for XPU, which caused several test cases to fail in test_aot_inductor.py. This PR addresses those failing cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154586
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #150287
2025-06-19 13:17:22 +00:00
83259cf7a7 [Inductor][Intel GPU] Support mkldnn Conv post op fusion for XPU. (#150287)
This PR adds support for MKLDNN Conv post-op fusion in the Inductor Intel GPU backend under freezing mode.
The implementation reuses the CPU's MKLDNN pattern fusion mechanism, as well as the corresponding Inductor unit tests for CPU MKLDNN pattern fusion.

The performance improvement:

| Suite       | Inductor Speedup (Baseline) | Inductor Speedup (Compared) | Acc Failed | Perf Failed | Inductor Perf Ratio | Speedup  |
|-------------|-----------------------------|------------------------------|------------|--------------|----------------------|----------|
| Huggingface | 2.134838                    | 2.125740314                  | 0          | 0            | 1.001462504          | 100.43%  |
| Torchbench  | 1.808558                    | 1.675100479                  | 0          | 0            | 1.075722187          | 107.97%  |
| Timm        | 2.343893                    | 2.070476653                  | 0          | 0            | 1.131023832          | 113.21%  |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150287
Approved by: https://github.com/ZhiweiYan-96, https://github.com/EikanWang, https://github.com/jansel
2025-06-19 13:17:22 +00:00
0504480f37 Add CUDA 12.9 libtorch nightly (#155895)
https://github.com/pytorch/pytorch/issues/155196

with libtorch docker added, we can add the build script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155895
Approved by: https://github.com/atalman
2025-06-19 13:15:42 +00:00
ccb1f687d6 Port two dynamo test cases for Intel GPU (#156056)
For https://github.com/pytorch/pytorch/issues/114850, we will port more cases to Intel GPU. This PR is for 2 dynamo cases. We adopted "torch.accelerator.current_accelerator()" to determine the backend, and added XPU support in decorators like @requires_gpu, also enabled XPU for some test path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156056
Approved by: https://github.com/guangyey, https://github.com/jansel
2025-06-19 12:49:04 +00:00
a8fe982993 Revert "[build] Create target for flash attention (#156235)"
This reverts commit 6d02321472ee0761092166dd273eb3ec386cf0c0.

Reverted https://github.com/pytorch/pytorch/pull/156235 on behalf of https://github.com/ZainRizvi due to Weird, but seems to have broken trunk: test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check [GH job link](https://github.com/pytorch/pytorch/actions/runs/15748768079/job/44390494621) [HUD commit link](6d02321472) ([comment](https://github.com/pytorch/pytorch/pull/156235#issuecomment-2987784207))
2025-06-19 11:47:27 +00:00
4da98351b9 [SymmMem] Add NVSHMEM PUT with Signal support to Triton (#156211)
Adds NVSHMEM PUT with Signal operation support for Triton kernels:

- Added`putmem_signal_block` core.extern wrapper for nvshmemx_putmem_signal_block
- Added kernel for 2-rank PUT operation with atomic SET signaling (`test_triton_put_signal_set`)
- Added kernel for 2-rank PUT operation with atomic ADD signaling (`test_triton_put_signal_add`)

**Tests:**
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py`

`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put_signal_set`
`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put_signal_add`

```python
@skipIfRocm
@requires_triton()
def test_triton_put_signal_set(self) -> None:
    @triton.jit
    def put_signal_kernel(dst_ptr, src_ptr, numel: tl.constexpr, sig_ptr,
                         signal_val: tl.constexpr, sig_op: tl.constexpr, peer: tl.constexpr):
        nvshmem.putmem_signal_block(dst_ptr, src_ptr, numel, sig_ptr, signal_val, sig_op, peer)

    # ... setup code ...

    val = 11
    inp = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(val)
    out = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(-1)  # destination buffer

    # Signal flag buffer - starts at 0, will be set to 1 upon completion
    flag = symm_mem.empty(1, dtype=torch.int64, device=self.device).fill_(0)

    peer = 1 - rank
    NVSHMEM_SIGNAL_SET = 0  # atomic set operation
    SIGNAL_VAL = 1  # completion signal value

    if rank == 0:
        # Rank 0 atomically: (1) puts data to rank 1, (2) sets rank 1's flag to 1
        put_signal_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, sig_ptr=sig_ptr,
                                    signal_val=SIGNAL_VAL, sig_op=NVSHMEM_SIGNAL_SET,
                                    peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   # Rank 1 can check flag to know data transfer completed!
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] flag buffer: {flag}")
```

```
[Rank 0] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:0', dtype=torch.int8)
[Rank 0] out buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:0', dtype=torch.int8)
[Rank 0] got data from peer 1
[Rank 0] flag buffer: tensor([0], device='cuda:0')
[Rank 1] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] out buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 1] flag buffer: tensor([1], device='cuda:1')

----------------------------------------------------------------------
Ran 2 tests in 17.046s

OK
```

Working as expected! Data is received, and flag set to 1 for completion signal!

```python
@skipIfRocm
@requires_triton()
def test_triton_put_signal_add(self) -> None:
   @triton.jit
   def put_signal_kernel(dst_ptr, src_ptr, numel: tl.constexpr, sig_ptr,
                        signal_val: tl.constexpr, sig_op: tl.constexpr, peer: tl.constexpr):
       nvshmem.putmem_signal_block(dst_ptr, src_ptr, numel, sig_ptr, signal_val, sig_op, peer)

   # ... setup code ...

   # Signal buffer (uint64 flag)
   flag = symm_mem.empty(1, dtype=torch.int64, device=self.device).fill_(0)

   peer = 1 - rank
   NVSHMEM_SIGNAL_ADD = 5  # atomic add operation
   SIGNAL_VAL = 16  # Signal value to add

   if rank == 0:
       # Rank 0 puts into Rank 1 and adds to signal
       put_signal_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, sig_ptr=sig_ptr,
                                   signal_val=SIGNAL_VAL, sig_op=NVSHMEM_SIGNAL_ADD,
                                   peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] flag buffer: {flag}")

```

```
[Rank 0] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:0', dtype=torch.int8)
[Rank 0] out buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:0', dtype=torch.int8)
[Rank 0] got data from peer 1
[Rank 0] flag buffer: tensor([0], device='cuda:0')
[Rank 1] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] out buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 1] flag buffer: tensor([16], device='cuda:1')

----------------------------------------------------------------------
Ran 1 test in 17.145s

OK
```

The flag transition from [0] → [16] confirms both data delivery and atomic signal completion in a single operation!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156211
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
2025-06-19 10:24:30 +00:00
348e2a76df s/defer_runtime_assert/guard_or_defer_runtime_assert (#156397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156397
Approved by: https://github.com/laithsakka
2025-06-19 10:18:28 +00:00
02080c2cd9 Fix num_heads inference in ONNX Attention-23 exporter (#156367)
Fixes issue in torch-onnx exporter for Attention: https://github.com/pytorch/pytorch/issues/156105

Previously the number of heads attributes inferred by the exporter is incorrect. It should be read from input dimension -3 not dimension 3:

![image](https://github.com/user-attachments/assets/26f10e15-bc98-42ac-807a-2e089a7d996a)

But in fact, [torch sdpa](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) doesn't support combined num_heads and head_size dimensions like [ONNX](https://onnx.ai/onnx/operators/onnx__Attention.html) does, so this num_heads attribute is not needed.

Extending support to rank>4 can be left as future work if there is use case for that. The translation logic will look like: Reshape(Q,K,V to 4d) -> Attention -> Reshape(Y to original rank).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156367
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2025-06-19 09:40:01 +00:00
8fcda2c60d [SymmMem] Add runtime detection of NVSHMEM (#156291)
so that we can pick the default backend for SymmetricMemory without
fully relying on env var `TORCH_SYMMMEM=CUDA | NVSHMEM`

On Python side, the following API is added:
`torch.distributed._symmetric_memory.is_nvshmem_available()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156291
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116, #156117
2025-06-19 08:26:11 +00:00
eabf7cd3c5 [export] update docs for Dims (#156262)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156262
Approved by: https://github.com/angelayi
2025-06-19 06:25:21 +00:00
ec0276103f [PGO] fix whitelist scalar bug (#156194)
Test Plan:
test_pgo

Rollback Plan:

Differential Revision: D76830552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156194
Approved by: https://github.com/bobrenjc93
2025-06-19 05:51:21 +00:00
1c960c5638 [Makefile] lazily setup lintrunner on first make lint run (#156058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156058
Approved by: https://github.com/ezyang
2025-06-19 05:43:35 +00:00
242eb19c83 [InductorBench] Fix accuracy validation logic for MPS (#156385)
As it does not support full fp64, validate against float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156385
Approved by: https://github.com/Skylion007
2025-06-19 05:37:51 +00:00
ce8180a61d [c10d] Disable stack trace call in logging (#156362)
Summary: We noticed std::future_error: Broken promise errors in logging, so let's disable for now and will investigate more.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76929722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156362
Approved by: https://github.com/fegin
2025-06-19 05:11:57 +00:00
a21806f038 [ez][export] Better error message for schema check in torch.export.load (#156361)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/156354

torch.export.load() only supports files generated by torch.export.save()

Test Plan:
CI

Rollback Plan:

Differential Revision: D76928725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156361
Approved by: https://github.com/zhxchen17
2025-06-19 04:50:56 +00:00
3f69e3b3a0 Add view_simple as meta function for view, and avoid calling reshape_view_helper for unbacked (#154757)
address https://github.com/pytorch/pytorch/issues/153303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757
Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel
2025-06-19 04:50:18 +00:00
3bec588bf5 [aot][ca] save bw_module in AOTAutogradCache (#151860)
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash).

The bw_module's generated code is then serialized, and at compiled autograd runtime, it is restored via symbolic_trace. This also means that presence of tensor constructors will be lifted as constants. Something we will address separately.

Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #156120
2025-06-19 03:47:41 +00:00
6d02321472 [build] Create target for flash attention (#156235)
Create a target for flash attention? so it can be built using ninja flash_attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156235
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-06-19 03:35:04 +00:00
77518d1a13 [CI] fix xpu-smi hang in XPU test container (#156171)
Apply same fix #155443 for XPU test container, refer https://github.com/pytorch/pytorch/actions/runs/15589866881/job/43907973867#step:15:911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156171
Approved by: https://github.com/huydhn
2025-06-19 02:48:11 +00:00
19ffdf4ea0 [dcp] add new checkpoint staging to preserve storage sharing and support mutable state_dicts (#155192)
Summary:
This implements staging in way that doesnt mess up checkpointing semantics. We want to be close to torch.save/load semantics and when async checkpointing is used it messes up shared storages, doesnt handle custom objects or tensors well. EG: users passes a state_dict with a cuda tensor in datatype.  this is deepcloned causing the staging tensor to be created on GPU. This can cause ooms is hard to debug.

This diffs hooks into deepcopy of storages to move them to cpu using the cached storages created for async checkpoint staging.  This allows reusing storages created for staging to avoid recreating them on each checkpoint while also being flexible enough to handle any changes - clean up old storages or create new ones as needed.

Lifetime of staging storages is tied to the original storage object. when the original storage object is gc-ed, we delete the corresponding staging storage from cache possibly causing it to gc-ed is there are no other references.  I am using data_ptr of the storage to keep track of this. Please share thoughts on this.
The alternative is to use fqn's instead of storage_id and verify the underlying storage object has same shape/size,etc to make the caching logic work. Current implementation is much simpler and cleaner.

The API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
 with staging_context(stager):
     cpu_state_dict = copy.deepcopy(state_dict)
```

Also, adds support for pinned-memory.

One problem this implementation does not address is that we lose the original device.

The only alternatives here are - pickle synchronously like torch.save but with special handling for storages. It is valuable to keep state_dict throughout the checkpointing process. so users can manipulate and debug as needed. so we need to unpickle in the background process. I think this is flexible, not performant and not very different to current solution but needs more code. One idea if we really want to address is this to stick the original device in a some variable on storage and then use it recover on load side. I think we do not need this for now and can be explicit about losing device type for async checkpointing.

Update:
Note: Due to reservations on hooking into deepcopy to customize it, the PR is now updated to use deepcopy like logic to clone the state_dict. There are some caveats to this solution:
1. Duplicated deepcopy code to hook into for tensors. There is a risk of this code getting outdated with python version changes. This is needed to handle several different types like NamedTuples, frozen dataclasses, nested dataclasses. deepcopy logic is relying on reduce_ex to get a function with which these can be constructed.
2. Since we are bypassing deepcopy and adding custom logic to clone a tensor, we are missing some of the functionality that exists in deepcopy for torch.Tensor like _clear_non_serializable_cached_data(), or other logic. Would like thoughts on which logic or if everything should be copied?
3. If any object implemented deepcopy , we will not be able to handle any tensors in the attrs with this logic because they likely just call copy.deepcopy on the attrs instead of this deepcopy logic. We are taking care of subclasses of torch.Tensor to workaround this.

The new API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
cpu_state_dict = copy.stage(state_dict)
```

Test Plan:
unit tests

Differential Revision: D75993324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155192
Approved by: https://github.com/mikaylagawarecki, https://github.com/pradeepfn
2025-06-19 02:04:21 +00:00
d4ad280429 Enable querying the build and runtime NCCL versions (#156305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156305
Approved by: https://github.com/wconstab, https://github.com/Skylion007, https://github.com/fegin
2025-06-19 02:00:08 +00:00
bc9bd2a766 Use linux.2xlarge runner (#156351)
The cuda version of this job uses a linux.2xlarge here so matching that to see if this job really needs a 12xlarge system or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156351
Approved by: https://github.com/jeffdaily, https://github.com/cyyever
2025-06-19 01:50:56 +00:00
e5a1197191 Fix fx tracing for mark dynamic (#156346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156346
Approved by: https://github.com/tony-ivchenko
2025-06-19 01:03:09 +00:00
6959b5febe Context on torch.cuda.memory._record_memory_history max_entries (#155889)
Context on torch.cuda.memory._record_memory_history buffer behavior

## Description

Answer questions:
- Can I keep _record_memory_history() always enabled with the default max_entries=sys.maxsize (9223372036854775807)? Will it consume a significant amount of CPU RAM?
- If I set max_entries to a lower value, e.g. 2000, will it keep the first 2000 entries and then stop recording or will it keep the most recent 2000 entries before each snapshot (fifo-style)?
- What is the expected size on disk of the snapshots? Some KBs, MBs?

Fixes #129674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155889
Approved by: https://github.com/ngimel
2025-06-19 00:44:43 +00:00
6303cc41b7 [ROCm] support CUDA_KERNEL_ASSERT using abort() (#155262)
We won't have the full message that __assert_fail would provide, but at least we won't silently do nothing.

Fixes #155045.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155262
Approved by: https://github.com/hongxiayang, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-18 23:52:35 +00:00
b8c2d4c259 add a corner test case of dynamic sizes for combo kernel (#156035)
Summary:
Added a unit test case for a corner case of combo kernel where all below are true:
1. more than 1 dimensions are dynamic size
2. no_x_dim presistent reduce op

Test Plan:
```
buck2 test mode/opt caffe2/test/inductor:combo_kernels -- test_dynamic_shapes_persistent_reduction_no_x_dim_2
```

Rollback Plan:

Differential Revision: D76699002

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156035
Approved by: https://github.com/mlazos
2025-06-18 22:57:09 +00:00
76d07e919f Unbreak //c10/util:base (#156216)
Missing dep.

Bifferential Revision: [D76840057](https://our.internmc.facebook.com/intern/diff/D76840057/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156216
Approved by: https://github.com/janeyx99, https://github.com/desertfire
2025-06-18 22:44:20 +00:00
9bfefda296 [DCP][PyTorch Staging APIs][2/x] Handle 0-elem case + ShardedTensor copy for staging (#156092)
Summary:
### Diff Context

1. Sometimes, a tensor might have non-zero size and 0 numel. In this case, pinning memory will fail
so we take a best guess at how to replicate the tensor below to maintain symmetry in the returned
state dict.

2. ShardedTensor copying was not handled originally in PyTorch state_dict copy APIs, handled in this diff.

Test Plan: CI

Differential Revision: D75553096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156092
Approved by: https://github.com/pradeepfn
2025-06-18 22:41:25 +00:00
a5b4463d60 [nativert] session state (#156190)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76827309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156190
Approved by: https://github.com/zhxchen17
2025-06-18 22:40:44 +00:00
6918758f55 [export] Update documents for ExportGraphSiganture (#156244)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/156184

The current document for ExportGraphSignature doesn't reflect `torch.export.export()` returns non-functional graph by default. And users may get confused.

Test Plan:
Document change only. CI

Rollback Plan:

Differential Revision: D76849097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156244
Approved by: https://github.com/yushangdi
2025-06-18 22:37:34 +00:00
1e474cc9c8 [ONNX] Fix how shapes are computed for float4 (#156353)
Changed the way we compute shapes for unpacked float4. Previously we always added a last dimension [2] to existing shape, but this doesn't really make sense because it prevents use from being able to represent any shape other than those with a list dim [2]. I updated the logic to be `[*shape[:-1], shape[-1]*2]` which doubles the last dimension. This is more in line with what we see in practice when people are using 4bit types, and it allows us to represent any shape with an even dimension at the end, which is much more reasonable in my opinion.

Also clarified in https://github.com/pytorch/pytorch/pull/148791#discussion_r2155395647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156353
Approved by: https://github.com/titaiwangms
2025-06-18 22:28:02 +00:00
9afee0fa96 [inductor] Set num_workers to number of available cpu divided by number of available gpu (#156201)
internal: https://fb.workplace.com/groups/1075192433118967/posts/1689562705015267/?comment_id=1690284241609780&notif_id=1749770611538976&notif_t=work_group_comment&ref=notif

Right now it doesn't have the divided by 2 logic yet. Not sure how to tell if we are on a dev machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156201
Approved by: https://github.com/masnesral
2025-06-18 22:15:32 +00:00
e5a0b73ce9 [MTIA Aten Backend] Migrate logical_and.out (#156286)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logical_and.out to in-tree

Differential Revision: [D76874551](https://our.internmc.facebook.com/intern/diff/D76874551/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156286
Approved by: https://github.com/nautsimon, https://github.com/jingsh
ghstack dependencies: #155634, #156046, #156047, #156283, #156284, #156285
2025-06-18 21:57:05 +00:00
bfccfa0b31 Revert "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit cf90c9f8d1632777ec5f4b6ccaa14bc5bf259e9c.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/atalman due to break internal tests ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-2985785811))
2025-06-18 21:48:50 +00:00
f5eb42e4c0 [nativert] move layoutplanneralgorithm to libtorch (#156205)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76831634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156205
Approved by: https://github.com/zhxchen17
2025-06-18 21:46:38 +00:00
d1c924c68a [MTIA Aten Backend] Migrate lt.Tensor_out / lt.Scalar_out (#156285)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate t.Tensor_out / lt.Scalar_out to in-tree.

Differential Revision: [D76873997](https://our.internmc.facebook.com/intern/diff/D76873997/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156285
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047, #156283, #156284
2025-06-18 21:40:26 +00:00
5c7e1d39ab [MTIA Aten Backend] Migrate logit (#156284)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logit to in-tree.

Differential Revision: [D76871451](https://our.internmc.facebook.com/intern/diff/D76871451/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156284
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047, #156283
2025-06-18 21:36:27 +00:00
706e236b08 [MTIA Aten Backend] Migrate logical_or.out / log.out / log2.out (#156283)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logical_or.out / log.out / log2.out to in-tree.

Differential Revision: [D76857072](https://our.internmc.facebook.com/intern/diff/D76857072/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156283
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047
2025-06-18 21:27:58 +00:00
ab81fb846c [MTIA Aten Backend] Migrate remainder.Tensor_out / reciprocal.out / neg.out (#156047)
Migrate remainder.Tensor_out / reciprocal.out / neg.out

Differential Revision: [D76696710](https://our.internmc.facebook.com/intern/diff/D76696710/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156047
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046
2025-06-18 21:17:34 +00:00
c26ce593d8 [MTIA Aten Backend] Migrate nan_to_num.out (#156046)
Migrate nan_to_num.out

Differential Revision: [D76696155](https://our.internmc.facebook.com/intern/diff/D76696155/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156046
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634
2025-06-18 21:14:13 +00:00
2f1c5c4131 [MTIA Aten Backend] Achieve CPU fallback by overriding registration (#155634)
# Context

MTIA supports CPU fallback, and people can set it using env vars. By migrating aten backend to in-tree, we also need to provide this support.

# This diff

Suggested by Alban(pytorch core), instead of skipping registration, this diff achieves CPU fallback by doing additional registration and override.

The benefits of this approach:
1. The previous solution has problem handling ops that have default dispatch key(e.g. CompositeImplicitAutograd), and can't really achieve CPU fallback.
2. The CPU fallback related logic can be aggregated in aten_mtia_cpu_fallback.cpp.

----------------

p.s. D76314740 also tried reusing the yaml parsing logic in mtia's python script, but realized that the env vars are only available in runtime but not compile/codegen time

Differential Revision: [D76376644](https://our.internmc.facebook.com/intern/diff/D76376644/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155634
Approved by: https://github.com/nautsimon, https://github.com/albanD
2025-06-18 21:10:18 +00:00
e99cc126a4 [AOTInductor] Reuse input information instead of directly applying unbacked_symint_fallback (#156133)
Summary:
When we encounter unbacked symint during autotuning, we try to reuse existing
symbols from user provided inputs, then fallback.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_triton_dynamic_launcher_grid

Rollback Plan:

Differential Revision: D76769711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156133
Approved by: https://github.com/jingsh
2025-06-18 20:53:21 +00:00
728cf6721e Revert "[PT2]load dense delta by trimming prefixes (#155872)"
This reverts commit c74fd35050a7241f0c439501ef735aa6cdde751f.

Reverted https://github.com/pytorch/pytorch/pull/155872 on behalf of https://github.com/malfet due to Broke lint, internal has been backed out ([comment](https://github.com/pytorch/pytorch/pull/155872#issuecomment-2985542895))
2025-06-18 20:05:56 +00:00
c74fd35050 [PT2]load dense delta by trimming prefixes (#155872)
Summary:
In PT2 with GPU with AOTI, weight names are like
```merge.submod_0._run_on_acc_0.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```

but when publishing delta snapshots, lowering is skipped so weights are like
```merge.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```

so when loading delta weights in original model runner, we need to:
1. Redo tensorName -> weight idx look up, because the weight ordering may be different.
2. use trimmed tensorName to find the correct weight path.

Note that with this diff, delta snapshot loading still does NOT use xl weights. This should be fine for now as we are still publishing full model with non-xl weights.

Test Plan:
Merge only:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODULE=merge
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13

CUDA_VISIBLE_DEVICES=2,3 buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=DenseOnly --baseNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}  --moduleName=${MODULE} --predictor_hardware_type 1 --submodToDevice "" --deltaNetFile /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/delta_${DENSE_DELTA_SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}
```

Local replayer:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13

USE_SERVABLE=0 HARDWARE_TYPE=0 DENSE_DELTA_IDS=${DENSE_DELTA_SNAPSHOT_ID} ENABLE_REALTIME_UPDATE=1 CUDA_VISIBLE_DEVICES=6,7 sh ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 7455

USE_SERVABLE=0 sh sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 10 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_mtml_ctr_instagram_model_500 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} true 7455
```

Rollback Plan:

Differential Revision: D76520301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155872
Approved by: https://github.com/SherlockNoMad
2025-06-18 19:13:22 +00:00
48de3da253 fix: avoid flamegraph script setup conflicts (#156310)
Fixes #156309

Instead of any kind of locking and busy waits leaving room for multiple script downloads to happen, while only one `rename` will succeed and others will silently fail, removing any temporary files created during this process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156310
Approved by: https://github.com/malfet

Co-authored-by: Alexander Zhipa <azzhipa@amazon.com>
2025-06-18 19:06:22 +00:00
cbafba5794 Allow forcing FSDP2 to always use SUM reductions (#155915)
NCCL zero-copy support only works for SUM reductions. FSDP2, by default, was prefering AVG reductions or, when using `set_reduce_scatter_divide_factor`, PreMulSum reductions.

Moreover, PreMulSum reductions had a few bugs, such as #155903 and #155904.

This PR adds a flag to always use SUM reductions, potentially requiring separate pre-/post-scaling kernels, and reworks the `set_reduce_scatter_divide_factor` logic to make it safer (and renaming it to avoid confusion).

Differential Revision: [D76895058](https://our.internmc.facebook.com/intern/diff/D76895058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155915
Approved by: https://github.com/xunnanxu
2025-06-18 18:57:47 +00:00
9944cd0949 Convert to markdown: quantization-accuracy-debugging.rst, quantization-backend-configuration.rst, quantization-support.rst, random.rst (#155520)
Related to #155032

-   quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html)
-  quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html)
-  quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html)
-  random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-18 18:46:04 +00:00
30d3cf62fb support CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F (#154680)
Requires CUDA >= 12.9 and sm_90.

hipBLASLt has a similar enum but is not available until ROCm 7.0. Support the new enum early using a cmake test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154680
Approved by: https://github.com/malfet, https://github.com/atalman
2025-06-18 18:39:01 +00:00
aee2bfc5ba [Intel GPU] Update xpu triton commit pin for PyTorch release 2.8. (#154194)
As title.
Thanks @anmyachev  for the work on compatibility adaptation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154194
Approved by: https://github.com/jansel
2025-06-18 18:17:07 +00:00
2620361d19 Add batching rule for torch.matrix_exp (#155202)
## Summary

Adds the missing batching rule for `torch.matrix_exp` to enable efficient `vmap` support.
Previously, using `vmap` with `matrix_exp` would trigger a performance warning and fall back to a slow loop-based implementation, even though `matrix_exp` natively supports batched inputs.

Fixes #115992

## Details

`torch.matrix_exp` is an alias for `torch.linalg.matrix_exp`. This PR adds vmap support by registering `matrix_exp` with `OP_DECOMPOSE`, which reuses the existing CompositeImplicitAutograd decomposition to automatically generate batching behavior from the operation's simpler component operations.

## Testing

The existing test suite for vmap and matrix_exp should cover this change. The fix enables:
- No performance warning when using `vmap(torch.matrix_exp)`
- Efficient native batched execution instead of loop-based fallback

**Edit:** Updated Details section to accurately reflect the implementation approach (decomposition rather than batch rule registration)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155202
Approved by: https://github.com/zou3519
2025-06-18 17:35:35 +00:00
eqy
a5f59cc2ea [cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)
The native kernel doesn't support batch splitting so the previous check wasn't aggressive enough in dispatching to cuDNN

https://github.com/pytorch/pytorch/issues/155225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156140
Approved by: https://github.com/ngimel
2025-06-18 17:32:36 +00:00
94f8679019 Revert "[PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)"
This reverts commit 6d3a4356f61b28a14abd95f641e2615deb186365.

Reverted https://github.com/pytorch/pytorch/pull/155809 on behalf of https://github.com/laithsakka due to pr_time_benchmarks ([comment](https://github.com/pytorch/pytorch/pull/155809#issuecomment-2985022572))
2025-06-18 16:52:19 +00:00
36f7a027b5 [MPS] Implement upsample_trilinear as Metal shader (#156263)
But only forward for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156263
Approved by: https://github.com/dcci
ghstack dependencies: #156256, #156090
2025-06-18 16:10:02 +00:00
bf06190e21 Integrated AMD AWS runners into Pytorch CI (#153704)
Integrated AMD AWS runners into PyTorch CI, including the linux.24xl.amd for performance tests, the linux.8xl.amd with AVX512 support for unit and periodic tests, and the linux.12xl.amd with AVX2 support for unit and periodic tests.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153704
Approved by: https://github.com/malfet, https://github.com/jithunnair-amd

Co-authored-by: kiriti-pendyala <kiriti.pendyala@amd.com>
2025-06-18 15:58:22 +00:00
ce3406817d Revert "[dynamo] control one_graph behavior additionally through config (#154283)"
This reverts commit fe37db4f1270745d6c523623143332ddf263af55.

Reverted https://github.com/pytorch/pytorch/pull/154283 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2984795214))
2025-06-18 15:53:32 +00:00
c5d3e7a4ff Revert "[dynamo] add set_fullgraph decorator/context manager (#154289)"
This reverts commit 920f6e681ec70b664ed952255b8c1f97962f5de0.

Reverted https://github.com/pytorch/pytorch/pull/154289 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154289#issuecomment-2984774814))
2025-06-18 15:51:06 +00:00
408d9884b0 Revert "[dynamo] fix set_fullgraph for nested calls (#154782)"
This reverts commit 3c8c48f79344356c58e91b9c8588f85ff806e1c8.

Reverted https://github.com/pytorch/pytorch/pull/154782 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154782#issuecomment-2984764330))
2025-06-18 15:47:21 +00:00
6201981f48 Revert "[dynamo] handle fullgraph toggle using nested torch.compile (#155166)"
This reverts commit 614a41514545cbdd15757ef2586d433d7d34041c.

Reverted https://github.com/pytorch/pytorch/pull/155166 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/15726606697/job/44333233942) [HUD commit link](a6a3a44144) ([comment](https://github.com/pytorch/pytorch/pull/155166#issuecomment-2984751600))
2025-06-18 15:43:22 +00:00
d290fe7690 Remove legacy export testing path (#156093)
Summary: After this diff stack lands, we are pretty much done with the training IR migration. So there is no need to run extensive legacy export test.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76734378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156093
Approved by: https://github.com/desertfire
2025-06-18 15:36:44 +00:00
7531bd6491 [ROCm] upgrade to 6.4.1 patch release (#156112)
Fixes #155292.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156112
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-18 15:21:44 +00:00
830a335a7d Refine alignment check along dynamic dimension for grouped MMs (#155466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155466
Approved by: https://github.com/ngimel
2025-06-18 15:15:05 +00:00
6d3a4356f6 [PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move and `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)` such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```

**Results:**
For instance, for the `res2net101_26w_4s` model in pytorch benchmark, when running with `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory are
* baseline: 7.73GiB
* with the chage: 6.45GiB

As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.

cc and credit to @ShatianWang for noticing this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
ghstack dependencies: #155943
2025-06-18 14:38:55 +00:00
c177abd217 Disable pinning check when loading sparse tensors (#154638)
Disables pinning check as unnecessary and to fix https://github.com/pytorch/pytorch/issues/153143 when loading sparse tensor from external storage with sparse tensor invariants check enabled.

Fixes https://github.com/pytorch/pytorch/issues/153143 .

For FC, to be landed two weeks after https://github.com/pytorch/pytorch/pull/154617, see https://github.com/pytorch/pytorch/pull/154617#issuecomment-2919643612.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154638
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-06-18 14:33:36 +00:00
8f02161d10 Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)"
This reverts commit a6a3a441442a96f38d0771c985f753223cea2ba0.

Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/15726606697/job/44333233942) [HUD commit link](a6a3a44144) ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2984409088))
2025-06-18 14:19:39 +00:00
b30e04b3c8 Make the NCCL PG Options and Config copyable and safe to init standalone (#155700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155700
Approved by: https://github.com/kwen2501
2025-06-18 13:36:27 +00:00
1bb9b1858b [CPU][Inductor] Improve A16W4 GEMM template performance by using block_n=32 (#156174)
**Summary**
We found that using `block_n=32` brings better performance for A16W4 than `block_n=64` because cache locality is better and parallelism is better if N is small and more cores are used.
For example, when running Llama-3.1-8B with A16W4 and batch size = 16 on 43 cores, `block_n=32` is faster by >10% E2E for both first and next token.

**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156174
Approved by: https://github.com/leslie-fang-intel
2025-06-18 13:17:46 +00:00
d99cac2816 [Kineto][submodule] Update kineto pin for XPU toggle feature (#155488)
Part of #154898
Update kineto submodule

Summary: We add the toggleCollectionDynamic functionality to XPUPTI in Kineto, so profiler can be enabled/disabled dynamically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155488
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-06-18 12:39:58 +00:00
c11888e7a6 Skip more tests on s390x (#155210)
Make CI for s390x green before fixing and restoring tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155210
Approved by: https://github.com/seemethere
2025-06-18 12:07:17 +00:00
402ae09e41 [BE] fix typos in c10/ (#156078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156078
Approved by: https://github.com/malfet, https://github.com/cyyever
2025-06-18 10:24:44 +00:00
f45f483884 [user triton] AOT Inductor support for new host-side TMA api (#155879)
This adds support for the host-side TMA api (TensorDescriptor.from_tensor) for AOTI. Note: this should support all the same features as the old (experimental) TMA api, but not some new features of the new TMA, like mxfp4 support.

Note: one complexity with the new TMA api is that a single TMA descriptor passed to the python kernel turns into 1 + 2 * N args in the cubin function signature, for a rank-N tensor.

What this PR contains:
1) device_op_overrides.py: add a rough copy of fillTMADescriptor from https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L283. However, the fillTMADescriptor implementation in Triton is significantly modified, so that much of the computation (about swizzling and data types) is done before the time of the TMA construction. For simplicity, I've moved the computation into the cuda helper kernel (as was the previous strategy with fill2DTMADescriptor); but long term we might want to unify our implementation with the upstream implementation
2) device_op_overrides.py: introduces a struct "StableTMADescriptor" which stores some of the 1 + 2 * N args for the cubin signature (along with the global shape, which is not strictly needed, but this cleans up the call to the triton kernel
3) plumbing through cpp_wrapper_gpu.py. The main thing to note is: the code generated by cpp_wrapper_gpu.py generally refers to the StableTMADescriptor object when it passes around a "tma descriptor" variable. At the very end (in generate_args_decl), the StableTMADescriptor is unwrapped and the individual arguments are passed into the cubin.

Tests: test_aot_inductor.py's test_triton_kernel_tma_descriptor_{N}d_dynamic_{D}_tma_version_{V}_cuda: for N in {1, 2}  and D in {True, False}, and V = {new, old}, this test passes (or is skipped, if the appropriate TMA API is not available). Tested on H100 for Triton 3.3 and Triton 3.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155879
Approved by: https://github.com/desertfire
2025-06-18 09:35:11 +00:00
577baa4116 [c10d] Add a logger for all nccl collectives with its time duration when completed (#156008)
Summary: We want to build a logging table for tracking the collective time spent on GPU for all internal workloads. Since we have a cudaEventQuery for both the start and end of a collective (We rolled out ECudaEventStart (enableTiming) fully already), we plan to add this logging table inside the watchdog of PyTorch ProcessGroupNCCL so that we get to know the duration of collectives.

Test Plan:
CI + dry run.

Rollback Plan:

Differential Revision: D76552340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156008
Approved by: https://github.com/fegin, https://github.com/eqy
2025-06-18 09:08:42 +00:00
c5a4fe9c17 [CI] fix the ci image name for public copy in ghcr (#156169)
After the PR #152209 landed, the name of ci image public copy in ghcr is not correct. For example, https://github.com/pytorch/pytorch/actions/runs/15698468716/job/44228133522#step:10:8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156169
Approved by: https://github.com/malfet
2025-06-18 08:16:56 +00:00
a6a3a44144 [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-18 07:27:20 +00:00
614a415145 [dynamo] handle fullgraph toggle using nested torch.compile (#155166)
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings was strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.

Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
2025-06-18 07:27:20 +00:00
3c8c48f793 [dynamo] fix set_fullgraph for nested calls (#154782)
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
2025-06-18 07:27:09 +00:00
920f6e681e [dynamo] add set_fullgraph decorator/context manager (#154289)
Implements https://github.com/pytorch/pytorch/issues/144908.

Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
2025-06-18 07:27:00 +00:00
fe37db4f12 [dynamo] control one_graph behavior additionally through config (#154283)
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-06-18 07:26:52 +00:00
ccc6279b40 flex attention: fix dispatch order for tensor subclasses, avoid hardcoding call to faketensor impl in dynamo (#151719)
This is enough to get @XilunWu 's stack in a state where his flex_attention DTensor implementations worked E2E for me. It also required these changes on the DTensor side, to properly add a DTensor rule for flex backward: P1789852198

There are two problems:

(1) in the normal dispatcher, we have a precedence ordering between modes and subclasses. Modes are dispatched to first, but modes are allowed to return NotImplemented, giving subclasses a chance to run.

This normally happens automatically in `FakeTensorMode.__torch_dispatch__` and `FunctionalTensorMode.__torch_dispatch__`. However, since HOPs implement these two modes themselves, HOPs do not get this benefit. For now, I ended up hardcoding this `NotImplemented` logic directly into the functional/fake rules for flex attention.

Having to do this for every HOP seems a bit painful. If we could plumb every HOP through `Fake[|Functional]TensorMode.__torch_dispatch__` then we would get this support. Another option could be to just assume that most HOP <> mode implementations want the same treatment by default, and hardcode this `NotImplemented` logic into `torch/_ops.py`. I'm not sure if we'd need a way for the HOP to opt out of this though.

(2) We were hardcoding a call to flex attention's fake implementation in dynamo to run fake prop. This is technically wrong for subclasses, because it doesn't give subclasses the chance to interpose on the op and desugar it before fake prop runs. I tweaked dynamo's logic to call the op, and let the dispatcher handle invoking the fake implementation.

**Testing** Xilun is adding some DTensor tests in his PR that will end up testing this logic. If folks would prefer, though, I can try to add a test that uses another subclass instead that is maybe more basic.

This is the tlparse that his DTensor test gnerated for me: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/0196c1d3-a9a2-46ea-a46d-aa21618aa060/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151719
Approved by: https://github.com/ydwu4

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-06-18 07:02:04 +00:00
bdb1553b77 [inductor][cutlass] binary remote cache (#156248)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally

this is the OSS only part of the change, to facilitate integration

Test Plan:
## prove that we can upload successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Reviewed By:
henrylhtsang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248
Approved by: https://github.com/henrylhtsang
2025-06-18 06:51:22 +00:00
96df866410 [audio hash update] update the pinned audio hash (#156259)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156259
Approved by: https://github.com/pytorchbot
2025-06-18 06:02:46 +00:00
a5df6ffbc2 Improve IPC for Expandable Segments to use fabric handle when possible (#156074)
Improve upon https://github.com/pytorch/pytorch/pull/130890 , inspired by https://github.com/pytorch/pytorch/pull/130890#issuecomment-2278882984 , we can automatically use the fabric handle for IPC when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156074
Approved by: https://github.com/ngimel, https://github.com/malfet
2025-06-18 05:22:06 +00:00
29867b211a [cutlass backend] Add __init__.py to cutlass_lib_extensions (#156234)
When using docker with cutlass backend, we can get
```
No module named 'torch._inductor.codegen.cuda.cutlass_lib_extensions'
```
First reported by @nWEIdia in https://github.com/pytorch/pytorch/issues/155888

Evidence that this fixes: https://github.com/pytorch/pytorch/pull/156136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156234
Approved by: https://github.com/mlazos, https://github.com/Skylion007
2025-06-18 05:03:43 +00:00
c28e74e457 [MPS] Add nearest_3d forward and backward (#156090)
Introduce generalizable `UpsampleParams` structure in `UpSample.h`, which could be shared between CPU and MPS
Delete `upsample_nearest3d` MPS fallback and replace it with proper shader
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156090
Approved by: https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #156256
2025-06-18 04:48:15 +00:00
a82c171bb2 remove skipifrocm from composability tests (#156036)
Porting over DTensor training codebase to rocm atm and was reading through a 2D unit tests and noticed a couple of the unit tests already work on rocm even though it is being skipped. pipeline parallel tests pass too

tested locally
<img width="561" alt="image" src="https://github.com/user-attachments/assets/7c40c0f2-2de8-4cf1-8e36-0ba2bba46baa" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156036
Approved by: https://github.com/jeffdaily
2025-06-18 04:24:42 +00:00
9ed0060225 Provide access to the cudaGraph_t underlying a CUDAGraph. (#155164)
There are a few considerations here:

1. A user might want to modify the cudaGraph_t either during the stream capture or after the stream capture (but before instantiation). This draft implements modification after stream capture only, though support could be added for modification during stream capture by applying
https://github.com/pytorch/pytorch/pull/140979/files#diff-d7302d133bb5e0890fc94de9aeea4d9d442555a3b40772c9db10edb5cf36a35cR391-R404

2. Previously, the cudaGraph_t would be destroyed before the end of capture_end() unless the user had previously called enable_debug_mode(). There is no way to implement this correctly without removing this restriction, or forcing the user to always call enable_debug_mode(). However, enable_debug_mode() is a confusing API (despite being an instance method, it would modify a static global variable; thus, putting one CUDAGraph object into debug mode puts all of them into debug mode, which is not acceptable in my opinion). Therefore, I made enable_debug_mode() into a no-op. This means that the CPU memory usage will increase after this change. I think this is likely to be fine.

3. No python bindings yet. These should be easy to add. It is probably worthwhile to take some time to make sure that the returned cudaGraph_t can be converted into the cuda-python cudaGraph_t in a reasonable, hopefully type-safe, manner (but without making cuda-python a dependency of pytorch), since I imagine most users will use the pip cuda-python package to make modifications.

4. There are two foot guns:

   a. The cudaGraph_t returned by raw_cuda_graph() is not owned by the user, so it will be destroyed once the owning CUDAGraph is destroyed (or calls reset()).

   b. The following seuquence won't work as intended:

```
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.replay()
```

This won't work because the user must call instantiate() again after modifying cudaGraph_t. You could add a "safety" mechanism by traversing the cudaGraph_t to create a hash and seeing if the hash changes between calls to replay(), but this is likely way too expensive.

I think these two foot guns are probably okay given that this a bit of an experts' API.

Fixes #155106

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155164
Approved by: https://github.com/ngimel
2025-06-18 03:39:28 +00:00
17b38b850e [ca] Allow using compiled autograd context managers during backward runtime (#156120)
Added an invariant that nested compiled autograd context managers must exit before their parent context manager. This allows us to defer the thread check.

FIXES https://github.com/pytorch/pytorch/issues/152219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156120
Approved by: https://github.com/jansel
ghstack dependencies: #155521, #155480
2025-06-18 03:01:15 +00:00
10d41c7d20 Add SDPA patterns for T5 models (#155455)
* Add SDPA patterns for T5 models.
* Remove the stride check of mask, and do contiguous for mask in flash attention when the stride of last dim != 1 & != 0. This allows more SDPAs with complex mask to be accelerated using flash attention, such as the T5 model, where the generated masks may be not continuous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155455
Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-06-18 02:09:55 +00:00
4851863e3f fix hack to check if register_buffer has been overridden (#155963)
Followup on https://github.com/pytorch/pytorch/pull/125971

`self.register_buffer` will always be a a bound method on the instance (`self`) while `torch.nn.Module.register_buffer` is an unbound class method. `is`-ing these two things will never yield `True`. Instead, lets check the [original function object](https://docs.python.org/3/reference/datamodel.html#method.__func__). Note that the current logic doesn't break anything because the `else` branch will still do the "right thing" in the case `register_buffer` hasn't been overrridden, but it does mean we do less work!

Example demonstration:

```python
class Base:
    def register_buffer(self, buffer):
        pass

class InheritedOk(Base):
    pass

class InheritedOverride(Base):
    def register_buffer(self, buffer):
        pass

b = Base()
ok = InheritedOk()
override = InheritedOverride()

print(f"b.register_buffer is Base.register_buffer: {b.register_buffer is Base.register_buffer}") # False
print(f"ok.register_buffer is Base.register_buffer: {ok.register_buffer is Base.register_buffer}") # False
print(f"override.register_buffer is Base.register_buffer: {override.register_buffer is Base.register_buffer}") # False

print(f"b.register_buffer.__func__ is Base.register_buffer: {b.register_buffer.__func__ is Base.register_buffer}") # True
print(f"ok.register_buffer.__func__ is Base.register_buffer: {ok.register_buffer.__func__ is Base.register_buffer}") # True
print(f"override.register_buffer.__func__ is Base.register_buffer: {override.register_buffer.__func__ is Base.register_buffer}") # False
```

(I can make an associated issue if needed, but didnt see it required [in the contributing guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#merging-your-change))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155963
Approved by: https://github.com/mikaylagawarecki
2025-06-18 01:50:30 +00:00
202d2ae53a Convert rst to md: rpc.rst, signal.rst, size.rst, special.rst (#155430)
Fixes #155033

- [x] [rpc.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/rpc.rst)
- [x] [signal.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/signal.rst)
- [x] [size.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/size.rst)
- [sparse.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/sparse.rst) fixed in #155438 due to large size.
- [x] [special.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/special.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155430
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-18 01:27:04 +00:00
68996dc183 [BE][2/X] Phase out usage of use_max_autotune() (#155848)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155848
Approved by: https://github.com/masnesral
2025-06-18 01:18:09 +00:00
e8bfce9a43 Document how to use stack-based APIs with StableIValue (#155984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155984
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-06-18 01:10:23 +00:00
541297daae [Build] Allow metal shaders to include ATen headers (#156256)
No-op change that will be used later to share structs between CPU and Metal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156256
Approved by: https://github.com/dcci
2025-06-18 01:03:25 +00:00
3dabc351bb [Break XPU] Fix XPU UT failures introduced by community. (#156091)
Fixes #15089, Fixes #156063, Fixes #155689, Fixes #155692, Fixes #156146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156091
Approved by: https://github.com/jansel
2025-06-17 23:43:37 +00:00
38e1e5d54c Add get_pipeline_order() for Gpipe and 1F1B (#155935)
The [schedule visualizer](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/_schedule_visualizer.py) relies on `self.pipeline_order` to be populated. The `_PipelineScheduleRuntime` also depends on this to run the IR.

The single stage schedules do not implement this so this PR adds that. Also fixes a bug in the schedule visualizer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155935
Approved by: https://github.com/wconstab
2025-06-17 23:39:17 +00:00
5435e75399 [ez] rename choice_timings -> choice_timings_fn (#156099)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156099
Approved by: https://github.com/mlazos
ghstack dependencies: #155982, #155996, #156053
2025-06-17 23:30:27 +00:00
12b02137af [MPS] Add benchmark for scan operations (#156241)
Comparison of cumsum performance before and after Metal implementaton:

Previous performance (using torch==2.7.1):
```[-------------------------------  -------------------------------]
                                              |  eager  |  compile
1 threads: -------------------------------------------------------
      cumsum-dim0-32x32 (torch.float16)       |  131.0  |   136.9
      cumsum-dim0-128x128 (torch.float16)     |  116.9  |   121.2
      cumsum-dim0-512x512 (torch.float16)     |  132.5  |   151.9
      cumsum-dim0-1024x1024 (torch.float16)   |  150.0  |   163.0
      cumsum-dim1-32x32 (torch.float16)       |  125.9  |   140.9
      cumsum-dim1-128x128 (torch.float16)     |  116.4  |   129.4
      cumsum-dim1-512x512 (torch.float16)     |  135.9  |   150.1
      cumsum-dim1-1024x1024 (torch.float16)   |  139.5  |   154.2
      cumsum-1d-100 (torch.float16)           |  119.5  |   127.1
      cumsum-1d-10000 (torch.float16)         |  128.9  |   142.5
      cumsum-1d-1000000 (torch.float16)       |  140.6  |   145.6
      cumsum-dim0-32x32 (torch.float32)       |  115.7  |   132.5
      cumsum-dim0-128x128 (torch.float32)     |  118.0  |   131.5
      cumsum-dim0-512x512 (torch.float32)     |  138.8  |   151.6
      cumsum-dim0-1024x1024 (torch.float32)   |  155.5  |   164.2
      cumsum-dim1-32x32 (torch.float32)       |  127.2  |   141.7
      cumsum-dim1-128x128 (torch.float32)     |  117.7  |   130.5
      cumsum-dim1-512x512 (torch.float32)     |  138.2  |   152.3
      cumsum-dim1-1024x1024 (torch.float32)   |  144.4  |   158.6
      cumsum-1d-100 (torch.float32)           |  118.6  |   128.0
      cumsum-1d-10000 (torch.float32)         |  125.5  |   141.5
      cumsum-1d-1000000 (torch.float32)       |  143.9  |   158.4
      cumsum-dim0-32x32 (torch.bfloat16)      |  106.6  |   137.6
      cumsum-dim0-128x128 (torch.bfloat16)    |  118.1  |   131.0
      cumsum-dim0-512x512 (torch.bfloat16)    |  140.0  |   154.3
      cumsum-dim0-1024x1024 (torch.bfloat16)  |  153.2  |   164.4
      cumsum-dim1-32x32 (torch.bfloat16)      |  127.9  |   132.6
      cumsum-dim1-128x128 (torch.bfloat16)    |  116.5  |   129.6
      cumsum-dim1-512x512 (torch.bfloat16)    |  136.5  |   151.2
      cumsum-dim1-1024x1024 (torch.bfloat16)  |  139.8  |   144.8
      cumsum-1d-100 (torch.bfloat16)          |  115.7  |   129.4
      cumsum-1d-10000 (torch.bfloat16)        |  125.0  |   143.3
      cumsum-1d-1000000 (torch.bfloat16)      |  127.8  |   143.4

Times are in microseconds (us).
```

Current performance:
```
[--------------------------------  --------------------------------]
                                              |   eager   |  compile
1 threads: ---------------------------------------------------------
      cumsum-dim0-32x32 (torch.float16)       |    107.4  |    123.8
      cumsum-dim0-128x128 (torch.float16)     |    134.2  |    145.8
      cumsum-dim0-512x512 (torch.float16)     |    207.3  |    231.6
      cumsum-dim0-1024x1024 (torch.float16)   |    318.9  |    355.3
      cumsum-dim1-32x32 (torch.float16)       |     98.0  |    114.3
      cumsum-dim1-128x128 (torch.float16)     |    110.8  |    121.6
      cumsum-dim1-512x512 (torch.float16)     |    193.0  |    209.1
      cumsum-dim1-1024x1024 (torch.float16)   |    844.7  |    870.8
      cumsum-1d-100 (torch.float16)           |    108.4  |    125.0
      cumsum-1d-10000 (torch.float16)         |    784.7  |    852.3
      cumsum-1d-1000000 (torch.float16)       |  65855.2  |  66725.9
      cumsum-dim0-32x32 (torch.float32)       |    114.7  |    115.7
      cumsum-dim0-128x128 (torch.float32)     |    139.0  |    151.6
      cumsum-dim0-512x512 (torch.float32)     |    197.3  |    208.0
      cumsum-dim0-1024x1024 (torch.float32)   |    312.7  |    332.9
      cumsum-dim1-32x32 (torch.float32)       |     92.0  |    110.8
      cumsum-dim1-128x128 (torch.float32)     |    114.2  |    125.0
      cumsum-dim1-512x512 (torch.float32)     |    186.2  |    196.1
      cumsum-dim1-1024x1024 (torch.float32)   |    752.0  |    825.0
      cumsum-1d-100 (torch.float32)           |    112.4  |    122.0
      cumsum-1d-10000 (torch.float32)         |    793.5  |    863.5
      cumsum-1d-1000000 (torch.float32)       |  66431.8  |  66040.0
      cumsum-dim0-32x32 (torch.bfloat16)      |    111.6  |    121.6
      cumsum-dim0-128x128 (torch.bfloat16)    |    139.0  |    138.4
      cumsum-dim0-512x512 (torch.bfloat16)    |    217.6  |    230.1
      cumsum-dim0-1024x1024 (torch.bfloat16)  |    305.2  |    325.6
      cumsum-dim1-32x32 (torch.bfloat16)      |    100.5  |    110.9
      cumsum-dim1-128x128 (torch.bfloat16)    |    112.8  |    125.0
      cumsum-dim1-512x512 (torch.bfloat16)    |    187.8  |    208.9
      cumsum-dim1-1024x1024 (torch.bfloat16)  |    790.9  |    864.7
      cumsum-1d-100 (torch.bfloat16)          |    111.6  |    124.6
      cumsum-1d-10000 (torch.bfloat16)        |    778.1  |    844.9
      cumsum-1d-1000000 (torch.bfloat16)      |  64654.3  |  64082.5

Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156241
Approved by: https://github.com/malfet
2025-06-17 22:30:22 +00:00
fa4f07b5b8 Revert "[Docs] Convert to markdown to fix 155032 (#155520)"
This reverts commit cd66ff80307862ef8e75520054ecd19a5eff9f7e.

Reverted https://github.com/pytorch/pytorch/pull/155520 on behalf of https://github.com/atalman due to breaks multiple test_quantization.py::TestQuantizationDocs::test_quantization_ ([comment](https://github.com/pytorch/pytorch/pull/155520#issuecomment-2981996091))
2025-06-17 22:22:50 +00:00
54998c2daa Document padding size limitations in nn.modules.padding (#134840) (#155618)
Fixes #134840

Added documentation to clarify padding size constraints for all padding modes in nn.modules.padding:

- Circular padding: size must be less than or equal to the corresponding input dimension
- Reflection padding: size must be less than the corresponding input dimension
- Replication padding: output dimensions must remain positive

These changes help prevent runtime errors when users attempt to use large padding values.

## PR Checklist
- [x] The PR title and message follow our [commit guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#commit-message-format)
- [x] The PR is made against the correct branch
- [x] The PR is labeled with `docathon`
- [x] The PR is labeled with `module: nn`
- [x] The PR is labeled with `documentation`
- [x] The PR description includes a reference to the issue being fixed
- [x] The PR includes tests if applicable
- [x] The PR includes documentation changes
- [x] The PR has been tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155618
Approved by: https://github.com/AlannaBurke, https://github.com/malfet
2025-06-17 22:16:48 +00:00
937529f0b3 Pass by const ref instead of by value in StableIValue from (#156126)
I realize I was passing stable::Tensors by value (thus making a copy every time) which is not what I want from the `from` function that converts Ts to StableIValues. `from` should not mutate the input and should be read-only.

I asked an LLM whether this is API BC breaking (with an intuition that it shouldn't be), and it said no, cuz:
1. "Passing by const reference is more permissive than passing by value. e.g., if T is a type that has a deleted or inaccessible copy constructor (e.g., std::unique_ptr), the original code would have been invalid, while the new code would be valid." Nice. We are good with additive.
2. We didn't modify the original input before (cuz we took a copy) and we don't now (cuz we promise const).

Update: The LLM failed to mention primitives, with which we should not pass references around, so we are only changing the signatures of std::optional<T> and stable::Tensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156126
Approved by: https://github.com/swolchok
ghstack dependencies: #155367, #155977
2025-06-17 22:11:30 +00:00
4c0aa37dda Support stream capture of event record and wait nodes in cuda graphs (#155372)
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.

We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().

If external=False, the cudaEventRecord and cudaStreamWaitEvent API's
have a different meaning described here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events

In short, they will be used to experess fork and join operations in
the graph if external=False.

External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.

Finishes #146145

I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
2025-06-17 21:44:51 +00:00
8e02cd9c5a Skip cache related configs for cache config serialization (#156195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156195
Approved by: https://github.com/masnesral
2025-06-17 21:24:07 +00:00
3106a33e41 [fr] Fix one error in analysis script when subPG world size is smaller than global size (#156156)
Summary: We run into an interesting case when we see so many mismatches while lot of mismatch turns out to be a fully match. The reason is that we use the dump ranks (which is from 0 to 79) to compare against the local pg ranks (0 to 7) this leads to false positive of mismatches. We can just check whether dump ranks contain all expected ranks or not, that should be sufficient.

Test Plan:
Test with the failed case with the script and we now see the correct behavior + new unit test case.

Rollback Plan:

Differential Revision: D76775373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156156
Approved by: https://github.com/VieEeEw
2025-06-17 21:17:58 +00:00
bb462a6237 [cutlass backend] Fix prescreening non-deterministic problem (#156144)
Differential Revision: [D76642615](https://our.internmc.facebook.com/intern/diff/D76642615/)

What do we expect to see when we run two identical matmul back to back? We expect to see the second one spending no time in precompilation, autotuning and prescreening.

However, the introduction of prescreening bring some non-deterministics-ness. Basically, we have
1. prescreening of first matmul chooses a set of kernels to advance to autotuning
2. autotuning re-does the autotuning of the winners, potentially changing their timings a bit
3. second prescreening results in a slightly different set of kernels
4. since not all timings are present, an autotune is re-done.

With this diff:
```
SingleProcess AUTOTUNE benchmarking takes 3.8633 seconds and 134.7364 seconds precompiling for 32 choices and 24.4472 seconds prescreening
SingleProcess AUTOTUNE benchmarking takes 0.0003 seconds and 0.0027 seconds precompiling for 32 choices and 0.0006 seconds prescreening
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156144
Approved by: https://github.com/mlazos
2025-06-17 20:39:06 +00:00
cd66ff8030 [Docs] Convert to markdown to fix 155032 (#155520)
Fix #155032

-   quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html)
-  quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html)
-  quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html)
-  quantization.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization.html) vs [main](https://docs.pytorch.org/docs/main/quantization.html)
-  random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-17 20:29:45 +00:00
50940270ae [BE][3/X] Phase out usage of use_max_autotune() (#155849)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155849
Approved by: https://github.com/masnesral
2025-06-17 20:26:29 +00:00
b020971e78 [BE] fix typos in torchgen/ (#156083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156083
Approved by: https://github.com/jingsh
ghstack dependencies: #156079, #156082
2025-06-17 19:25:50 +00:00
a69785b3ec [BE] fix typos in tools/ (#156082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156082
Approved by: https://github.com/soulitzer
ghstack dependencies: #156079
2025-06-17 19:25:50 +00:00
ccea6ddac3 [BE] fix typos in cmake/ (#156079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156079
Approved by: https://github.com/Skylion007
2025-06-17 19:25:43 +00:00
5eb5c3700b [ROCm] enable batched eigen decomposition (syevD_batched) on ROCm (#154525)
This PR implements `Batched Eigen Decomposition` (syevD_batched) on ROCm by calling rocSolver directly.
cuSolver doesn't support syevD_batched and neither does hipSolver. Direct call to rocSolver is required.

`syevD_batched` will be used on ROCm if all the following conditions are met:
- `rocSolver version >= 3.26`
- input data type is `float` or `double`
- batch size >= 2

Otherwise, non-batched `syevD` will be used on ROCm (complex data types, batch size==1,  rocSolver <3.26)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154525
Approved by: https://github.com/Mellonta
2025-06-17 19:20:36 +00:00
ec08eb8ba2 Revert "[inductor][cutlass] binary remote cache (#156106)"
This reverts commit 9a2c669425379eb264f896390b8fcd8d3f2ce959.

Reverted https://github.com/pytorch/pytorch/pull/156106 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156106#issuecomment-2981533904))
2025-06-17 19:07:49 +00:00
4a26bb8a12 [C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)
Fixes #155668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155900
Approved by: https://github.com/ngimel
2025-06-17 18:59:44 +00:00
fc177801af Enable FP8 row-wise scaled-mm for sm12x (#155991)
## Update using Cutlass 3.x (2025/06/15)

Following @alexsamardzic's advice, I tried out Cutlass 3.x API and it's impressive (rated specs is 419 TFLOPS)

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|17.56
64|4096|4096|69.63
256|4096|4096|266.57
1024|4096|4096|339.28
4096|4096|4096|388.91

This uses the same SM100 template. The only difference is
- Cluster size is fixed to `<1,1,1>` since sm120 does not have multicast feature
- ~~Tile size is fixed to `<128,128,128>` due to default kernel schedule does not support `<64,128,128>`. I will work a bit on improve perf for small M.~~ Fixed. Use `KernelTmaWarpSpecializedPingpong` when TileShape.M == 64

Perf for small M is still bad since it seems like Cutlass does not support TileShape.M < 64 for this kernel. It's possible to boost perf a bit by using TileShape `<64,64,128>`.

## Original using SM89

I tried using cutlass FP8 row-wise scaled-mm for sm89 on sm120 (5090) and it works. I guess it makes sense because sm120 matmul uses the standard sm80 PTX instructions (`cp.async`+`mma` and friends).

Simple benchmark script

```python
import torch
from torch._inductor.utils import do_bench_using_profiling

N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
    A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
    B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
    scale_A = torch.ones(M, 1).cuda()
    scale_B = torch.ones(1, N).cuda()

    out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
    out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
    torch.testing.assert_close(out, out_ref)

    latency_us = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
    tflops = (2 * M * N * K) / latency_us / 1e9
    print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```

M | N | K | TFLOPS
---|---|---|---
16 | 4096 | 4096 | 25.73 TFLOPS
64 | 4096 | 4096 | 71.84 TFLOPS
256 | 4096 | 4096 | 86.40 TFLOPS
1024 | 4096 | 4096 | 112.12 TFLOPS
4096 | 4096 | 4096 | 121.24 TFLOPS

Accodring to [RTX Blackwell Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf), FP8 MMA with FP32 accumulate is 419 TFLOPS. So the result is quite bad here...

However, if I change `ThreadblockSwizzle` to `cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>`

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|27.13 TFLOPS
64|4096|4096|84.84 TFLOPS
256|4096|4096|96.75 TFLOPS
1024|4096|4096|110.21 TFLOPS
4096|4096|4096|122.98 TFLOPS

Small M slightly improves, but large M is still bad.

If I further change `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3` for M>256, which is taken from [cutlass example 58](https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu), I get the following results

 M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|313.28
4096|4096|4096|376.73

Which is much closer to hardware limit. And it also means this kernel is sufficient to get the most perf out of sm120. Only need better tuned configs.

To make sure this high perf is only obtainable with `GemmIdentityThreadblockSwizzle<1>` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`, I also try using `ThreadblockSwizzleStreamK` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`

 M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|144.03
4096|4096|4096|156.86

A bit better than current configs, but still very far away from hardware limit.

@alexsamardzic I noticed you chose this configs in #149978. Do you have any numbers how the current configs perform on sm89?

Update: Using triton codegen-ed from inductor `compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs")`

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|25.60
64|4096|4096|71.74
256|4096|4096|161.64
1024|4096|4096|185.89
4096|4096|4096|215.53

Better than default configs, but still far away from the config above for compute-bound

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155991
Approved by: https://github.com/drisspg, https://github.com/eqy
2025-06-17 18:52:44 +00:00
e323d46b61 ELU: compute ELU(0) with the cheaper definition (#155765)
Both halves of the ELU definition yield 0 when evaluated at 0. Let's choose the half that doesn't require expm1. (I have no particular evidence that the input is often 0 in any case, but this seems like a free win.)

Differential Revision: [D76481038](https://our.internmc.facebook.com/intern/diff/D76481038/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155765
Approved by: https://github.com/ezyang
2025-06-17 18:20:22 +00:00
8b0e0e4f23 [dynamo] Support tracing of functools.lru_cached method (#156125)
Fixes https://github.com/pytorch/pytorch/issues/155841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156125
Approved by: https://github.com/williamwen42
2025-06-17 18:11:32 +00:00
fc5ae12293 Fix issue with right-nav (#156119)
Enable on page right nav. For autosummary, we need to set `"show_toc_level": 2` so that navigation is enabled. Example:
* Main: https://docs.pytorch.org/docs/main/special.html - right nav (under On this page) is empty.
* Preview: https://docs-preview.pytorch.org/pytorch/pytorch/156119/special.html - right nav (under On this page) has a all the object listed
<img width="1125" alt="Screenshot 2025-06-16 at 2 48 16 PM" src="https://github.com/user-attachments/assets/0790bb72-5997-4542-9847-0a89be4598c0" />
vs
<img width="1030" alt="Screenshot 2025-06-16 at 2 48 55 PM" src="https://github.com/user-attachments/assets/4897c49c-044d-4bea-a8cd-490c90cca2b0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156119
Approved by: https://github.com/albanD
2025-06-17 18:09:51 +00:00
32c1611263 [CI][run_test] Fix rerun logic for failing at exit (#155853)
Sometimes a test file reports success according to pytest, but fails afterwards, and the rerun logic doesn't handle that correctly.

The name of the last run test is saved in order to do more efficient reruns (target the last run test for a rerun without rerunning the entire file).  This usually correct, ex test fails and pytest catches it -> lastrun = the test that failed, test segfaults (pytest doesn't catch) -> lastrun is the test that segfaulted.  But sometimes pytest reports a success, but the process has non zero exit code.  The two cases I know of are hangs and double freeing at exit.  In this case, its unclear which test caused the failure, so lastrun is set to be the first test that ran in that session, so that during the next session it will start from the beginning in an attempt to replicate the error (an alternate solution would be to just fail and not rerun, which might be the better option).  But then it reruns with runsingle, which prevents lastrun from being reset (not sure why, I'm pretty sure there's no difference between resetting and not normally), so lastrun becomes the last test that ran, and its not always true that lastrun is the one that caused it. Then on the next run, it starts from the last test and the process now exits cleanly

Short term solution here: ensure the lastrun is always set to the initial value if the session succeeds.  This is correct even in the normal path because initial value shouldn't change in that case

Things that still need to be fixed:
* log says "running single test" which is not true
* no xml reports get generated here
* also no xml reports get generated on segfault
* docs for this

I think I have a PR that fixes the above but its old so I need to take another look

Testing:
This from when I was based on a commit that had a hang for macs, and before I added the skips in inductor array ref:
cc862d2c14

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155853
Approved by: https://github.com/malfet
2025-06-17 17:51:40 +00:00
6629eaf0c6 [CMAKE] Fix torch_cpu relink logic if metal shaders are recompiled (#156193)
Beforehand, shader recompilation updated `caffe2/aten/src/ATen/metallib_dummy.cpp` but `torch_cpu` were dependent on `aten/src/ATen/metallib_dummy.cpp`

Test plan: Run `python3 ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal` and observe that torch_cpu is being relinked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156193
Approved by: https://github.com/manuelcandales
2025-06-17 17:49:33 +00:00
a4ea242edc [MPS] Implement scan metal kernels (#156100)
Implements metal kernels for scan operations:
- Migrates cumsum and cumprod from MPSGraph implementation to Metal.
- Fixes #154881
- Adds MPS backend support for cummin and cummax

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156100
Approved by: https://github.com/malfet
2025-06-17 17:44:22 +00:00
9a5c59368d Replace all RAIIATH with Tensor in libtorch_agnostic test, test some APIs (#155977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155977
Approved by: https://github.com/albanD
ghstack dependencies: #155367
2025-06-17 17:36:31 +00:00
b115a4c03a torch::stable::Tensor beginnings, mainly mem mgmt (#155367)
```
// The torch::stable::Tensor class is a highlevel C++ header-only wrapper around
// the C shim Tensor APIs. We've modeled this class after TensorBase, as custom
// op kernels only really need to interact with Tensor metadata (think sizes,
// strides, device, dtype). Other functions on Tensor (like empty_like) should
// live like the ATen op that they are and exist outside of this struct.
//
// There are several goals of this class over AtenTensorHandle and
// RAIIAtenTensorHandle:
// 1. torch::stable::Tensor is a nicer UX much closer to torch::Tensor than the
//    C APIs with AtenTensorHandle. Under the hood we still call to these C shim
//    APIs to preserve stability.
// 2. RAIIAtenTensorHandle boils down to a uniq_ptr that forces the user to pass
//    around ownership. This makes it difficult to pass one input into 2
//    different functions, e.g., doing something like c = a(t) + b(t) for
//    stable::Tensor t. Thus, we use a shared_ptr here.
```

This PR:
- exemplifies the above
- adds test cases in libtorch_agnostic to make sure the file actually works
- includes the results of a battle with template specialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155367
Approved by: https://github.com/albanD
2025-06-17 17:36:31 +00:00
2625c70aec Update CODEOWNERS (#156182)
as title says. removing me as codeowner for cpp extensions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156182
Approved by: https://github.com/albanD
2025-06-17 17:15:41 +00:00
a24afbff3f Support torch.cuda.*Tensor in Dynamo (#156107)
Summary:
This PR adds support for torch.cuda.FloatTensor and friends in Dynamo.
These are indeed legacy APIs, but that doesn't stop us from adding
support for them in torch.compile.

I add support for these in the same way that we support torch.Tensor:
these APIs can be safely put into the Dynamo graph.

Fixes #130722

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156107
Approved by: https://github.com/williamwen42
2025-06-17 16:31:10 +00:00
9a2c669425 [inductor][cutlass] binary remote cache (#156106)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally

Test Plan:
## prove that we can upload successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Rollback Plan:

Reviewed By: henrylhtsang

Differential Revision: D76454741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156106
Approved by: https://github.com/henrylhtsang

Co-authored-by: atalman <atalman@fb.com>
2025-06-17 16:24:10 +00:00
d66b4bcc3f [inductor][triton pin] Support triton builtins after #7054 (#156031)
Triton's PR 7054 modifies the builtins to take _semantic as a kwarg instead of _builder.

To handle this, this PR checks the signature of tl.core.view (to see if it takes _builder or _semantic), and adds a wrapper converting _semantic to _builder if the new _semantic kwarg is being used.

(Previously-)failing test: `python test/inductor/test_cooperative_reductions.py -k test_welford_non_power_of_2_rsplit_persistent_True_x_9_r_8000_rsplit_37`

Differential Revision: [D76801240](https://our.internmc.facebook.com/intern/diff/D76801240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156031
Approved by: https://github.com/NikhilAPatel
2025-06-17 16:09:55 +00:00
d083841c0e Fix a small sphinx markup error (#156061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156061
Approved by: https://github.com/colesbury
2025-06-17 15:36:02 +00:00
0079c80b35 [CI] Do not constrain memory for ROCm testing in CI (#156115)
Fixes ROCm OOMs introduced by https://github.com/pytorch/pytorch/pull/155631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156115
Approved by: https://github.com/jeffdaily
2025-06-17 15:30:36 +00:00
7fcad0231c [Docs] Convert to markdown to fix 155025 (#155789)
Related to #155025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155789
Approved by: https://github.com/svekars
2025-06-17 15:08:14 +00:00
4886ba64dc [BE] Refactor functions from optional_submodules (#155954)
And use `pathlib.Path` instead of `os.path`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155954
Approved by: https://github.com/Skylion007
ghstack dependencies: #155947
2025-06-17 14:41:52 +00:00
cf90c9f8d1 [Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/ngimel, https://github.com/cyyever

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-06-17 14:15:49 +00:00
42015db6a9 [BE] fix typos in benchmarks/ (#156077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156077
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #156069
2025-06-17 13:12:18 +00:00
0a0023d984 Enable NCCL zero-copy (user buffer registration) for FSDP2 (#150564)
In recent versions NCCL introduced support for "user buffer registration", i.e., allowing user-owned memory (such as regular PyTorch tensors) to be "registered" (pinned, page-locked, etc.) with all the various hardware (NVLink, InfiniBand, ...) in order to support zero-copy transfers and thus accelerate communication and reduce resource footprint of NCCL's kernels (which reduces contention).

This was already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses this, via a memory pool to allow caching and reusing.

FSDP2 is also particularly suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself, since it anyways needs to (de)interleave the parameters to/from these private buffers.

This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150564
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/syed-ahmed
2025-06-17 12:54:58 +00:00
11bb1ece50 [CI] Fix triton version split issue (#155670)
Fix a bug caused by #155313, refer https://github.com/pytorch/pytorch/actions/runs/15576592378/job/43862613039?pr=154194#step:7:652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155670
Approved by: https://github.com/atalman, https://github.com/EikanWang
2025-06-17 12:42:40 +00:00
1cce73b5f4 [build] Change --cmake{,-only} arguments to envvars to support modern Python build frontend (#156045)
See also:

- #156029
- #156027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156045
Approved by: https://github.com/ezyang
ghstack dependencies: #156040, #156041
2025-06-17 11:40:24 +00:00
57084ca846 [BE][setup] allow passing pytorch-specific setup.py options from envvars (#156041)
See also:

- #156029
- #156027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156041
Approved by: https://github.com/ezyang
ghstack dependencies: #156040
2025-06-17 11:40:24 +00:00
092aed1b18 [Intel GPU] Enable GQA and different head_dim of value for SDPA (#150992)
In OneDNN v3.7, SDPA doesn't support num_head_q != num_head_kv (aka GQA) and head_dim_qk != head_dim_v.
In OneDNN v3.8, SDPA supports these two scenarios. Enable them in this PR.   SDPA UTs pass in local test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150992
Approved by: https://github.com/guangyey, https://github.com/drisspg, https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-17 11:09:51 +00:00
4a8f5e752b [FSDP2] explain user contract for fully_shard (#156070)
<img width="896" alt="Screenshot 2025-06-16 at 1 36 00 AM" src="https://github.com/user-attachments/assets/7cdea256-2454-49c7-8b32-24549a13134d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156070
Approved by: https://github.com/mori360
2025-06-17 10:03:19 +00:00
8d7ee0f4fb [BE] fix typos in .ci/, .circleci/, .github/ (#156069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156069
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-17 09:43:14 +00:00
2e0e08588e [BE][PYFMT] migrate PYFMT for torch/[e-n]*/ to ruff format (#144553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144553
Approved by: https://github.com/ezyang
ghstack dependencies: #144551
2025-06-17 08:18:47 +00:00
cyy
95cb42c45d Use CMAKE_COLOR_DIAGNOSTICS (#154583)
`CMAKE_COLOR_DIAGNOSTICS` was introduced in CMake 2.24. Use it to simplify CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154583
Approved by: https://github.com/ezyang
2025-06-17 04:52:31 +00:00
cyy
d43c0bdf46 [CI] Move ASAN jobs to clang-18 (#149099)
Use clang-18 for ASAN jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149099
Approved by: https://github.com/ezyang
2025-06-17 04:51:07 +00:00
7b0118884e [invoke_subgraph][inductor] Dont fallback on complex dtype (#155885)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155885
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #155828
2025-06-17 04:47:12 +00:00
ffcc6fea7b [invoke_subgraph] Ignore redundantly nested invoke_subgraph (#155828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155828
Approved by: https://github.com/zou3519
2025-06-17 04:47:12 +00:00
b1713c6655 [MPS][Testing][BE] Fix samples for full_like (#156026)
Now that device is known, one can avoid creating tensors of `torch.double` type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156026
Approved by: https://github.com/dcci
ghstack dependencies: #156121
2025-06-17 04:46:26 +00:00
82672206b7 [SymmMem] Make get_rank_to_global_rank return const ref (#156117)
Avoiding a copy, not expecting a caller to change its value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156117
Approved by: https://github.com/fegin
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116
2025-06-17 04:13:18 +00:00
eea3bcb3d1 [SymmMem] Cache rank_to_global_rank exchange (#156116)
The rank-to-global-rank exchange is a major overhead in `NVSHMEMSymmetricMemory` creation.
We should cache its result on per-group basis.

Before this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 18
```

After this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156116
Approved by: https://github.com/fegin, https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971, #155975
2025-06-17 04:12:37 +00:00
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
45382b284d [cutlass backend] changes how gpu_kernels_o are handled for cutlass (#155875)
Currently, we do it a bit hacky: Look at all the .o we have from this session, add them all to AOTI. This for example doesn't work if we do multiple AOTI compilation in one session, without clearing the inductor cache.

Also I want to change how cutlass .so are compiled. Hence this change.

This change is broken down since @coconutruben is trying to make a change to the same files too.

Differential Revision: [D76563003](https://our.internmc.facebook.com/intern/diff/D76563003/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155875
Approved by: https://github.com/ColinPeppler
2025-06-17 02:06:54 +00:00
cyy
64bb6317a5 [Accelerator] Fix Python typing in accelerator (#152394)
There are some changes:
1. Use keywords for arguments if possible.
2. `__exit__ ` of `device_index` is changed to return None.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152394
Approved by: https://github.com/XuehaiPan, https://github.com/guangyey, https://github.com/ezyang

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-17 01:27:40 +00:00
1f0eb79e3e [dynamo] fix KeyError in LOAD_FAST_CHECK (#155763)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155763
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #155761
2025-06-17 00:54:16 +00:00
4e833c2005 [dynamo] support tracing weakref callback (#155761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155761
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-17 00:54:16 +00:00
e6252f62ef [ONNX] Implements converter for higher order ops scan (#154513)
Fixes #151327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154513
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-06-17 00:54:07 +00:00
b618817479 [PGO] include ints/floats in suggested whitelist (#155980)
Made the mistake of dropping these

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155980
Approved by: https://github.com/bobrenjc93
2025-06-17 00:41:38 +00:00
4311aea5e7 [AOTInductor] Add class declarations to torch._C._aoti interface file (#155128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155128
Approved by: https://github.com/desertfire
ghstack dependencies: #155149
2025-06-17 00:10:57 +00:00
82fb904140 Add warning for incorrected grad results at world size 1 (#154928)
Add warning for the issue discussed at https://github.com/pytorch/pytorch/issues/144045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154928
Approved by: https://github.com/weifengpy
2025-06-17 00:08:04 +00:00
eb4cf59ecd Add FSDP2 logging (#155826)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155826
Approved by: https://github.com/weifengpy
2025-06-16 23:49:58 +00:00
6e2992a998 Remove unused Azure pipeline trigger script (#156134)
## Summary
- delete `.circleci/scripts/trigger_azure_pipeline.py`

## Testing
- `python3 -m pip install flake8`
- `python3 -m flake8 .circleci/scripts`

------
https://chatgpt.com/codex/tasks/task_e_6850a55f530c83279036800308fb6871
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156134
Approved by: https://github.com/izaitsevfb
2025-06-16 23:42:52 +00:00
4781b0ee60 [SymmMem] Add NVSHMEM GET support to Triton (#155890)
Adds NVSHMEM GET operation support for Triton kernels:

- Add `getmem_block` core.extern wrapper for nvshmemx_getmem_block
- Add basic `test_triton_get` for 2-rank GET operation
- Add `test_triton_get_ring` for ring topology GET across arbitrary ranks

**Tests:**
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py`

`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_get`

```python
@skipIfRocm
@requires_triton()
def test_triton_get(self) -> None:
   @triton.jit
   def get_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
       nvshmem.getmem_block(dst_ptr, src_ptr, numel, peer)

   # ... setup code ...

   val = 7
   inp = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(
       val if rank == 0 else -1
   )
   out = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(-1)

   peer = 1 - rank
   if rank == 1:
       # Rank 1 gets data from rank 0
       get_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] got data from peer {peer}")
```

```

[Rank 0] inp buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:0', dtype=torch.int8)
[Rank 1] inp buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:1', dtype=torch.int8)
...
[Rank 1] out buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:1', dtype=torch.int8)
...
[Rank 1] got data from peer 0

----------------------------------------------------------------------
Ran 2 tests in 17.046s

OK
```

```python
@skipIfRocm
@requires_triton()
def test_triton_get_ring(self) -> None:
   @triton.jit
   def get_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
       nvshmem.getmem_block(dst_ptr, src_ptr, numel, peer)

   # ... setup code ...

   # Ring topology: each rank gets data from the rank to its left
   peer = (rank - 1) % world_size

   # All ranks execute the get operation
   get_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] got data from peer {peer}")

```

```
Output (8 GPUs):

[Rank 0] inp buffer: tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0', dtype=torch.int8)
[Rank 2] inp buffer: tensor([2, 2, 2, 2, 2, 2, 2, 2], device='cuda:2', dtype=torch.int8)
[Rank 5] inp buffer: tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:5', dtype=torch.int8)
[Rank 6] inp buffer: tensor([6, 6, 6, 6, 6, 6, 6, 6], device='cuda:6', dtype=torch.int8)
[Rank 3] inp buffer: tensor([3, 3, 3, 3, 3, 3, 3, 3], device='cuda:3', dtype=torch.int8)
[Rank 1] inp buffer: tensor([1, 1, 1, 1, 1, 1, 1, 1], device='cuda:1', dtype=torch.int8)
[Rank 2] out buffer: tensor([1, 1, 1, 1, 1, 1, 1, 1], device='cuda:2', dtype=torch.int8)
[Rank 5] out buffer: tensor([4, 4, 4, 4, 4, 4, 4, 4], device='cuda:5', dtype=torch.int8)
[Rank 0] out buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:0', dtype=torch.int8)
[Rank 2] got data from peer 1
[Rank 5] got data from peer 4
[Rank 0] got data from peer 7
[Rank 7] inp buffer: tensor([7, 7, 7, 7, 7, 7, 7, 7], device='cuda:7', dtype=torch.int8)
[Rank 6] out buffer: tensor([5, 5, 5, 5, 5, 5, 5, 5], device='cuda:6', dtype=torch.int8)
[Rank 3] out buffer: tensor([2, 2, 2, 2, 2, 2, 2, 2], device='cuda:3', dtype=torch.int8)
[Rank 6] got data from peer 5
[Rank 3] got data from peer 2
[Rank 1] out buffer: tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 4] inp buffer: tensor([4, 4, 4, 4, 4, 4, 4, 4], device='cuda:4', dtype=torch.int8)
[Rank 7] out buffer: tensor([6, 6, 6, 6, 6, 6, 6, 6], device='cuda:7', dtype=torch.int8)
[Rank 7] got data from peer 6
[Rank 4] out buffer: tensor([3, 3, 3, 3, 3, 3, 3, 3], device='cuda:4', dtype=torch.int8)
[Rank 4] got data from peer 3

----------------------------------------------------------------------
Ran 1 test in 41.045s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155890
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
2025-06-16 23:18:15 +00:00
bb1f3d1a55 [MPSInductor] Improve _default dtype inference (#156121)
By just adding 'mps' as one of the backend options and fixing reduction op to actually return tuple of CSEVariable's rather than tuple of strings

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156121
Approved by: https://github.com/dcci
2025-06-16 23:11:53 +00:00
508cdc4fc9 [BE][4/X] Phase out usage of use_max_autotune() (#155850)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155850
Approved by: https://github.com/masnesral
2025-06-16 23:10:26 +00:00
f2d70898c6 [nativert] Move OpKernel to PyTorch core (#156011)
Summary:
Moves OpKernel base class to PyTorch core. It is an abstract interface representing a kernel, which is responsible for executing a single Node in the graph.

Torch Native Runtime RFC: pytorch/rfcs#72

Test Plan:
buck2 run mode/dev-nosan caffe2/test/cpp/nativert:op_kernel_test

Rollback Plan:

Differential Revision: D76525939

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156011
Approved by: https://github.com/zhxchen17
2025-06-16 22:53:10 +00:00
35ecd7c2d4 Revert "[Cutlass] Fix buffer missing issues (#155897)"
This reverts commit 9bd42c15707a4b410ee005d5916e882a7db432bb.

Reverted https://github.com/pytorch/pytorch/pull/155897 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/155897#issuecomment-2978391416))
2025-06-16 22:44:39 +00:00
190f76fa31 Revert "Implement guard collectives (#155558)"
This reverts commit 5a5a05a6a3be376130848e235df73b752eef0230.

Reverted https://github.com/pytorch/pytorch/pull/155558 on behalf of https://github.com/malfet due to Hmm, may be I'm looking at the wrong metric, but c92f1075aa/1 shows that test started to pass after PR were reverted ([comment](https://github.com/pytorch/pytorch/pull/155558#issuecomment-2978337152))
2025-06-16 22:26:52 +00:00
c92f1075aa Fix if condition for CUDA 12.9 Win build (#156108)
follow-up for https://github.com/pytorch/pytorch/pull/155799/files
Currently the last if condition will be executed for CUDA 12.9, overriding previous CUDA_ARCH_LIST. We should exclude 12.9 from the last if condition to fix this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156108
Approved by: https://github.com/atalman
2025-06-16 21:57:34 +00:00
cce4d213a6 Remove non-header-only dep from c10_headers target (#155858)
It depends on /c10/util:base which is not header-only.

Differential Revision: [D76552750](https://our.internmc.facebook.com/intern/diff/D76552750/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D76552750/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155858
Approved by: https://github.com/ezyang
2025-06-16 21:41:25 +00:00
a24ce67dee [ez] fix grammar error in comment (#156053)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156053
Approved by: https://github.com/jingsh
ghstack dependencies: #155982, #155996
2025-06-16 20:53:07 +00:00
247113e03e Add size_hint_or_throw (#155615)
## Summary
`TypeError("Cannot convert symbols to int")` is coming up more recently since more unbacked symints are making its way into Inductor. See https://github.com/pytorch/pytorch/issues/155484
- One way to deal with this is to add `size_hint_or_throw` to throw if we try to pull a hint from an unbacked expr.
- Then, repurpose `size_hint` to accommodate unbacked symints by setting a default fallback or adding an appropriate fallback for each callsite.

This PR adds `size_hint_or_throw` which will throw if unbacked symints exist
- use `size_hint_or_throw` -- usually when the callee can try/catch the exception or guards against unbacked symints

------
with Codex
https://chatgpt.com/codex/tasks/task_e_684869dfc740832882c88d05534cc8f9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155615
Approved by: https://github.com/ezyang, https://github.com/laithsakka, https://github.com/jingsh

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-06-16 20:46:51 +00:00
008345be9d Fix #155018 (convert distributed rst to markdown) (#155528)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155018

Docs comparison (check out the 'new' whenever docs build)

1. distributed.checkpoint ([old](https://docs.pytorch.org/docs/main/distributed.checkpoint.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.checkpoint.html))
2. distributed.elastic ([old](https://docs.pytorch.org/docs/main/distributed.elastic.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.elastic.html))
3. distributed.fsdp.fully_shard ([old](https://docs.pytorch.org/docs/main/distributed.fsdp.fully_shard.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.fsdp.fully_shard.html))
4. distributed.optim ([old](https://docs.pytorch.org/docs/main/distributed.optim.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.optim.html))
5. distributed.pipelining ([old](https://docs.pytorch.org/docs/main/distributed.pipelining.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.pipelining.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155528
Approved by: https://github.com/wz337, https://github.com/svekars
2025-06-16 20:46:09 +00:00
eb2af14f8e [PT2][partitioners] Add aten.split to view_ops list [relanding #155424] (#155943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155943
Approved by: https://github.com/ShatianWang
2025-06-16 20:42:54 +00:00
03488d820c Revert "[MPS][Testing][BE] Fix samples for full_like (#156026)"
This reverts commit 2d832c9587fd99db295b62d0c9b459d509c19d06.

Reverted https://github.com/pytorch/pytorch/pull/156026 on behalf of https://github.com/atalman due to Sorry breaks MPS tests: test_ops.py::TestMathBitsCPU::test_neg_view_full_like_cpu_float64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15683608879/job/44182730620) [HUD commit link](2d832c9587) ([comment](https://github.com/pytorch/pytorch/pull/156026#issuecomment-2977903074))
2025-06-16 19:50:26 +00:00
6d2155db49 [PGO] no code state update on dynamic=False (#155961)
Summary:
When tensor size changes are detected on `dynamic=False`, overwrites the PGO state with the newest static shapes to reflect the latest frame state, instead of updating automatic dynamic.

A longer term solution, if we move to shared PGO state between multiple jobs, would be to update automatic dynamic, but avoid suggesting/logging the whitelist (compiling with `dynamic=False` should already override any dynamic PGO that's read, so we're fine there). This way if any particular job runs with `dynamic=False`, it won't statically overwrite the entire PGO state if it's shared with many other jobs.

Test Plan:
test/dynamo/test_pgo.py

Rollback Plan:

Differential Revisi,on: D76630499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155961
Approved by: https://github.com/bobrenjc93
2025-06-16 19:47:55 +00:00
5a5a05a6a3 Implement guard collectives (#155558)
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:

1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires recompile, force everyone to recompile

One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.

I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
2025-06-16 19:46:16 +00:00
61b271e0f3 Revert "Implement guard collectives (#155558)"
This reverts commit 38e5e81e55fc5d85d6cf8a83c96c88578995e3fe.

Reverted https://github.com/pytorch/pytorch/pull/155558 on behalf of https://github.com/atalman due to Breaks CI, sorry: [GH job link](https://github.com/pytorch/pytorch/actions/runs/15683161593/job/44181274826) [HUD commit link](38e5e81e55) ([comment](https://github.com/pytorch/pytorch/pull/155558#issuecomment-2977871178))
2025-06-16 19:40:46 +00:00
7cf38d2a05 Make benchmark by op for TS model work with sample inputs (#155988)
Summary: Add pickle input type to allow for running ptvsc2_predictor_bench to get individual node benchmarks for SR

Test Plan:
```
buck2 run mode/opt caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- --scripted_model=/data/users/georgiaphillips/models/742055223/1/742055223_1.predictor.local --pt_inputs=/data/users/georgiaphillips/models/742055223/0/mix.pt --pt_enable_static_runtime=1 --compare_results=0 --iters=1000 --warmup_iters=100 --num_threads=1 --do_profile=1 --method_name=${MODULE_NAME}.forward --set_compatibility --do_benchmark=1 --pytorch_predictor_default_model_id=${MODEL_ENTITY_ID}_${SNAPSHOT_ID} --input_type=pickle
```

Rollback Plan:

Reviewed By: dolpm

Differential Revision: D76554920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155988
Approved by: https://github.com/dolpm
2025-06-16 19:15:07 +00:00
2dc1627451 [doc] Add documentation for division by zero behavior in autograd (#155987)
Fixes #128796

This PR adds documentation about the behavior of division by zero operations in PyTorch's autograd system. The documentation explains:

1. How division by zero produces `inf` values following IEEE-754 floating point arithmetic
2. How autograd handles these cases and why masking after division can lead to `nan` gradients
3. Provides concrete examples showing the issue
4. Recommends two solutions:
   - Masking before division
   - Using MaskedTensor (experimental API)

The documentation is added to the autograd notes section, making it easily discoverable for users who encounter this common issue.

This addresses the original issue #128796 which requested better documentation of this behavior to help users avoid common pitfalls when dealing with division by zero in their models.

dditional changes:
- Fixed formatting consistency by replacing curly apostrophes with straight apostrophes in the existing documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155987
Approved by: https://github.com/soulitzer

Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
2025-06-16 19:02:12 +00:00
907d0931cc [ca] default on in CI, with fallback for tests in test/compiled_autograd_skips/ (#155480)
For every test that is ran with PYTORCH_TEST_WITH_DYNAMO=1, turn on compiled autograd via config if it is not skipped
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155480
Approved by: https://github.com/jansel
ghstack dependencies: #155521
2025-06-16 18:45:03 +00:00
9ff9c28fe8 [ca] Functionalize AccumulateGrad (#155521)
This PR changes compiled autograd's handling of gradient accumulation, by proxying it as a `call_accumulate_grad`, which does the .grad mutation in python bytecode for dynamo to see. For eager, the only change is the leaf invariant check was moved up.

Before:
- Compiled Autograd Engine: proxies call to inductor accumulate_grad op
- Dynamo: polyfills the inductor accumulate_grad op (not respecting all of the accumulateGrad implementation e.g. sparse, gradient layout contract)
```python
        new_grad_strided: "f32[s21]" = torch.empty_like(getitem_1);  getitem_1 = None
        copy_: "f32[s21]" = new_grad_strided.copy_(aot3_tangents_1);  copy_ = None
```
- AOTAutograd: functionalizes the copy_

After:
- Compiled Autograd Engine: proxies call to `call_accumulate_grad`, which calls `torch._dynamo.compiled_autograd.ops.AccumulateGrad`/`AccumulateGrad_apply_functional_no_hooks_ivalue`, similar to other functional autograd implementations, but also sets .grad from python. Hooks are still handled separately from this call.
- Dynamo: `torch._dynamo.compiled_autograd.ops.AccumulateGrad` was allow_in_graph'd
- AOTAutograd: traces into the op, with FunctionalTensors.

While functionalizing the tensors, we insert an autograd Error node to ensure that we don't use the autograd meta from tracing. This clashes with the "leaf variable has been moved into the graph interior" error check, I could not find a way to identify a FunctionalTensor subclass from C++, so I bypass that for Error nodes in the compiled case.

In the CI PR, this fixes 19 tests relating to sparse tensors, and more are hidden by an earlier failure in dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155521
Approved by: https://github.com/jansel
2025-06-16 18:45:02 +00:00
42ff6a4a5c [Inductor] Delay codegen for fallback arguments and improve typing (#154371)
Delays code generation for arguments to fallback ops.  This is inspired by #155642, and likely fixes similar memory leaks.

Additionally, prepare for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:

1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
2025-06-16 18:00:04 +00:00
4162c0f702 [BE][setup] gracefully handle envvars representing a boolean in setup.py (#156040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156040
Approved by: https://github.com/malfet
2025-06-16 17:56:31 +00:00
f48a157660 [aoti] Add more to error message (#155974)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155974
Approved by: https://github.com/yushangdi
2025-06-16 17:49:52 +00:00
fbd88ae2b5 Convert to markdown: checkpoint.rst (#156009)
Related to #155014

Use two commits to have a try.
```bash
 1800  git mv docs/source/checkpoint.rst docs/source/checkpoint.md
 1802  git commit -m "[Docs] Rename checkpoint.rst"
 1803  git push origin ckpoint

# update the markdown file
 1805  git add .
 1806  git commit -m "modify checkpoint.md"
 1807  git push origin ckpoint
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156009
Approved by: https://github.com/svekars
2025-06-16 17:48:23 +00:00
a10024d7de Convert complex_numbers.rst to markdown (#156039)
Related to #155014

Have a try by following https://github.com/pytorch/pytorch/pull/155899#issuecomment-2974715750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156039
Approved by: https://github.com/svekars
2025-06-16 17:24:37 +00:00
e9fdaf8701 Revert "[Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)"
This reverts commit e375d21bb9b0ef6fefe7a8af5a054a17de8c63c9.

Reverted https://github.com/pytorch/pytorch/pull/155109 on behalf of https://github.com/malfet due to Looks like it broke ROCM tests ([comment](https://github.com/pytorch/pytorch/pull/155109#issuecomment-2977428354))
2025-06-16 17:22:55 +00:00
45596ec58f Delete tools/onnx/update_default_opset_version.py (#156055)
The tool is no longer relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156055
Approved by: https://github.com/titaiwangms
2025-06-16 17:21:36 +00:00
365ce465f3 Revert "[C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)"
This reverts commit 8142a0286016e63a0e91b5667e1fb1a5e868ffd7.

Reverted https://github.com/pytorch/pytorch/pull/155900 on behalf of https://github.com/clee2000 due to causing some sort of hang? in test_distributed_spawn [GH job link](https://github.com/pytorch/pytorch/actions/runs/15678895788/job/44168117193) [HUD commit link](8142a02860) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/155900#issuecomment-2977365699))
2025-06-16 16:59:25 +00:00
2a4e357192 Fix compilation warning with gcc14 (#155934)
Note that nccl still doesn't work so you have to build with `USE_NCCL=0` @eqy is that something being tracked there?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155934
Approved by: https://github.com/malfet, https://github.com/janeyx99
2025-06-16 16:43:15 +00:00
503362d019 Revert "Unify dynamic shapes APIs naming 2 (expect_true and check) (#155776)"
This reverts commit 603a54a9b33e1aabe1407721d7935b881a160968.

Reverted https://github.com/pytorch/pytorch/pull/155776 on behalf of https://github.com/atalman due to failing internal build ([comment](https://github.com/pytorch/pytorch/pull/155776#issuecomment-2977041192))
2025-06-16 15:13:53 +00:00
b8d96c3f78 Revert "[cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)"
This reverts commit 47c8810b5275179833d6b33ca3d70922f485272c.

Reverted https://github.com/pytorch/pytorch/pull/154170 on behalf of https://github.com/atalman due to failing torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/154170#issuecomment-2976990461))
2025-06-16 14:59:01 +00:00
013dfeabb4 [BE] fix typos in top-level files (#156067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156067
Approved by: https://github.com/malfet
ghstack dependencies: #156066
2025-06-16 14:56:07 +00:00
6c493e2b14 [BE] add codespell linter (#156066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156066
Approved by: https://github.com/malfet
2025-06-16 14:56:07 +00:00
2d832c9587 [MPS][Testing][BE] Fix samples for full_like (#156026)
Now that device is known, one can avoid creating tensors of `torch.double` type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156026
Approved by: https://github.com/dcci
2025-06-16 14:27:42 +00:00
831c9010c7 [BE] Remove non-existing operator from unimplemented list (#156025)
Never heard of torch.login :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156025
Approved by: https://github.com/dcci
2025-06-16 14:14:58 +00:00
38e5e81e55 Implement guard collectives (#155558)
When running a distributed job with compiler collectives enabled, if one rank recompiles while others do not, this leads to a deadlock (as not everyone will rendezvous with the compiler collective from the recompile). Although there aren't any convenient ways to cheaply solve this problem, if you are willing to force everyone to sync when evaluating guards, you can just force everyone to recompile if anyone requires a recompile. So the way guard collectives work is:

1. Perform compiled code lookup (evaluating guards)
2. Run a collective, communicating if you found a compiled code or not
3. If anyone requires recompile, force everyone to recompile

One current deficiency in the implementation is we can't conveniently track the time it takes to run this collective.

I need to test if we actually successfully are running the collective on a separate stream, or if we have to wait for user collectives to all finish.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155558
Approved by: https://github.com/Microve
2025-06-16 14:09:14 +00:00
05faba4028 Bump requests from 2.32.2 to 2.32.4 in /.github (#155491)
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.32.2...v2.32.4)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.32.4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-16 06:48:08 -07:00
d6ee5144ca [xla hash update] update the pinned xla hash (#156064)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156064
Approved by: https://github.com/pytorchbot
2025-06-16 11:11:10 +00:00
8142a02860 [C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)
Fixes #155668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155900
Approved by: https://github.com/ngimel
2025-06-16 10:55:47 +00:00
bf7e290854 Add __main__ guards to jit tests (#154725)
This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs.

In jit tests:

- Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run.
- Raise a RuntimeError on tests which have been disabled (not run)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154725
Approved by: https://github.com/clee2000
2025-06-16 10:28:45 +00:00
f810e98143 [ONNX] Update default opset to 18 (#156023)
Update default opset for the torchscript exporter to 18 to match the dynamo exporter, because support was actaully added and tested in https://github.com/pytorch/pytorch/pull/118828. In the next version we should plan to update to opset 21 or higher. This change also removes the hard limit on the torchscript exporter for more flexibility.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156023
Approved by: https://github.com/Skylion007
2025-06-16 08:40:49 +00:00
39c605e8b3 remove allow-untyped-defs from context.py (#155622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155622
Approved by: https://github.com/Skylion007
2025-06-16 07:38:34 +00:00
d9799a2ee7 Support boolean tensor for torch.fused_moving_avg_obs_fake_quant on CUDA (#153699)
Fixes #153310

As the title

**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fused_obs_fake_quant_moving_avg
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153699
Approved by: https://github.com/mingfeima, https://github.com/jerryzh168
2025-06-16 07:10:06 +00:00
156b28e62a [audio hash update] update the pinned audio hash (#155648)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155648
Approved by: https://github.com/pytorchbot
2025-06-16 03:57:28 +00:00
c620d0b5c7 convert: rst to myst pr2/2 (#155911)
Fixes #155038
parent [PR](https://github.com/pytorch/pytorch/pull/155375) (made two PRs to pass sanity check)
this PR converts the following three .rst files with the mentioned referenced in each file

- [torch.compiler_faq](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_faq.rst)
  - torch.compiler_troubleshooting
  - nonsupported_numpy_feats
  - torchdynamo_fine_grain_tracing

- [torch.compiler_fine_grain_apis](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_fine_grain_apis.rst)
  - None

- [torch.compiler_get_started](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_get_started.rst)
  - torch.compiler_overview
  - torch.compiler_api
  - torchdynamo_fine_grain_tracing

I made the suggested edits by the maintainers as commented in the parent PR
(used git mv on all files, yet it still appeared as delete-create action)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155911
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-16 00:44:44 +00:00
c83041cac2 [test][triton pin] add device-side TMA tests (AOTI + test_triton_kernels) (#155827)
Tests added:
```
python test/inductor/test_triton_kernels.py -k test_on_device_tma
python test/inductor/test_triton_kernels.py -k test_add_kernel_on_device_tma
python test/inductor/test_aot_inductor.py -k test_triton_kernel_on_device_tma
```

These pass on Triton 3.3 but not yet on Triton 3.4 (note: to support tests for both Triton versions, there's two triton kernels - one for old api and one for new api - and a given version of the test will only run if that version of the API is available).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155827
Approved by: https://github.com/FindHao
ghstack dependencies: #155777, #155814
2025-06-15 20:24:19 +00:00
bc9b8ea230 [user triton] JIT inductor support for new host-side TMA api (#155814)
This PR adds JIT inductor support for user-defined triton kernels using the new host-side TMA api.

* handle TensorDescriptor.from_tensor in ir.py
* codegen TensorDescriptor.from_tensor in wrapper.py
* generate the right signature for functions that take TensorDescriptor arguments (i.e. in the @triton_heuristics.user_autotune decorator)

AOTI support is not implemented yet.

Tests: ran test_triton_kernels.py w/ both Triton 3.3 and 3.4 and there were no failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155814
Approved by: https://github.com/aakhundov
ghstack dependencies: #155777
2025-06-15 20:24:19 +00:00
b7c95acc6c [user triton] triton_kernel_wrap support for new host-side TMA API (#155777)
This adds support for user-defined triton kernels using TensorDescriptor.from_tensor into triton_kernel_wrap: i.e. storing metadata about the TMA descriptors and doing mutation analysis.

Major changes:
* TMADescriptorMetadata has changed: previously it was a dict[str, tuple[list[int], list[int], int]]. But now there are two metadata formats: one for experimental API and one for stable API. Now the metadata format is dict[str, tuple[str, tuple[...]]], where tuple[...] is tuple[list[int], list[int], int] for experimental and tuple[list[int],] for stable API. And then most handling of the metadata has to be branched based on whether the metadata represents a stable or experimental TMA descriptor
* mutation analysis: unlike experimental TMA (where the mutation analysis / ttir analysis pretends that the TMA descriptor is actually just a tensor), we need to construct an actual TMA descriptor before getting the Triton frontend to create the TTIR (otherwise assertions fail). A TensorDescriptor (i.e. stable TMA API descriptor) passed into a python triton kernel actually turns into 1 + 2*N parameters in the TTIR (for a rank-N tensor), so the arg list also needs to be patched for this reason (in generate_ttir)
* mutation analysis: now we also need to pass tma_descriptor_metadata into the mutation analysis, in order to create the TMA descriptors that are passed into the frontend code (ie. the previous point). This is why all the mutation tests are modified with an extra return value (the tma_descriptor_metadata)

Inductor is not modified (Inductor just errors out if you use a stable API tma descriptor). This will be the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155777
Approved by: https://github.com/aakhundov
2025-06-15 20:24:19 +00:00
54976bca10 [dynamo] Provide helper functions for guard filter hook (#155083)
Collection of ready-made guard filters. One issue is that they are not composable - `filter1(filter2(guard))`. On the other hand, they are easy to use.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155083
Approved by: https://github.com/zhxchen17, https://github.com/jansel
2025-06-15 17:49:36 +00:00
0935a97d95 [Dynamo] Add torch.accelerator API to trace_rules (#155884)
# Motivation
- Add binding API and non-binding API in torch.accelerator to trace rules.
- Add some function in torch.accelerator to const fold functon list for Dynamo capature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155884
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #155787, #155788
2025-06-15 17:09:57 +00:00
b51d803785 [Dynamo] Add XPU API to trace_rules (#155788)
# Motivation
- Add binding API and non-bindling API to trace rules for XPU;
- Add some XPU API to the const fold function for Dynamo capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155788
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #155787
2025-06-15 17:09:57 +00:00
69acba2b19 [Dynamo] Add generic and XPU-specific Stream&Event in UserDefineClass (#155787)
# Motivation
- Add XPU-specific Stream and Event to in graph calss list for Dynamo capture.
- Add generic Stream and Event to i graph class list for Dynamo capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155787
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-06-15 17:09:57 +00:00
53cd18f6b3 Update gradient behavior note in torch.amin and torch.amax (#155071)
Fixes #155048

The behavior of `min` and `max` were changed in #43519. The note about gradient behavior in torch.amin and torch.amax docs are updated to reflect this change:

New note:
`amax, amin, max(dim), min(dim) evenly distributes gradient between equal values
        when there are multiple input elements with the same minimum or maximum value.`

cc - @spzala @svekars @soulitzer @sekyondaMeta @AlannaBurke @ezyang @gqchen @nikitaved @Varal7 @xmfan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155071
Approved by: https://github.com/soulitzer
2025-06-15 16:09:31 +00:00
655b3b14ff [executorch hash update] update the pinned executorch hash (#156007)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156007
Approved by: https://github.com/pytorchbot
2025-06-15 04:51:37 +00:00
517d2995e0 Add__int__ and __float__ methods to _sympy.functions.Identity (#155873)
Fixes #155688

Root Cause:
in [`torch/_inductor/index_propagation.py`](f151b20123/torch/_inductor/index_propagation.py (L57-L68))
When creating a `TypedExpr` from an `Identity` (a `torch.utils._sympy.functions.Identity`, not a `sympy.matrices.expressions.Identity `) and the inner value of the identity, `Identity.args[0]`, is any torch int type, the `TypedExpr.__post_init__` method tries to cast the Identity object to a python `int`.  This is where to `TypeError` from the issue was raised, because Identity does not know how to cast to an `int`.

Fix:
Define `__int__` method for `torch.utils._sympy.functions.Identity`.
wlog for `float`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155873
Approved by: https://github.com/williamwen42
2025-06-15 04:24:40 +00:00
6ebe9a4f47 [BE][Ez]: Optimize nvshmem alloc with missing move (#156000)
Saw this in another PR where there was a missing move on this potentially very hot path with

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156000
Approved by: https://github.com/kwen2501, https://github.com/cyyever
2025-06-15 03:04:08 +00:00
32eee8ed22 [SymmMem] Add nvshmem_free (#155975)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Calling `nvshmem_free` when an `NVSHMEMAllocation` is being destructed.

Use a `is_finalizing()` as a guard as done in `CUDASymmetricMemory.cu` to avoid "driver shutting down" error (destruction fiasco).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155975
Approved by: https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971
2025-06-15 01:23:49 +00:00
b8aee84fb9 [c10d][fr] Shrink the range of mutex lock to avoid deadlock (#155949)
While looking into a case when FR dump (actual dump not monitoring thread) takes 30 mins, I realized that our global write lock is grabbed too early so the second effort to dump FR without stack trace will fail because of a deadlock because the global write lock is still hold. So we should only grab the lock when we are ready to write so that we are less likely to keep the lock forever. Also I did an audit to the lock within FR as well and found that there is one place we can shrink as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155949
Approved by: https://github.com/Skylion007
2025-06-15 00:37:42 +00:00
3159ee2ad3 Update test_schedule_multiproc to use world_size=2 (#155921)
The multiproc schedule tests previously ran with world_size=2, and PP tests became flakier due to the longer pipeline execution, this is restoring previously behavior. This will fix the tests (https://github.com/pytorch/pytorch/issues/154373, https://github.com/pytorch/pytorch/issues/154391, https://github.com/pytorch/pytorch/issues/154408, https://github.com/pytorch/pytorch/issues/154443, https://github.com/pytorch/pytorch/issues/154481

In follow up PRs I will refactor the tests and move some tests to use large world sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155921
Approved by: https://github.com/fduwjj, https://github.com/Skylion007
ghstack dependencies: #155920
2025-06-15 00:24:18 +00:00
8e1471bdc9 Allow MultiProcContinuousTest to set world_size (#155920)
`MultiProcContinuousTest` will automatically set world_size to number of devices. This change allows this attribute to be modified by the derived test class

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155920
Approved by: https://github.com/fduwjj
2025-06-15 00:24:17 +00:00
9bd42c1570 [Cutlass] Fix buffer missing issues (#155897)
Handles constants and constant folding with aoti.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155897
Approved by: https://github.com/henrylhtsang
2025-06-15 00:08:50 +00:00
a35b3a9b95 [cutlass backend][forward fix] use _cuda_compiler path to check if nvcc exists (#155939)
Differential Revision: D76571828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155939
Approved by: https://github.com/Skylion007, https://github.com/masnesral
2025-06-15 00:01:57 +00:00
eqy
47c8810b52 [cuBLASLt][cuBLAS] Support 2D bias and beta != 1.0 in cuBLASLt (#154170)
Fixes https://github.com/pytorch/pytorch/issues/153590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154170
Approved by: https://github.com/malfet
2025-06-14 23:34:31 +00:00
0fa361e429 [ez] fix typo in _inductor/scheduler.py (#155996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155996
Approved by: https://github.com/Skylion007
ghstack dependencies: #155982
2025-06-14 21:21:35 +00:00
77ac3a0965 [SymmMem] Remove wrappers around nvshmem APIs (#155971)
`NVSHMEMSymmetricMemory.cu` and `nvshmem_extension.cu` are under the same compilation condition now (i.e. only when `USE_NVSHMEM=True`), see https://github.com/pytorch/pytorch/blob/main/caffe2/CMakeLists.txt#L1013-L1018.

Therefore there is no need to build an extra layer to hide dependency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155971
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968
2025-06-14 19:58:09 +00:00
2c0d94a7de [SymmMem] Remove unused ptr_to_symm_mem_ (#155968)
No code enqueues entries to `ptr_to_symm_mem_`, thus it is always empty.
This PR removes it and supports relying functionalities via the `allocations_` map.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155968
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835
2025-06-14 19:57:06 +00:00
a317c63d1b [BE]: Update NCCL to 2.27.3 (#155233)
Fixes: https://github.com/pytorch/pytorch/issues/155052 and https://github.com/pytorch/pytorch/issues/153517

This upgrade is needed to effectively use those symmetric memory kernels anyway. Also fixes some nasty NCCL bugs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155233
Approved by: https://github.com/nWEIdia, https://github.com/kwen2501, https://github.com/atalman, https://github.com/eqy
2025-06-14 19:20:31 +00:00
794ef6c9b8 Enable manywheel build and smoke test on main branch for ROCm (#153287)
Fixes issue of not discovering breakage of ROCm wheel builds until the nightly job runs e.g. https://github.com/pytorch/pytorch/pull/153253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153287
Approved by: https://github.com/jeffdaily
2025-06-14 19:14:31 +00:00
5285d10243 remove duplicated pybind flag in mps code (#155936)
gcc14 (at least) warns that this is already defined
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155936
Approved by: https://github.com/Skylion007
2025-06-14 18:41:12 +00:00
e95e8eed0a mypy 1.16.0 (#155821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155821
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-06-14 18:18:43 +00:00
ce79056471 Custom FX pass for inductor's backend registration (#154841)
This PR is related to RFC #153532. It is an extension to Inductor's backend registration interface to allow to register custom FX passes by the backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-06-14 17:29:54 +00:00
603a54a9b3 Unify dynamic shapes APIs naming 2 (expect_true and check) (#155776)
The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function.

This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically,
-  it introduces size_vars.expect_true and size_vars.check.
- guard_lt becomes check_lt
- guard_leq becomes check_leq
- guard_equals becomes check_equals

I am also seeing a couple of wrong usages !! that i will fix  in the next PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155776
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154774
2025-06-14 17:13:53 +00:00
c219dbd2fc avoid gso in has_internal_overlap (#155870)
existing comment already explains it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155870
Approved by: https://github.com/bobrenjc93
2025-06-14 17:13:20 +00:00
279cae52e7 [BE][PYFMT] migrate PYFMT for torch/ao/ to ruff format (#148185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148185
Approved by: https://github.com/ezyang
2025-06-14 16:47:04 +00:00
cyy
c2beeadeb4 [Reland] Use 3.27 as the minimum CMake version (#154783)
Reland of #153153, which was incidentally closed.
Update the minimum CMake version to 3.27 because of it provides more CUDA targets such as CUDA::nvperf_host so that it is possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It's also possible to facilitate future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
2025-06-14 16:37:51 +00:00
370fc49dde Handle aten.to at submodule boundaries (#153972)
Summary: #buildall

Test Plan: CI

Differential Revision: D74582970

When we decompose to inference IR, aten.to can sometimes disappear. As a result, export module call graph tree will start containing dead nodes because previous provenance tracking is insufficient. This PR fixes that. The caveat is that this won't work in general for tensor subclass inputs to submodule that user wants to preserve signature because we always desugar the tensor subclass into constituent tensors in inference IR making it impossible to preserve the original calling convention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153972
Approved by: https://github.com/avikchaudhuri
2025-06-14 16:13:29 +00:00
d42c11819f [executorch hash update] update the pinned executorch hash (#153436)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153436
Approved by: https://github.com/pytorchbot
2025-06-14 16:09:41 +00:00
70b68caf58 Fix logging of failed tensorified ops (#155982)
Tested via

```
TORCH_LOGS="torch.fx.passes._tensorify_python_scalars" tlp python test/inductor/test_torchinductor_dynamic_shapes.py -k test_unspecialized_float_fallback_symint_specialization
I0613 21:50:38.247000 4163366 torch/fx/passes/_tensorify_python_scalars.py:314] [0/1] Failed to tensorify <built-in function pow>
I0613 21:50:38.247000 4163366 torch/fx/passes/_tensorify_python_scalars.py:314] [0/1] Failed to tensorify <built-in function floor>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155982
Approved by: https://github.com/flaviotruzzi
2025-06-14 14:23:54 +00:00
e375d21bb9 [Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)
Fixes #154328

**Summary**
Fail reason:
The input value is infinity in float and it has undefined behavior to convert it to int64_t. On X86, it will be converted to the min value of int64_t, which is not expected.

Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.

**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-06-14 14:12:38 +00:00
1a568f4e5d [BE][Easy] bump isort to 6.0.1 (#155919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155919
Approved by: https://github.com/Skylion007
ghstack dependencies: #155909, #155914
2025-06-14 12:29:01 +00:00
5467765990 [BE][Easy] bump ruff to 0.11.13 (#155914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155914
Approved by: https://github.com/Skylion007
ghstack dependencies: #155909
2025-06-14 12:29:01 +00:00
736a15a81a [torchgen] Fix ruff format for # fmt: skip comment for function signature (#155909)
See also:

- astral-sh/ruff#18658

This fix follows the suggestion from:

- https://github.com/astral-sh/ruff/issues/18658#issuecomment-2970130276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155909
Approved by: https://github.com/ezyang
2025-06-14 12:28:55 +00:00
d859e65826 [DCP][Ez]: Fix broadcast_object bug in DCP utils (#155912)
Fixes #152310. Broadcast_object is now symmetric with gather_object and scatter_object. It was likely a typo that wasn't fixed in https://github.com/pytorch/pytorch/pull/147675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155912
Approved by: https://github.com/ezyang
2025-06-14 12:14:14 +00:00
596b418391 [BE][PYFMT] migrate PYFMT for {torch,test}/{nn,optim}/** to ruff format (#144548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144548
Approved by: https://github.com/ezyang
2025-06-14 11:27:04 +00:00
3e38feb05f [inductor] Add configuration control for CUTLASS operation selection. (#155770)
Added a new configuration option `cutlass_enabled_ops` that allows users to control which operations use CUTLASS lowerings. By default, CUTLASS is enabled for all operations (maintaining backward compatibility), but users can now selectively enable it only for specific operations to optimize compilation time.

**Fixes #155718**

## Usage Examples

```bash
# Enable CUTLASS for all operations (default behavior)
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="ALL"

# Enable CUTLASS only for matrix multiplication operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="mm,addmm"

# Enable CUTLASS only for batch operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="bmm,baddbmm"

# Disable CUTLASS for all operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS=""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155770
Approved by: https://github.com/henrylhtsang
2025-06-14 08:19:54 +00:00
1982ec2d22 Add api info for torch._C._nn.pyi (#148405)
APis involved are as followed:

- adaptive_avg_pool2d
- adaptive_avg_pool3d
- binary_cross_entropy
- col2im

ISSUE Related:
https://github.com/pytorch/pytorch/issues/148404
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148405
Approved by: https://github.com/ezyang
2025-06-14 07:57:07 +00:00
7070ab3180 use guard_or_false in checkInBoundsForStorage (#155874)
this was added in https://github.com/pytorch/pytorch/pull/147354, the comment already justify guard_or_false
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155874
Approved by: https://github.com/bobrenjc93
2025-06-14 07:21:26 +00:00
d79651571f assume sparse tensor not coalesced_ gsv -> guard_or_false. (#155869)
preserve current behavior. Generalize it such that no need for torch._check_is_size to opt into this,
and make it work for more complex unbacked sizes with ranges [-inf, inf]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155869
Approved by: https://github.com/bobrenjc93
2025-06-14 07:19:56 +00:00
e7da21806f [Easy][BE] update recommanded VS Code settings (#152760)
Changes:

- Remove old invalid settings and replace with new settings.
- Add commonly used VS Code extensions to support `cmake`, `ruff`, `mypy`, `flake8`, `editorconfig`, and spell checker. Also, add corresponding settings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152760
Approved by: https://github.com/drisspg
2025-06-14 07:11:10 +00:00
cyy
1393f71e07 Use CUDA language in generated CMakeLists.txt from cpp_builder.py (#155979)
The CMake CUDA module has been deprecated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155979
Approved by: https://github.com/ezyang
2025-06-14 06:52:51 +00:00
c843909d9e [flex attention][triton pin] use new TMA API (#155771)
Triton 3.4 will remove the experimental TMA APIs: https://github.com/triton-lang/triton/pull/6488. Ahead of this, we are **replacing the experimental TMA API usage with the stable TMA API** in flex attention. This means that **flex attention TMA will stop working with Triton 3.2 or Triton 3.3/3.3.1** for now (but it should work for Triton 3.4 in the PyTorch 2.8 release, and Meta-internal triton 3.3.1fb, which have the new TMA API).

This PR does the following:
* replace the experimental TMA APIs with the stable TMA APIs
* remove the workspace args.

Testing: I ran test/inductor/test_flex_attention.py on a H100 with @mandroid6's PR #153662 patched in to turn on TMA [TODO: confirm results once all the local tests pass, but from the first 100 tests I ran locally, all the failing tests were also failing on #153662 alone]

Note: When #153662 lands, turning on TMA support by default, it should be checking specifically for stable TMA API support (commented on PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155771
Approved by: https://github.com/mandroid6, https://github.com/nmacchioni
2025-06-14 06:34:16 +00:00
92b7ed6d07 Add Helion softmax test (#155976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155976
Approved by: https://github.com/jansel
2025-06-14 05:53:21 +00:00
9338d85d45 [ProcessGroupNCCL] Added log when fr dump triggered from pipe (#155754)
Summary:
TSIA

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
eyes

Sandcastle run

Differential Revision: D76472617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155754
Approved by: https://github.com/fduwjj, https://github.com/Skylion007
2025-06-14 04:34:29 +00:00
bf897b4cea [ONNX] Support 0/1 on dynamic dimension (#155717)
Previous to this PR, the exporter does not support dynamic dim with traced inputs containing 0/1. But after https://github.com/pytorch/pytorch/pull/148696, this is supported by torch.export.export. This PR adds the patch to torch.onnx.export.

However, there is still known pitfall existing because the difference between eager and export. Compiler needs to decide the exported shape ahead, and whether the "hidden broadcasting" being applied results in different export.

For example,

```python
import torch

class Model(torch.nn.Module):
    def forward(self, x, y, z):
        return torch.cat((x, y), axis=1) + z

model = Model()
x = torch.randn(2, 3)
y = torch.randn(2, 5)
z = torch.randn(1, 8)
model(x, y, z)

DYN = torch.export.Dim.DYNAMIC
ds = {0: DYN, 1: DYN}

with torch.fx.experimental._config.patch(backed_size_oblivious=True):
    ep = torch.export.export(model, (x, y, z), dynamic_shapes=(ds, ds, ds))

print(ep)
"""
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[s7, s16]", y: "f32[s7, s43]", z: "f32[s7, s16 + s43]"):
             #
            sym_size_int: "Sym(s7)" = torch.ops.aten.sym_size.int(x, 0)
            sym_size_int_1: "Sym(s16)" = torch.ops.aten.sym_size.int(x, 1)
            sym_size_int_2: "Sym(s7)" = torch.ops.aten.sym_size.int(y, 0)
            sym_size_int_3: "Sym(s43)" = torch.ops.aten.sym_size.int(y, 1)
            sym_size_int_4: "Sym(s7)" = torch.ops.aten.sym_size.int(z, 0)
            sym_size_int_5: "Sym(s16 + s43)" = torch.ops.aten.sym_size.int(z, 1)

             # File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
            cat: "f32[s7, s16 + s43]" = torch.ops.aten.cat.default([x, y], 1);  x = y = None

             #
            eq: "Sym(True)" = sym_size_int_2 == sym_size_int;  sym_size_int_2 = None
            _assert_scalar_default = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s58, s35) on node 'eq'");  eq = _assert_scalar_default = None
            add_1: "Sym(s16 + s43)" = sym_size_int_1 + sym_size_int_3;  sym_size_int_1 = sym_size_int_3 = None
            eq_1: "Sym(True)" = add_1 == sym_size_int_5;  add_1 = sym_size_int_5 = None
            _assert_scalar_default_1 = torch.ops.aten._assert_scalar.default(eq_1, "Runtime assertion failed for expression Eq(s16 + s43, s23) on node 'eq_1'");  eq_1 = _assert_scalar_default_1 = None
            eq_2: "Sym(True)" = sym_size_int == sym_size_int_4;  sym_size_int = sym_size_int_4 = None
            _assert_scalar_default_2 = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(s35, s7) on node 'eq_2'");  eq_2 = _assert_scalar_default_2 = None

             # File: /home/titaiwang/pytorch/test_export.py:7 in forward, code: return torch.cat((x, y), axis=1) + z
            add: "f32[s7, s16 + s43]" = torch.ops.aten.add.Tensor(cat, z);  cat = z = None
            return (add,)

Graph signature:
    # inputs
    x: USER_INPUT
    y: USER_INPUT
    z: USER_INPUT

    # outputs
    add: USER_OUTPUT

Range constraints: {s7: VR[0, int_oo], s16: VR[0, int_oo], s43: VR[0, int_oo], s16 + s43: VR[0, int_oo]}
"""
ep.module()(x, y, z)
"""
Traceback (most recent call last):
  File "/home/titaiwang/pytorch/test_export.py", line 20, in <module>
    ep.module()(x, y, z)
  File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 840, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 416, in __call__
    raise e
  File "/home/titaiwang/pytorch/torch/fx/graph_module.py", line 403, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1873, in _call_impl
    return inner()
           ^^^^^^^
  File "/home/titaiwang/pytorch/torch/nn/modules/module.py", line 1800, in inner
    args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/_dynamo/eval_frame.py", line 895, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/titaiwang/pytorch/torch/export/_unlift.py", line 83, in _check_input_constraints_pre_hook
    _check_input_constraints_for_graph(
  File "/home/titaiwang/pytorch/torch/_export/utils.py", line 426, in _check_input_constraints_for_graph
    _check_symint(
  File "/home/titaiwang/pytorch/torch/_export/utils.py", line 338, in _check_symint
    raise RuntimeError(
RuntimeError: Expected input at *args[2].shape[0] to be equal to 2, but got 1
"""
```

The explanation (from @pianpwk):

In the model we have `return torch.cat((x, y), axis=1) + z`.

Before this add is executed, the LHS has shape `[s7, s16 + s43]`, while the z has shape, say `[s8, s16 + s43]` (we don't know `s7 == s8` yet). When we execute this add, the compiler is making a decision: does broadcasting apply or not? The choices are:

1) Yes -> then we must specialize `s8` to 1
2) No -> then this element-wise op is only valid if the shapes match up, and we assume `s7 == s8`.

Unfortunately export can only follow one of these options, and in avoiding 0/1 specialization (because a dynamic dimension was requested), it assumed case 2).

For an operation like a + b, in eager semantics it's possible to have all options (either a == 1 OR b == 1 OR a == b), but with export we need to make a decision on what the output shape of this operation is, and keeping all branches alive requires expressing the output shape with a conditional (e.g. output shape = `a if b == 1 else b`), which is pretty hard for the compiler to reason about.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155717
Approved by: https://github.com/justinchuby
2025-06-14 04:04:47 +00:00
187828dcb4 [OpenReg][5/N] add set_.source_Storage for openreg (#155191)
**Changes**:
- add set_.source_Storage for openreg to support torch.load & torch.serialization
- uncomment some related tests in the test_openreg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155191
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019, #154106, #154181, #155101
2025-06-14 03:44:32 +00:00
e4fd0bf771 [OpenReg][4/N] Migrate cpp_extensions_open_device_registration to OpenReg (#155101)
As the title stated.

**Involved testcases**:
- test_open_device_storage_pin_memory
- test_open_device_serialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155101
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019, #154106, #154181
2025-06-14 03:44:32 +00:00
1e7989cad5 [OpenReg][3/N] Migrate cpp_extensions_open_device_registration to OpenReg (#154181)
As the title stated.

**Involved testcases**:
- test_open_device_quantized
- test_open_device_random
- test_open_device_tensor
- test_open_device_packed_sequence
- test_open_device_storage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154181
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019, #154106
2025-06-14 03:44:32 +00:00
7e5f29b2de [OpenReg][2/N] Migrate cpp_extensions_open_device_registration to OpenReg (#154106)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154106
Approved by: https://github.com/nareshrajkumar866, https://github.com/albanD
ghstack dependencies: #153947, #154018, #154019
2025-06-14 03:44:32 +00:00
676abded4b [OpenReg][1/N] Migrate cpp_extensions_open_device_registration to OpenReg (#154019)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154019
Approved by: https://github.com/albanD
ghstack dependencies: #153947, #154018
2025-06-14 03:44:32 +00:00
d3d469092f [Openreg] Split TestOpenReg into two parts (#154018)
----

- TestPrivateUse1: testing 3rd accelerator integration mechinasm itself
- TestOpenReg: testing openreg itself
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154018
Approved by: https://github.com/albanD
ghstack dependencies: #153947
2025-06-14 03:44:31 +00:00
cafd2344d6 [OpenReg] add manual_seed related capabilities (#153947)
**Changes**:
- Add manual_seed manual_seed_all initial_seed and so on
- Delay execution of self._lazy_init more deeply
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153947
Approved by: https://github.com/albanD
2025-06-14 03:44:31 +00:00
297805fd8f Typo fixes for "overridden" in comments and function names (#155944)
This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944
Approved by: https://github.com/Skylion007
2025-06-14 03:37:38 +00:00
ca3cabd24a Convert to markdown: named_tensor.rst, nested.rst, nn.attention.bias.rst, nn.attention.experimental.rst, nn.attention.flex_attention.rst #155028 (#155696)
Fixes #155028

This pull request updates the documentation  by transitioning from .rst to .md format. It introduces new Markdown files for the documentation of named_tensor, nested, nn.attention.bias, nn.attention.experimental, and nn.attention.flex_attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155696
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-14 03:32:00 +00:00
cdfa33a328 [nativert] move execution frame to torch (#155830)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76369008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155830
Approved by: https://github.com/zhxchen17
2025-06-14 03:28:55 +00:00
a6084b71ed [BE][1/X] Phase out usage of use_max_autotune() (#155847)
`use_max_autotune()` is likely not what people expect it to be;

Originally, `use_max_autotune()` was setup to decide when we should include Triton templates as choices in GEMM autotuning. As expected, `use_max_autotune()=True` if `max_autotune=True` or `max_autotune_gemm=True`. However, with the addition of the offline GEMM autotuning cache two years back `use_max_autotune()=True` also in the case that `search_autotune_cache=True`; in this case though, `search_autotune_cache=True` should never trigger autotuning.

Over time, people have used `use_max_autotune()` likely without realizing that this gives unexpected behavior if `search_autotune_cache=True`. We could rename the method to be more clear, but prefer to phase it out entirely for maximal clarity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155847
Approved by: https://github.com/jingsh, https://github.com/masnesral
2025-06-14 03:16:20 +00:00
7982b8c703 [BE][AOTI] Remove duplicate schema for ExternKernelNode (#155867)
Summary: The definition of `ExternKernelNode` and `ExternKernelNodes` schema in `torch/_export/serde/aoti_schema.py` is a complete duplicate of the ones in `torch/_export/serde/schema.py`.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76558294

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155867
Approved by: https://github.com/jingsh
2025-06-14 02:03:27 +00:00
8f5f01bf19 [BE][AOTI] Combine DynamicArgType in Proxy Executors (#155871)
Summary:
As title.

Move the duplicate definition to the base class header `proxy_executor.h`

Test Plan:
CI

Rollback Plan:

Differential Revision: D76559180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155871
Approved by: https://github.com/yushangdi
2025-06-14 01:52:43 +00:00
4574b39aa4 Revert "[BE]: Sync cusparselt 12.9 with static build and other cuda 12 (#155709)"
This reverts commit bbbced94a43cf764ddfe719e7d4c161a3992830c.

Reverted https://github.com/pytorch/pytorch/pull/155709 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/15645591737/job/44082402642) [HUD commit link](bbbced94a4) landrace with 155819? easy forward fix but its the end of the week so idk when id get a review ([comment](https://github.com/pytorch/pytorch/pull/155709#issuecomment-2972094849))
2025-06-14 01:43:16 +00:00
c10339559d [BE] Better uv detection in pip init (#155972)
If one has some UV and non-UV environments locally, one shoudl call `uv
pip install` only on the UV-enabled ones, which could be detected by
checking if `uv/python` path is present in `sys.base_prefix`

Fixes https://github.com/pytorch/pytorch/issues/152999
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155972
Approved by: https://github.com/janeyx99
2025-06-14 01:35:50 +00:00
d7e3c9ce82 Revert "Enable manywheel build and smoke test on main branch for ROCm (#153287)"
This reverts commit 3b6569b1ef4b9ff25f5b75fe0a216d6d084d573f.

Reverted https://github.com/pytorch/pytorch/pull/153287 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/15646152483/job/44083912145) [HUD commit link](3b6569b1ef) ([comment](https://github.com/pytorch/pytorch/pull/153287#issuecomment-2972088294))
2025-06-14 01:32:27 +00:00
c165b36a31 [MTIA Aten Backend] Migrate relu / relu_ (#155927)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate relu / relu_.

Note: Pytorch in-tree implementation delegates relu to clamp_min, so no more need to launch relu kernel.
https://www.internalfb.com/code/fbsource/[0c9eedb2fc8f99bcca00cb67a5738cfe07e39349]/fbcode/caffe2/aten/src/ATen/native/Activation.cpp?lines=512-520

Let me know if any concern about this

Differential Revision: [D75803582](https://our.internmc.facebook.com/intern/diff/D75803582/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155927
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632, #154659, #155925, #155926
2025-06-14 01:24:48 +00:00
50f6431e0a [MTIA Aten Backend] Migrate sqrt.out / rsqrt.out / sin.out / silu.out (#155926)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate sqrt.out / rsqrt.out / sin.out / silu.out

Differential Revision: [D75801847](https://our.internmc.facebook.com/intern/diff/D75801847/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155926
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632, #154659, #155925
2025-06-14 01:24:48 +00:00
7b11cb8c12 [MTIA Aten Backend] Migrate tanh.out and tanh_backward.grad_input (#155925)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate tanh.out and tanh_backward.grad_input

Differential Revision: [D75769242](https://our.internmc.facebook.com/intern/diff/D75769242/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155925
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632, #154659
2025-06-14 01:24:41 +00:00
0185d3a5ed [MTIA Aten Backend] Migrate bitwise_or.Tensor_out (#154659)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate bitwise_or.Tensor_out from out-of-tree to in-tree.

Differential Revision: [D75629937](https://our.internmc.facebook.com/intern/diff/D75629937/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154659
Approved by: https://github.com/egienvalue
ghstack dependencies: #154632
2025-06-14 01:24:34 +00:00
163cdaaa3a [MTIA Aten Backend] Migrate bitwise_not.out (#154632)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate bitwise_not.out from out-of-tree to in-tree.

Differential Revision: [D75610643](https://our.internmc.facebook.com/intern/diff/D75610643/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154632
Approved by: https://github.com/egienvalue
2025-06-14 01:24:27 +00:00
04cf2c9d24 fix tensor print behavior for MAIA (#155609)
This pull request fixes the tensor print behavior for `MAIA` to account for the absence of double-precision support in its backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155609
Approved by: https://github.com/soulitzer
2025-06-14 01:04:12 +00:00
dabb55baff Add resolve in add decomp to enable view (#153945)
Fixes #148950.

During the construction of graph and running the node of add under [interpreter](/github.com/pytorch/pytorch/blob/d68d4d31f4824f1d1e0d1d6899e9879ad19b0754/torch/fx/interpreter.py#L301
), the functional argument of conj complex tensor gets cloned. This result in always having *.is_conj()* evaluted to false in decomposition function.

Propose a fix of calling resolve_conj() in the decomposition of complex tensor add.

Test as below
`python test/dynamo/test_repros.py ReproTests.test_add_complex_conj`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153945
Approved by: https://github.com/jansel
2025-06-14 00:41:50 +00:00
fec571cfd4 [BE][CI] Remove hardshrink integer exclusions (#155965)
As they are not called anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155965
Approved by: https://github.com/dcci
2025-06-14 00:32:57 +00:00
38410cf9b5 Fix DDPOptimizer issue on static tensor index (#155746)
We rely on `_try_get_metadata_from_dynamo()` to get static input indices. When the meta info is missing, it just returns an empty list of static input indices. This wrong list of static input indices lead to repeated cudagraph re-recording, which looks like a hang from the user perspective. bc3972b80a/torch/_functorch/aot_autograd.py (L1025-L1031)

The root cause is `split_module` in DDP Optimizer loses meta info and gm attributes. This PR fixes the issue by propagating these metadata from original module to submodules.
bc3972b80a/torch/_dynamo/backends/distributed.py (L515-L517)

Fixes #140395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155746
Approved by: https://github.com/xmfan, https://github.com/bdhirsh
2025-06-14 00:15:58 +00:00
3b6569b1ef Enable manywheel build and smoke test on main branch for ROCm (#153287)
Fixes issue of not discovering breakage of ROCm wheel builds until the nightly job runs e.g. https://github.com/pytorch/pytorch/pull/153253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153287
Approved by: https://github.com/jeffdaily
2025-06-14 00:05:57 +00:00
bbbced94a4 [BE]: Sync cusparselt 12.9 with static build and other cuda 12 (#155709)
followup for https://github.com/pytorch/pytorch/pull/154980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155709
Approved by: https://github.com/tinglvv, https://github.com/atalman, https://github.com/nWEIdia, https://github.com/cyyever
2025-06-13 23:10:01 +00:00
d512584718 [BE] Refactor clamp dtypes check (#155930)
By introducing `check_for_unsupported_clamp_dtypes` similar to `check_for_unsupported_isin_dtypes`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155930
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/clee2000
ghstack dependencies: #155470
2025-06-13 23:05:02 +00:00
0cb85c188f [BE] Move optional submodules checkout to its own module (#155947)
To expand it to optional eigen checkout later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155947
Approved by: https://github.com/Skylion007
2025-06-13 23:02:38 +00:00
3003c681ef Converting .rst files to .md files (#155377)
Fixes #155036
This pull request updates the documentation for several modules by transitioning from .rst to .md format, improving readability and usability. It introduces new Markdown files for the documentation of torch.ao.ns._numeric_suite, torch.ao.ns._numeric_suite_fx, AOTInductor, AOTInductor Minifier, and the torch.compiler API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155377
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 22:54:27 +00:00
799443605b Convert to markdown: distributed.tensor.parallel.rst, distributed.tensor.rst, distributions.rst, dlpack.rst (#155297)
Fixes #155019

## Description
Convert to markdown: distributed.tensor.parallel.rst, distributed.tensor.rst, distributions.rst, dlpack.rst

## Checklist
- [X] dlpack.rst converted to dlpack.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/dlpack.html)
- [X] distributions.rst converted to distributions.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributions.html)
- [X] distributed.tensor.rst converted to distributed.tensor.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributed.tensor.html)
- [X] distributed.tensor.parallel.rst converted to distributed.tensor.parallel.md --> [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155297/distributed.tensor.parallel.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155297
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 22:08:37 +00:00
764c02b78b [BE] Raise NotImplementedError (#155470)
When op is unimplemented for a specific dtype

Which makes more sense, than a RuntimeError

Example
```python
>>> import torch
>>> torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
NotImplementedError: "hardshrink_cpu" not implemented for 'Long'
```

release notes bc-breaking: After this release `NotImplementedError` exception will be raised when ATen operation is called on the combinaiton of input tensor dtypes it has not been implemented for

Mark few more unary ops as unimplemented to satisfy foreach testing error reporting consistency between CPU and CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155470
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-06-13 22:07:03 +00:00
d59ed21d0f [CI] Reuse old whl: track why failed to use the old whl (#155860)
As in title
Any other things I should track?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155860
Approved by: https://github.com/malfet
2025-06-13 22:01:31 +00:00
3596c0c77f Fix test after revert (#155946)
ex
test_dynamic_shapes.py::TestUbackedOps::test_unbacked_reshape2 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15642199583/job/44073674212) [HUD commit link](06408dae49)

started after 06408dae49d06b6146fdd9d7a37eb5dde4f5e78d

idk what the test does so maybe theres a better way to fix this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155946
Approved by: https://github.com/yangw-dev, https://github.com/huydhn, https://github.com/malfet
2025-06-13 21:52:07 +00:00
eef253d9f6 [CI] Keep going display on HUD: upload log when test fails (#155371)
I guess this is more of an RFC

Goal:
Enable keep going so that we can get information immediately for failures.  We want be aware of failures as soon as possible, especially on the main branch, this is so that reverts can happen quickly.

Proposal:
A job with `keep-going` will continue through errors in `python run_test.py`.  If a test fails, before it runs the next test, it will upload a fake log that should have enough information in it so that viewing the log will be able to tell you what failed and any stack traces/error logs, and should be able to be parsed by log classifier to get a line.

I am getting the log by concating the test logs in test/test-reports, which is all the text outputted by pytest (unless someone runs with `ci-verbose-test-logs` label).  There are obviously many things this won't catch, ex output outside of run_test.py, some output inside of run_test.py, but it should be enough.

After a log finishes, eventually its raw log is uploaded to ossci-raw-job-status s3 bucket and the log classifier will read it to do classification.  This means we will have to change log classifier to read from this bucket as well.
I'm thinking just add an input parameter to log classifier like https://github.com/pytorch/test-infra/pull/6723/files
Also upload the temp results to a temp attribute instead of the real one

To overwrite the conclusion on HUD, I'm thinking a lambda that is s3 put trigger on the fake log being put into s3, that does something similar to log classifier where it just mutates the entry 13a990b678/aws/lambda/log-classifier/src/network.rs (L85) to add a new field like "will_fail": true, and also triggers the log classifier to run

Then we change HUD/ClickHouse to point the raw log url to the alternate place, the new "will_fail" field as the conclusion, and the temp log classifier result if needed

Why always write to temp attribution/column? I am unsure about overwriting the real results with fake ones

Pros:
Not many changes outside of HUD/UI

Cons:
Lots of moving parts, lots of temp fields that will require adjustment for queries, temp fields never really get deleted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155371
Approved by: https://github.com/malfet
2025-06-13 21:21:55 +00:00
e5ed267f83 Update h100-distributed image (#155861)
Move non inductor workflows cuda 12.6->cuda 12.8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155861
Approved by: https://github.com/seemethere
2025-06-13 21:17:05 +00:00
20a74c370b Add error message with assert to topK if ndims() - dim > 4 (#155475)
Addressing #154890

Not really a proper fix but at least it's more informative than the current crash.

For a more long term solution I'm testing if we can use the TopK API released in MacOS14 as it does not have the same MPSScan op issue that the Sort and ArgSort are hitting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155475
Approved by: https://github.com/kulinseth
2025-06-13 21:10:06 +00:00
049dc48d1e fix code chunk indentation for jit_language_reference_v2.md (#155937)
Fixes https://github.com/pytorch/pytorch/issues/155023
Related PR: #155781

Description:
As discussed, this PR is a follow-up update for `jit_language_reference_v2.md` by deleting the code chunk indentation.

Checklist:

- [x]  The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x]  Only one issue is addressed in this pull request
- [x]  Labels from the issue that this PR is fixing are added to this pull request
- [x]  No unnecessary issues are included into this pull request.

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label "module: docs"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155937
Approved by: https://github.com/jingsh, https://github.com/svekars
2025-06-13 21:05:23 +00:00
731351bb4a Convert rst to markdown - optim.rst #155031 (#155813)
Fixes #155031
![image](https://github.com/user-attachments/assets/36507ca1-eb1e-4358-9e66-ce25ec8a2be1)

@pytorchbot label "docathon-h1-2025" "module: docs" "topic: not user facing" "topic: docs"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155813
Approved by: https://github.com/AlannaBurke
2025-06-13 21:03:39 +00:00
92388bb2ab [export] Remove broken check for multiple cpp files in PT2 package (#155149)
This check was recently added, but (when fixed to refer to CPP rather than library files) fails with the separate kernel and wrapper build of AOTInductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155149
Approved by: https://github.com/angelayi
2025-06-13 21:02:31 +00:00
7d1b3f599d [Docs] Convert to markdown cond.rst, config_mod.rst (#155653)
Related to #155014

Only included 2 files in this PR:

- cond.rst
- config_mod.rst

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155653
Approved by: https://github.com/svekars
2025-06-13 20:58:57 +00:00
fdf5d97fa8 [cutlass backend][ez] Log timings from prescreening (#155757)
Differential Revision: [D76474669](https://our.internmc.facebook.com/intern/diff/D76474669/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155757
Approved by: https://github.com/ColinPeppler
2025-06-13 20:44:04 +00:00
f3e6c8e834 Fix #155016 for Docathon - convert rst to markdown (#155198)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

One note is that "Created On" and "Last Updated On" banner doesn't show in the markdown files... I'm not sure if that's just an artifact of my local build though.

Fixes #155016

Docs comparison (check out the 'new' whenever docs build)

1. cuda ([old](https://docs.pytorch.org/docs/main/cuda.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155198/cuda.html))
2. cuda.tunable ([old](https://docs.pytorch.org/docs/main/cuda.tunable.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155198/cuda.tunable.html))
3. leave cudnn_persistent_rnn.rst as is because it's reused in docstrings
4. cudnn_rnn_determinism.rst as is because it's reused in docstrings.
5. data ([old](https://docs.pytorch.org/docs/main/data.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155198/data.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155198
Approved by: https://github.com/albanD, https://github.com/svekars
2025-06-13 20:24:34 +00:00
bf798a2f01 Change _hfstorage to hfstorage (#155837)
Summary: Change HF classes to not have an underscore, there-by making them public, we will add documentation to them following this

Test Plan:
ensure existing tests pass

Rollback Plan:

Differential Revision: D76364024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155837
Approved by: https://github.com/saumishr
2025-06-13 20:19:51 +00:00
77f884c2ec Optimize Tensor.backward type hints (#155656)
Fixes #81963

## Test Result

![image](https://github.com/user-attachments/assets/67380fdc-73c4-43d8-b2a5-5e16d63f4fd3)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155656
Approved by: https://github.com/soulitzer
2025-06-13 19:16:48 +00:00
06408dae49 Revert "Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757)"
This reverts commit 0029259bdfeee627181df2b9f5ff6979f65090ec.

Reverted https://github.com/pytorch/pytorch/pull/154757 on behalf of https://github.com/laithsakka due to post land issue ([comment](https://github.com/pytorch/pytorch/pull/154757#issuecomment-2971385787))
2025-06-13 19:11:43 +00:00
4628f1b7a9 [Hierarchical-Compile] Track mutations for setitem (#155880)
This fixes a bug in tensor variable where we would not do things like set the example value on setitem nodes (but these don't typically have users so it doesn't matter)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155880
Approved by: https://github.com/anijain2305
2025-06-13 18:59:31 +00:00
344731fb25 Add CUDA 12.9.1 sbsa nightly binaries (#155819)
https://github.com/pytorch/pytorch/issues/155196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155819
Approved by: https://github.com/atalman
2025-06-13 18:52:41 +00:00
ce44877961 [c10d][PGNCCL] Make watchdog thread a class (#155831)
By extracting both monitor thread and watchdog thread into a separate class this will help us learn what dependencies we have for each thread and it will kind of simplify the consolidation work for each thread (consolidating from thread per PG instance to per PG class)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155831
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2025-06-13 18:05:22 +00:00
c5d00e150a convert: rst to myst pr 1/2 (#155840)
Fixes #155038
parent [PR](https://github.com/pytorch/pytorch/pull/155375) (made two PRs to pass sanity check)
this PR converts the following two .rst files
- [torch.compiler_dynamo_overview](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_dynamo_overview.rst)
- [torch.compiler_fake_tensor](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_fake_tensor.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155840
Approved by: https://github.com/sekyondaMeta
2025-06-13 18:02:28 +00:00
36bf81e363 [BE] Fix minifier when one has multiple Python runtimes (#155918)
By using `sys.executable` instead of `"python"`

Otherwise, it fails on Ubuntu with `python not found` error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155918
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/zou3519
2025-06-13 17:55:04 +00:00
093aaccae2 convert jit_language_reference_v2.rst to jit_language_reference_v2.md (#155781)
Fixes https://github.com/pytorch/pytorch/issues/155023

Description:
converted `jit_language_reference_v2.rst` to `jit_language_reference_v2.md`
**I indented the code blocks to minimize the file difference to pass the sanity check for no more than 2000 lines of change. I will submit another PR to fix the indentation after this PR is merged.**

Checklist:

- [x]  The issue being fixed is referenced above (Fixes https://github.com/pytorch/pytorch/issues/155023)
- [x]  Only one issue is addressed in this pull request
- [x]  Labels from the issue that this PR is fixing are added to this pull request
- [x]  No unnecessary issues are included into this pull request.

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155781
Approved by: https://github.com/svekars
2025-06-13 17:33:10 +00:00
f0bee87eea [xla hash update] update the pinned xla hash (#155779)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155779
Approved by: https://github.com/pytorchbot
2025-06-13 17:13:37 +00:00
1f3cc4875c [ATen][CUDA][cuSOLVER] Add cusolverDnXsyevBatched for torch.linalg.eigh (#155695)
This PR add a new API for SYEV operation of cuSOLVER [`cusolverDnXsyevBatched`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdnxsyevbatched) which is a new alternative to [`cusolverDn<t>syevjBatched`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdn-t-syevjbatched). This API was introduced in cuSOLVER as part of 64-bit API in CUDA Tool Kit 12.6.2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155695
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-06-13 17:12:26 +00:00
b6add8c8ba [MPSInductor] Fix remainder implementation for int types (#155891)
Introduce `c10:🤘:remainder` and call it from both inductor and eager implementation, with integer specialization, which should make it much faster than before, while still compliant with Python way of rounding up negative numbers.

This allows one to remove complex type detection logic from mps codegen and rely on Metal(C++) type system to figure out input and output types.

This fixes compilation of something like
```python
@torch.compile
def f(x, y):
    return x[y % 5]
```

which beforehand failed to compile with
```
torch._inductor.exc.InductorError: SyntaxError: failed to compile
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant long* in_ptr0,
        constant float* in_ptr1,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp0 = in_ptr0[x0];
        auto tmp1 = 12;
        auto tmp2 = static_cast<float>(tmp0) - static_cast<float>(tmp1) * metal::floor(static_cast<float>(tmp0) / static_cast<float>(tmp1));
        auto tmp3 = 1024;
        auto tmp4 = static_cast<long>(tmp3);
        auto tmp5 = tmp2 + tmp4;
        auto tmp6 = tmp2 < 0;
        auto tmp7 = tmp6 ? tmp5 : tmp2;
        if ((tmp7 < 0) && (tmp7 > 1024)) return;
        auto tmp9 = in_ptr1[tmp7];
        out_ptr0[x0] = static_cast<float>(tmp9);
    }
 with program_source:372:28: error: array subscript is not an integer
        auto tmp9 = in_ptr1[tmp7];
                           ^~~~~
```

This fixes fail_to_compile for GPT2ForSequenceClassification Huggingface model using `transformers==4.44.2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155891
Approved by: https://github.com/manuelcandales
2025-06-13 16:42:56 +00:00
9462106b7e [nativert] Move graph_passes to nativert (#155411)
Summary: Move graph_passes to nativert

Test Plan:
CI

Rollback Plan:

Differential Revision: D76205048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155411
Approved by: https://github.com/zhxchen17
2025-06-13 16:41:01 +00:00
338a8c7853 fix slice w/ dynamic shapes (#153131)
Summary: guard_size_oblivious has side effects that'll result in invalid strides when slice nodes take negative index on dynamic input shapes.
Cause overflow error with a huge number “9223372036854776048”
Test Plan: CIs should pass.

Differential Revision: D74354663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153131
Approved by: https://github.com/laithsakka
2025-06-13 15:53:17 +00:00
a5938ff431 [BE][c10d/Store]add check in pyi (#155855) (#155865)
Summary:

"check" is already binded https://fburl.com/code/9lx1zf9o
which is also documented in https://docs.pytorch.org/docs/stable/distributed.html
add it to pyi for type checking

Test Plan:
skip

Rollback Plan:

Differential Revision: D76547457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155865
Approved by: https://github.com/fduwjj
2025-06-13 15:39:27 +00:00
bee93f9f0d Move glslc to cas to enable remote execution (#155832)
Meta:
`fbsource//xplat/caffe2:gen_torch_vulkan_spv_cpp` takes on average 2 min to build and is one of topmost slow targets in fbandroid.
See: https://fb.workplace.com/groups/2840058936242210/posts/4067730240141734

This target hat to run locally because it uses manifold backend for dotslash. This diff moves the `glslc` to cas backend so that it can run on RE.

Here are commands executed:
```
% manifold get dotslash_glslc/flat/glslc-linux-x86_64.tar.gz
% manifold get dotslash_glslc/flat/glslc-macos-v2024_4.tar.gz
% manifold get dotslash_glslc/flat/glslc-windows-v2024_3.tar

% ls
-rw-r--r--  1 navidq  staff   2.0M Jun 12 10:02 glslc-linux-x86_64.tar.gz
-rw-r--r--  1 navidq  staff   4.7M Jun 12 10:03 glslc-macos-v2024_4.tar.gz
-rw-r--r--  1 navidq  staff   4.4M Jun 12 10:03 glslc-windows-v2024_3.tar

% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-linux-x86_64.tar.gz
ea5d674e0e7e9782be3f5c309e3484732e5b3a331cbe3258f3e929002811627b:2072937

% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-macos-v2024_4.tar.gz
1331dc691835e4676832b7c21ef669083a3acc8856981583d0698192f466c51a:4898649

% frecli --use-case dotslash cas upload-blob --skip-find-missing glslc-windows-v2024_3.tar
76181fbb1ce5c62d0c905db26df3a64e999d0baff2e93270775921daa91e3a1a:4585984
```

Differential Revision: [D76513735](https://our.internmc.facebook.com/intern/diff/D76513735/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155832
Approved by: https://github.com/GregoryComer
2025-06-13 14:38:51 +00:00
ce6e0523f9 Revert "[BE] Raise NotImplementedError (#155470)"
This reverts commit 5ab6a3fb6fd37c542060c606edd4b95c7e3cae82.

Reverted https://github.com/pytorch/pytorch/pull/155470 on behalf of https://github.com/malfet due to foreach tests are failing on ROCm because we are not running the same on CUDA ([comment](https://github.com/pytorch/pytorch/pull/155470#issuecomment-2970592124))
2025-06-13 14:32:50 +00:00
3819584f12 [precompile] Implement PrecompileContext for recording precompile artifacts, integrate with CompilePackage (#154415)
This PR implements a basic interface and test for PrecompileContext, a special CacheArtifactManager specifically designed for precompile. The job of a PrecompileContext is to record things precompile needs as torch is compiling,  dump it all into bytes, and then stitch it back together into a cache of callables.

## Why use CacheArtifactManager?
Precompile needs a way to record various serializable data as torch is compiling. CacheArtifactManager already does this today pretty well, handling a lot of serialization and cache information. So we're reusing a bunch of that infrastructure directly.

## How is it different from CacheArtifactManager?
Unlike regular CacheArtifactManager, PrecompileContext needs to be able to take the recorded artifacts and stitch them together after deserialization, to create a single working callable.
Since PrecompileContext doesn't need the cache keys, the "key" field of PrecompileArtifacts can be used for metadata relating to how to stitch the individual functions being compiled together into a full callable. For example, on a given dynamo compile, if there are multiple functions (via graph breaks or recompiles) being compiled, MegaCache would organize it like so:

![image](https://github.com/user-attachments/assets/49a0a75b-1e7f-4d96-8d81-6769fe5a53ca)

Whereas we'd visualize PrecompileContext's result like so:

![image](https://github.com/user-attachments/assets/fcc0dd4e-dfbf-4b13-9c08-2e99b373180b)

For now, we just handle eager mode; in the diff above, I'll hook up the other backend artifacts from PrecompileContext.

After this PR, precompile consists of three main interfaces:

### CompilePackage
- Everything needed to run one torch.compile'd function (including graph breaks)
- `__init__(fn, cache_entry)` Initializes with a DynamoCacheEntry
- `install(backends)` load precompile artifacts into function's dynamo state with a dictionary of backends
- `cache_entry()` return a serializable cache entry to save

### DynamoStore
- Responsible for tracking CompilePackages on disk (and/or in memory)
- `load_package(path)`: load a package given a torch compiled function and a path to the cache artifact
- `save_package(package, path): Save a CompiledPackage to a path. Calls PrecompileContext to grab backend data
- `record_package(package)`: Record a package to PrecompileContext (for global serialization/deserialization)

### PrecompileContext
- Overarching context for serializing and deserializing precompile artifacts. Supports **global** and **local** setups.
- `serialize()`: (Global) serializes all artifacts in PrecompileContext into bytes
- `populate_caches(bytes)`: (Global) takes serialized bytes and puts them into DynamoStore (TODO)
- `serialize_artifact_by_key(key)`: (Local) serialize a single artifact by its cache key

<img width="1455" alt="image" src="https://github.com/user-attachments/assets/99b61330-7607-4763-bdbc-85b366e82cdd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154415
Approved by: https://github.com/zhxchen17
ghstack dependencies: #155118
2025-06-13 14:11:24 +00:00
b2fc9cfea1 [precompile] Add CompilePackage to serialize dynamo states. (#155118)
Adding a per torch.compile() object CompilePackage which tracks dynamo artifact. CompilePackage is considered a low level component and should not be directly exposed to end users. It has the following interface:

1. `CompilePackage.__init__()` which optionally takes previously serialized dynamo states.
     a. when `dynamo` argument is None, it will contruct a brand new CompilePackage object.
     b. when `dynamo` argument is not None, it will load a pre-compiled dynamo state.
2. `package.save()` which dumps the dynamo states into _DynamoCacheEntry.
3. `package.install(backends)` which will handle all the side-effectful global scope updates with compiled functions and resume functions.

This diff focus on making the low level mechanism for precompile. It will be left to upper level interface to use these API to build more user-facing frontend.

Differential Revision: [D75956538](https://our.internmc.facebook.com/intern/diff/D75956538/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155118
Approved by: https://github.com/jamesjwu

Co-authored-by: James Wu <jjwu@meta.com>
2025-06-13 13:54:10 +00:00
670dab6c63 [AOTI] Enable OP test__weight_int4pack_mm_with_scales_and_zeros in AOTI. (#155780)
The op test__weight_int4pack_mm_with_scales_and_zeros is for Intel GPU. It is functionally equivalent to the CUDA/CPU op test__weight_int4pack_mm (with the constraint that oneDNN only supports integer zero points, which is why we need this API). Since test__weight_int4pack_mm is already included in AOTI's fallback list, this PR adds support for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155780
Approved by: https://github.com/jansel
2025-06-13 11:12:13 +00:00
463fe36532 fix error message on specialization with Dim.DYNAMIC (#155738)
Previously specialization error messages would render sources that were pretty far from source-code names. E.g., given args named `x, y, zs`, the source for `y.size()[0]` would be rendered as `args[0][1].size()[0]`.

This is because we created artificial local names following `(args, kwargs)` structure instead of reusing signatures. This PR fixes that situation.

Basically we map prefixes of key paths that correspond to original arg names to root sources corresponding to those names; the rest of the key paths hang from these root sources.

Differential Revision: [D76461391](https://our.internmc.facebook.com/intern/diff/D76461391/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155738
Approved by: https://github.com/bobrenjc93
2025-06-13 10:33:46 +00:00
6abe450a6f [pytorch Aten] Delete unused duplicate clamp_stub, to avoid compile error (#154631)
I found the `clamp_stub` in `UnaryOps.h` is not used. And it's a duplicate of the `clamp_stub` in `TensorCompare.cpp`:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/TensorCompare.cpp#L313-L314

This diff/PR deletes it as this duplicate caused build failure for me:
```
ATen/native/UnaryOps.h:109:1: error: redefinition of 'clamp_stub_DECLARE_DISPATCH_type'
```

Differential Revision: [D75612521](https://our.internmc.facebook.com/intern/diff/D75612521/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154631
Approved by: https://github.com/Skylion007, https://github.com/cyyever, https://github.com/nautsimon
ghstack dependencies: #154589, #154591
2025-06-13 10:01:51 +00:00
1cc31b213d [MTIA Aten Backend] Migrate bitwise_and.Tensor_out (#154591)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

- Migrate where.self and where.self_out
- Add tests for dtype casting and shape broadcasting

Differential Revision: [D75578498](https://our.internmc.facebook.com/intern/diff/D75578498/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154591
Approved by: https://github.com/malfet
ghstack dependencies: #154589
2025-06-13 10:01:51 +00:00
65b9c13cce [Intel GPU] Enable safe softmax for XPU SDPA (#151999)
Fix https://github.com/intel/torch-xpu-ops/issues/1432#event-16899653975

When one row of Q*K attention score is masked with `-inf`, `softmax(score)` would output `NaN` for whole row which would cause model corruption.

With this new flag, it would output `0` for whole row which is aligned with Pytorch CPU/CUDA's behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151999
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-13 08:53:47 +00:00
56b03df6ac [MTIA Aten Backend] Migrate where.self and where.self_out (#154589)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

- Migrate where.self and where.self_out
- Add tests for dtype casting and shape broadcasting

Differential Revision: [D75577304](https://our.internmc.facebook.com/intern/diff/D75577304/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154589
Approved by: https://github.com/malfet
2025-06-13 08:25:13 +00:00
3d595fd559 update get start xpu (#151886)
update link and product name
add print to print ```torch.xpu.is_available()``` result in code snippet for user not using command python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151886
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-13 07:46:13 +00:00
53d06e18d9 [dynamo] add missing algorithm header (#154754)
Needed for `std::max(<initializer-list>)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154754
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2025-06-13 06:56:11 +00:00
6020440683 remove allow-untyped-defs from adaround_fake_quantize.py (#155621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155621
Approved by: https://github.com/Skylion007
2025-06-13 06:14:22 +00:00
99e99d5bfe [a2av] Test must allocate tensors symmetrically (#155835)
This is a requirement of most SHMEM backends. Otherwise, allocations may misalign across ranks.

In this PR, we make the (total) input size and output size a constant number, even though the split sizes are created random. (Previously we sum the splits up as input size, which creates misalignment in SHMEM heap across ranks).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155835
Approved by: https://github.com/fduwjj, https://github.com/fegin, https://github.com/Skylion007
ghstack dependencies: #155506
2025-06-13 06:05:38 +00:00
0860606729 [export] Add meta[val] to getattr nodes (#154934)
Fixes [P1830293318](https://www.internalfb.com/intern/paste/P1830293318/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154934
Approved by: https://github.com/yushangdi, https://github.com/muchulee8
2025-06-13 05:48:21 +00:00
25717da8c8 [BE] Don't run the same tests on 2xlarge and 4xlarge (#155859)
Also, speedup builds by moving them to 4xlarge instances

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155859
Approved by: https://github.com/ZainRizvi, https://github.com/wdvr
2025-06-13 05:40:20 +00:00
a87dfc7480 [symm_mem] Update CMakeList to reflect code moving a dedicated folder (#155823)
We moved all symm_mem code into a folder ([CudaDMAConnectivity](https://github.com/pytorch/pytorch/pull/155573)) but somehow forgot update for CudaDMAConnectivity in the CMakeList.

Users see errors: RuntimeError: DMA connectivity detector for cuda over nvlink is not available while torch.distributed.init_process_group(backend=backend). So this PR should fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155823
Approved by: https://github.com/Skylion007
2025-06-13 05:27:59 +00:00
70bb34929a Convert to .md: draft_export.rst, export.ir_spec.rst, fft.rst (#155567)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155020. This PR is split into 3 to pass sanity check.

Docs comparison (check out the 'new' whenever docs build)

1. draft_export ([old](https://docs.pytorch.org/docs/main/draft_export.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/draft_export.html))
2. export.ir_spec ([old](https://docs.pytorch.org/docs/main/export.ir_spec.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/export.ir_spec.html))
3. fft ([old](https://docs.pytorch.org/docs/main/fft.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155567/fft.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155567
Approved by: https://github.com/svekars
2025-06-13 05:19:43 +00:00
b878ca0c91 [cutlass backend] add fp8 to cutlass benchmark script (#155507)
Summary:
Add fp8.

Right now FP8 only allows fast_accum.

Test Plan:
```
Experiment group: _scaled_mm (8192x8192, 8192x8192) torch.float8_e4m3fn
+-----------------------+--------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | teraflops (TFLOPS) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
|         aten          | 967.1226739883423  | 1136.8895149998868 |  1.219131228979677   |         NA         |
|        triton         | 1764.6185159683228 |  623.08743664783   |  20.373826419003308  | 82.46067054670186  |
| triton_persistent_tma | 1769.0335512161255 | 621.5323768280928  |  20.48663099599071   | 82.91718297956578  |
|  cutlass_lvl_default  | 790.5075550079346  | 1390.8932568835019 |  13.788519630907103  | -18.26191482535096 |
|   cutlass_lvl_3332    | 803.7384748458862  | 1367.996757884245  |  226.81587297911756  | -16.89384434227684 |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
```

Rollback Plan:

Differential Revision: D76310809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155507
Approved by: https://github.com/ColinPeppler
2025-06-13 05:11:15 +00:00
2ba930d4ce Convert rst to markdown - profiler.rst #155031 (#155559)
Fixes https://github.com/pytorch/pytorch/issues/155031

* [profiler.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/profiler.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155559
Approved by: https://github.com/svekars
2025-06-13 05:02:54 +00:00
e8b3dfa7c0 convert jit_language_reference.rst to jit_language_reference.md (#155633)
Part of changes https://github.com/pytorch/pytorch/issues/155023 (parent PR https://github.com/pytorch/pytorch/pull/155429)

- converted jit_language_reference.rst to jit_language_reference.md

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155633
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 04:58:28 +00:00
3f65e38b73 Convert hub.rst to hub.md (#155483)
Part of changes https://github.com/pytorch/pytorch/issues/155023 (parent PR https://github.com/pytorch/pytorch/pull/155429)

@pytorchbot label "topic: docs"
@pytorchbot label "topic: not user facing"
@pytorchbot label docathon-h1-2025
@pytorchbot label module: docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155483
Approved by: https://github.com/svekars
2025-06-13 04:39:55 +00:00
0a6b66c881 Inductor comms reorder logs to tlparse (#155737)
Hacked test_inductor_collectives test to demonstrate this works:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/whc/de50ff33-f460-406b-bfa9-457e6e17395b/custom/-_0_0_0/reorder_communication_preserving_peak_memory_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Follow up: it would be nice to move the logging out of this pass and
into the broader comms pass loop, where the before/after each pass
visualization could be logged into the same tlparse file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155737
Approved by: https://github.com/bdhirsh
2025-06-13 02:59:42 +00:00
f151b20123 [AOTI] Remove the emit_current_arch_binary option (#155768)
Summary: Remove the option as generating fatbin with PTX only doesn't work on H100, so switch to always include one PTX and one SASS for fatbin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155768
Approved by: https://github.com/angelayi
2025-06-13 02:06:07 +00:00
020da74437 [Easy] Remove empty file (#155796)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155796
Approved by: https://github.com/malfet
ghstack dependencies: #155772
2025-06-13 01:42:11 +00:00
905b194a2e Replace device check of TORCH_INTERNAL_ASSERT with TORCH_CHECK (#155318)
Fixes #136849

## Test Result

```python
>>> import torch
>>> device = torch.cuda.device_count() + 1
>>> torch.cuda.current_stream(device) #  INTERNAL ASSERT FAILED
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1083, in current_stream
    streamdata = torch._C._cuda_getCurrentStream(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)

>>> torch.cuda.default_stream(device) #  INTERNAL ASSERT FAILED
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/cuda/__init__.py", line 1101, in default_stream
    streamdata = torch._C._cuda_getDefaultStream(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device index value 3 is out of index range [0, 2)

>>> torch.cuda.set_per_process_memory_fraction(0.5, device)  #  INTERNAL ASSERT FAILED
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/cuda/memory.py", line 193, in set_per_process_memory_fraction
    torch._C._cuda_setMemoryFraction(fraction, device)
RuntimeError: Allocator not initialized for device : did you call init?

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155318
Approved by: https://github.com/albanD
2025-06-13 01:20:19 +00:00
d7e657da35 pyfmt lint more torch/utils files (#155812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155812
Approved by: https://github.com/Skylion007
ghstack dependencies: #155782, #155783
2025-06-12 23:51:42 +00:00
4d3ecefda5 [aoti][mps] Use cpp sym-expr printer (#155646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155646
Approved by: https://github.com/desertfire
ghstack dependencies: #155752, #154287, #155582, #155583
2025-06-12 23:33:28 +00:00
2e65d72e1e [aoti][mps] Fix int/symint kernel args (#155583)
Integer arguments to mps kernels need to go through a different function, since `aoti_torch_mps_set_arg` only takes a Tensor. So I added a `aoti_torch_mps_set_arg_int`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155583
Approved by: https://github.com/desertfire
ghstack dependencies: #155752, #154287, #155582
2025-06-12 23:33:28 +00:00
ffbda61fbe [aoti][mps] Fix dynamic dispatch size (#155582)
In the case where we pass in a symint to the `dispatch` call, the compiler errors, so we need to cast the input to int64_t.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155582
Approved by: https://github.com/malfet
ghstack dependencies: #155752, #154287
2025-06-12 23:33:15 +00:00
a4ab392251 [aoti][mps] mps constants support (#154287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154287
Approved by: https://github.com/malfet
ghstack dependencies: #155752
2025-06-12 23:33:07 +00:00
8821a9dc4e [BE][aoti][mps] Fix tests to use common function (#155752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155752
Approved by: https://github.com/desertfire, https://github.com/malfet
2025-06-12 23:32:59 +00:00
5ab6a3fb6f [BE] Raise NotImplementedError (#155470)
When op is unimplemented for a specific dtype

Which makes more sense, than a RuntimeError

Example
```python
>>> import torch
>>> torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
NotImplementedError: "hardshrink_cpu" not implemented for 'Long'
```

release notes bc-breaking: After this release `NotImplementedError` exception will be raised when ATen operation is called on the combinaiton of input tensor dtypes it has not been implemented for

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155470
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-06-12 23:19:12 +00:00
d9b8369f39 fix warning spam for list indexing (#155815)
Per title, #154806 incorrectly placed a warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155815
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-06-12 23:07:24 +00:00
2903e5ad3c pyfmt lint more export files (#155783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155783
Approved by: https://github.com/Skylion007
ghstack dependencies: #155782
2025-06-12 23:04:11 +00:00
86b1116f22 pyfmt lint torch/_custom_op/* (#155782)
file torch/_custom_op/functional.py does not exisits
file torch/_custom_op/__init__.py is empty.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155782
Approved by: https://github.com/Skylion007
2025-06-12 23:04:11 +00:00
4cdbdcdbcf Switch to miniconda for ROCm CI (#155239)
Related to https://github.com/pytorch/pytorch/issues/148335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155239
Approved by: https://github.com/jeffdaily
2025-06-12 22:55:47 +00:00
f04fd4dc4e typing: allow integer in bitwise operations (#155704)
Fixes #155701 (false positives)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155704
Approved by: https://github.com/Skylion007, https://github.com/aorenste
2025-06-12 22:40:17 +00:00
938515fa75 [aoti] Update cshim for all backends (#155604)
Fixes https://github.com/pytorch/pytorch/issues/155349
`python torchgen/gen.py --update-aoti-c-shim` will now update all cpu/cuda/mps/xpu shims -- I verified this using `aten._print.default`, but didn't commit the changes since I'm not sure if we actually want to add this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155604
Approved by: https://github.com/desertfire, https://github.com/janeyx99
2025-06-12 22:10:58 +00:00
38bfd462b8 Use swap_tensors path in nn.Module.to for FakeTensor (#152539)
Fixes https://github.com/pytorch/pytorch/issues/148977

Differential Revision: [D76458023](https://our.internmc.facebook.com/intern/diff/D76458023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152539
Approved by: https://github.com/albanD
2025-06-12 22:08:21 +00:00
db01f1032f Support XPU in memory tracker (#150703)
This PR adds support for XPU devices to the distributed MemoryTracker tool, including unit test for XPU.

Specifically, this code adds tracking for a few alloc-related statistics for XPUCachingAllocator. It also adapts the existing memory tracker tool to be device agnostic, by getting the device module and recording the necessary memory stats. (I get the device module instead of using `torch.accelerator` methods, as that API is still in-progress.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150703
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/d4l3k
2025-06-12 21:33:52 +00:00
154a39bfbd basic compile support for grouped_mm (#153384)
grouped_mm is used in torchtitan, this adds just enough support in compile to allow inductor to lower it as a fallback kernel. I imagine that at some point in the future it may be valuable to get inductor to support templating grouped_mm, although this PR just provides basic support. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @ngimel @eellison

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153384
Approved by: https://github.com/eellison
2025-06-12 21:24:51 +00:00
f2b44424a1 [ROCm] Skip *_stress_cuda and test_ddp_apply_optim_in_backward* (#155724)
These tests are flaky on ROCm and have been skipped via Github issues, but the bot keeps closing the issues after not observing the failures for these tests in the rerun_disabled_tests runs (not sure why they don't fail there), and we have to keep reopening them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155724
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-06-12 21:18:04 +00:00
590fe4d2d7 Skip updating the default device distributed backend if already registered (#155320)
Motivation:

PyTorch maintain a `default_device_backend_map` https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L269 , which indicates the default distributed backend if no backend name is specified in user frontend (like `init_process_group`).

Currently, `"xpu": XCCL` is also in this `default_device_backend_map`. However,  if another process group name is registered as XPU distributed backend, it immediately replaces XCCL in this default map, which is not what we want.

Therefore, we would like to skip updating the default distributed backend if one is already registered in the map.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155320
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-06-12 21:17:06 +00:00
29391c7cf9 [ez] Mark linalg svd memory allocation test as serial b/c OOMing on cu128 (#155811)
9df2e8020f/1

8e8d4b13b0 (43980565863-box)

started OOMing after switching to cuda 12.8

Maybe b/c I made some changes fix the per process memory fraction so each proc has fewer memory
```
2025-06-12T15:29:50.4998758Z FAILED [0.0124s] test_linalg.py::TestLinalgCUDA::test_svd_memory_allocation_cuda_complex128 - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.10 GiB. GPU 0 has a total capacity of 7.43 GiB of which 6.85 GiB is free. Process 80272 has 68.75 MiB memory in use. Process 83346 has 68.75 MiB memory in use. Process 83365 has 374.75 MiB memory in use. Process 83384 has 70.75 MiB memory in use. 2.90 GiB allowed; Of the allocated memory 240.00 MiB is allocated by PyTorch, and 2.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155811
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman, https://github.com/eqy
2025-06-12 21:05:32 +00:00
2303 changed files with 35343 additions and 16800 deletions

View File

@ -3,9 +3,7 @@ set -eux -o pipefail
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
if [[ "$GPU_ARCH_VERSION" == *"12.9"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0"
fi

View File

@ -88,7 +88,6 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnvToolsExt.so.1",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
@ -108,9 +107,9 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"/usr/local/lib/libnvpl_blas_core.so.0",
]
if "128" in desired_cuda:
if "129" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.9",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]

View File

@ -39,6 +39,7 @@ RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt

View File

@ -1 +1 @@
f50bfa92602b45dca884a9e511e5d9ddbe8ba314
56392aa978594cc155fa8af48cd949f5b5f1823a

View File

@ -1 +1 @@
v2.26.5-1
v2.27.3-1

View File

@ -1 +1 @@
b0e26b7359c147b8aa0af686c20510fb9b15990a
ae324eeac8e102a2b40370e341460f3791353398

View File

@ -6,7 +6,7 @@ set -ex
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.anaconda.com/miniconda"
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]] || [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download" # @lint-ignore
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
fi
@ -64,6 +64,11 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# which is provided in libstdcxx 12 and up.
conda_install libstdcxx-ng=12.3.0 --update-deps -c conda-forge
# Miniforge installer doesn't install sqlite by default
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
conda_install sqlite
fi
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
conda_install "openblas==0.3.29=*openmp*"

View File

@ -26,6 +26,11 @@ Pin: release o=repo.radeon.com
Pin-Priority: 600
EOF
# we want the patch version of 6.4 instead
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
ROCM_VERSION="${ROCM_VERSION}.1"
fi
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
echo "deb [arch=amd64] https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
@ -67,19 +72,23 @@ EOF
# ROCm 6.3 had a regression where initializing static code objects had significant overhead
# ROCm 6.4 did not yet fix the regression, also HIP branch names are different
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]] || [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
HIP_BRANCH=rocm-6.3.x
VER_STR=6.3
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.3) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
VER_PATCH=.1
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
HIP_BRANCH=rocm-6.3.x
VER_STR=6.3
fi
# clr build needs CppHeaderParser but can only find it using conda's python
/opt/conda/bin/python -m pip install CppHeaderParser
git clone https://github.com/ROCm/HIP -b $HIP_BRANCH
HIP_COMMON_DIR=$(readlink -f HIP)
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}-statco-hotfix
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}${VER_PATCH}-statco-hotfix
mkdir -p clr/build
pushd clr/build
cmake .. -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR

View File

@ -26,7 +26,7 @@ ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unncessary python versions
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6

View File

@ -2,7 +2,7 @@ FROM quay.io/pypa/manylinux_2_28_aarch64 as base
ARG GCCTOOLSET_VERSION=13
# Language variabes
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
@ -64,7 +64,7 @@ RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM base as final
# remove unncessary python versions
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6

View File

@ -60,7 +60,7 @@ RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM openssl as final
# remove unncessary python versions
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6

View File

@ -120,11 +120,13 @@ RUN python3 -mpip install cmake==3.28.0
# so just build it from upstream repository.
# h5py is dependency of onnxruntime_training.
# h5py==3.11.0 builds with hdf5-devel 1.10.5 from repository.
# h5py 3.11.0 doesn't build with numpy >= 2.3.0.
# install newest flatbuffers version first:
# for some reason old version is getting pulled in otherwise.
# packaging package is required for onnxruntime wheel build.
RUN pip3 install flatbuffers && \
pip3 install h5py==3.11.0 && \
pip3 install cython 'pkgconfig>=1.5.5' 'setuptools>=77' 'numpy<2.3.0' && \
pip3 install --no-build-isolation h5py==3.11.0 && \
pip3 install packaging && \
git clone https://github.com/microsoft/onnxruntime && \
cd onnxruntime && git checkout v1.21.0 && \

View File

@ -90,10 +90,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.15.0
mypy==1.16.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.14.0
#Pinned versions: 1.16.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -382,3 +382,7 @@ cmake==4.0.0
tlparse==0.3.30
#Description: required for log parsing
cuda-bindings>=12.0,<13.0
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
#test that import: test_cuda.py

View File

@ -1 +1 @@
3.3.1
3.4.0

View File

@ -25,6 +25,7 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt

View File

@ -31,7 +31,6 @@ elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
# Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968
# shellcheck disable=SC2046
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
else
@ -98,6 +97,7 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
exit 1
fi
pushd "$PYTORCH_ROOT"
retry pip install -q cmake
python setup.py clean
retry pip install -qr requirements.txt
case ${DESIRED_PYTHON} in
@ -151,7 +151,7 @@ if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR --cmake
CMAKE_FRESH=1 python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
else
time CMAKE_ARGS=${CMAKE_ARGS[@]} \

View File

@ -57,6 +57,10 @@ case ${CUDA_VERSION} in
12.8|12.9)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
# WAR to resolve the ld error in libtorch build with CUDA 12.9
if [[ "$DESIRED_CUDA" == "cu129" && "$PACKAGE_TYPE" == "libtorch" ]]; then
TORCH_CUDA_ARCH_LIST="7.5;8.0;9.0;10.0;12.0+PTX"
fi
;;
12.6)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
@ -103,12 +107,11 @@ DEPS_SONAME=(
)
# CUDA_VERSION 12.6, 12.8
# CUDA_VERSION 12.6, 12.8, 12.9
if [[ $CUDA_VERSION == 12* ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
@ -124,7 +127,6 @@ if [[ $CUDA_VERSION == 12* ]]; then
"/usr/local/cuda/lib64/libcublasLt.so.12"
"/usr/local/cuda/lib64/libcusparseLt.so.0"
"/usr/local/cuda/lib64/libcudart.so.12"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/lib64/libnvrtc-builtins.so"
"/usr/local/cuda/lib64/libcufile.so.0"
@ -143,12 +145,16 @@ if [[ $CUDA_VERSION == 12* ]]; then
"libcublasLt.so.12"
"libcusparseLt.so.0"
"libcudart.so.12"
"libnvToolsExt.so.1"
"libnvrtc.so.12"
"libnvrtc-builtins.so"
"libcufile.so.0"
"libcufile_rdma.so.1"
)
# Add libnvToolsExt only if CUDA version is not 12.9
if [[ $CUDA_VERSION != 12.9* ]]; then
DEPS_LIST+=("/usr/local/cuda/lib64/libnvToolsExt.so.1")
DEPS_SONAME+=("libnvToolsExt.so.1")
fi
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(

View File

@ -92,6 +92,7 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
exit 1
fi
pushd "$PYTORCH_ROOT"
retry pip install -q cmake
python setup.py clean
retry pip install -qr requirements.txt
retry pip install -q numpy==2.0.1

View File

@ -187,7 +187,7 @@ do
OS_SO_FILES[${#OS_SO_FILES[@]}]=$file_name # Append lib to array
done
ARCH=$(echo $PYTORCH_ROCM_ARCH | sed 's/;/|/g') # Replace ; seperated arch list to bar for grep
ARCH=$(echo $PYTORCH_ROCM_ARCH | sed 's/;/|/g') # Replace ; separated arch list to bar for grep
# rocBLAS library files
ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library

View File

@ -15,6 +15,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *rocm* ]]; then
export PYTORCH_TEST_WITH_ROCM=1
fi
# TODO: Renable libtorch testing for MacOS, see https://github.com/pytorch/pytorch/issues/62598
# TODO: Reenable libtorch testing for MacOS, see https://github.com/pytorch/pytorch/issues/62598
# shellcheck disable=SC2034
BUILD_TEST_LIBTORCH=0

View File

@ -93,7 +93,7 @@ def check_lib_symbols_for_abi_correctness(lib: str) -> None:
f"Found pre-cxx11 symbols, but there shouldn't be any, see: {pre_cxx11_symbols[:100]}"
)
if num_cxx11_symbols < 100:
raise RuntimeError("Didn't find enought cxx11 symbols")
raise RuntimeError("Didn't find enough cxx11 symbols")
def main() -> None:

View File

@ -276,7 +276,7 @@ def smoke_test_cuda(
torch_nccl_version = ".".join(str(v) for v in torch.cuda.nccl.version())
print(f"Torch nccl; version: {torch_nccl_version}")
# Pypi dependencies are installed on linux ony and nccl is availbale only on Linux.
# Pypi dependencies are installed on linux only and nccl is available only on Linux.
if pypi_pkg_check == "enabled" and sys.platform in ["linux", "linux2"]:
compare_pypi_to_torch_versions(
"cudnn", find_pypi_package_version("nvidia-cudnn"), torch_cudnn_version

View File

@ -196,7 +196,7 @@ if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/mpi/latest/env/vars.sh
# Check XPU status before testing
xpu-smi discovery
timeout 30 xpu-smi discovery || true
fi
if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
@ -224,7 +224,7 @@ if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
export PYTORCH_TEST_WITH_ASAN=1
export PYTORCH_TEST_WITH_UBSAN=1
# TODO: Figure out how to avoid hard-coding these paths
export ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-15/bin/llvm-symbolizer
export ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-18/bin/llvm-symbolizer
export TORCH_USE_RTLD_GLOBAL=1
# NB: We load libtorch.so with RTLD_GLOBAL for UBSAN, unlike our
# default behavior.
@ -325,6 +325,8 @@ test_python_smoke() {
test_h100_distributed() {
# Distributed tests at H100
time python test/run_test.py --include distributed/_composable/test_composability/test_pp_composability.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
# This test requires multicast support
time python test/run_test.py --include distributed/_composable/fsdp/test_fully_shard_comm.py -k TestFullyShardAllocFromPG $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
@ -594,7 +596,9 @@ test_perf_for_dashboard() {
local device=cuda
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
if [[ "${TEST_CONFIG}" == *zen_cpu_x86* ]]; then
device=zen_cpu_x86
elif [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
device=cpu_x86
elif [[ "${TEST_CONFIG}" == *cpu_aarch64* ]]; then
device=cpu_aarch64
@ -1135,6 +1139,12 @@ test_custom_backend() {
test_custom_script_ops() {
echo "Testing custom script operators"
if [[ "$BUILD_ENVIRONMENT" == *s390x* ]]; then
echo "Skipping custom script operators until it's fixed"
return 0
fi
CUSTOM_OP_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/custom-op-build"
pushd test/custom_operator
cp -a "$CUSTOM_OP_BUILD" build

View File

@ -31,7 +31,7 @@ PYLONG_API_CHECK=$?
if [[ $PYLONG_API_CHECK == 0 ]]; then
echo "Usage of PyLong_{From,As}{Unsigned}Long API may lead to overflow errors on Windows"
echo "because \`sizeof(long) == 4\` and \`sizeof(unsigned long) == 4\`."
echo "Please include \"torch/csrc/utils/python_numbers.h\" and use the correspoding APIs instead."
echo "Please include \"torch/csrc/utils/python_numbers.h\" and use the corresponding APIs instead."
echo "PyLong_FromLong -> THPUtils_packInt32 / THPUtils_packInt64"
echo "PyLong_AsLong -> THPUtils_unpackInt (32-bit) / THPUtils_unpackLong (64-bit)"
echo "PyLong_FromUnsignedLong -> THPUtils_packUInt32 / THPUtils_packUInt64"

View File

@ -10,7 +10,7 @@ set PATH=C:\Program Files\CMake\bin;C:\Program Files\7-Zip;C:\ProgramData\chocol
:: able to see what our cl.exe commands are (since you can actually
:: just copy-paste them into a local Windows setup to just rebuild a
:: single file.)
:: log sizes are too long, but leaving this here incase someone wants to use it locally
:: log sizes are too long, but leaving this here in case someone wants to use it locally
:: set CMAKE_VERBOSE_MAKEFILE=1

View File

@ -52,7 +52,7 @@ if __name__ == "__main__":
if os.path.exists(debugger):
command_args = [debugger, "-o", "-c", "~*g; q"] + command_args
command_string = " ".join(command_args)
print("Reruning with traceback enabled")
print("Rerunning with traceback enabled")
print("Command:", command_string)
subprocess.run(command_args, check=False)
sys.exit(e.returncode)

View File

@ -65,7 +65,7 @@ for /F "usebackq delims=" %%i in (`python -c "import sys; print('{0[0]}{0[1]}'.f
if %PYVER% LSS 35 (
echo Warning: PyTorch for Python 2 under Windows is experimental.
echo Python x64 3.5 or up is recommended to compile PyTorch on Windows
echo Maybe you can create a virual environment if you have conda installed:
echo Maybe you can create a virtual environment if you have conda installed:
echo ^> conda create -n test python=3.6 pyyaml numpy
echo ^> activate test
)

View File

@ -206,7 +206,7 @@ if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 python setup.py bdist_wheel -d "$whl_tmp_dir"
echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
BUILD_PYTHON_ONLY=1 BUILD_LIBTORCH_WHL=0 python setup.py bdist_wheel -d "$whl_tmp_dir" --cmake
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 CMAKE_FRESH=1 python setup.py bdist_wheel -d "$whl_tmp_dir"
echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
else
python setup.py bdist_wheel -d "$whl_tmp_dir"

View File

@ -1,157 +0,0 @@
# Documentation: https://docs.microsoft.com/en-us/rest/api/azure/devops/build/?view=azure-devops-rest-6.0
import json
import os
import re
import sys
import time
import requests
AZURE_PIPELINE_BASE_URL = "https://aiinfra.visualstudio.com/PyTorch/"
AZURE_DEVOPS_PAT_BASE64 = os.environ.get("AZURE_DEVOPS_PAT_BASE64_SECRET", "")
PIPELINE_ID = "911"
PROJECT_ID = "0628bce4-2d33-499e-bac5-530e12db160f"
TARGET_BRANCH = os.environ.get("CIRCLE_BRANCH", "main")
TARGET_COMMIT = os.environ.get("CIRCLE_SHA1", "")
build_base_url = AZURE_PIPELINE_BASE_URL + "_apis/build/builds?api-version=6.0"
s = requests.Session()
s.headers.update({"Authorization": "Basic " + AZURE_DEVOPS_PAT_BASE64})
def submit_build(pipeline_id, project_id, source_branch, source_version):
print("Submitting build for branch: " + source_branch)
print("Commit SHA1: ", source_version)
run_build_raw = s.post(
build_base_url,
json={
"definition": {"id": pipeline_id},
"project": {"id": project_id},
"sourceBranch": source_branch,
"sourceVersion": source_version,
},
)
try:
run_build_json = run_build_raw.json()
except json.decoder.JSONDecodeError as e:
print(e)
print(
"Failed to parse the response. Check if the Azure DevOps PAT is incorrect or expired."
)
sys.exit(-1)
build_id = run_build_json["id"]
print("Submitted bulid: " + str(build_id))
print("Bulid URL: " + run_build_json["url"])
return build_id
def get_build(_id):
get_build_url = (
AZURE_PIPELINE_BASE_URL + f"/_apis/build/builds/{_id}?api-version=6.0"
)
get_build_raw = s.get(get_build_url)
return get_build_raw.json()
def get_build_logs(_id):
get_build_logs_url = (
AZURE_PIPELINE_BASE_URL + f"/_apis/build/builds/{_id}/logs?api-version=6.0"
)
get_build_logs_raw = s.get(get_build_logs_url)
return get_build_logs_raw.json()
def get_log_content(url):
resp = s.get(url)
return resp.text
def wait_for_build(_id):
build_detail = get_build(_id)
build_status = build_detail["status"]
while build_status == "notStarted":
print("Waiting for run to start: " + str(_id))
sys.stdout.flush()
try:
build_detail = get_build(_id)
build_status = build_detail["status"]
except Exception as e:
print("Error getting build")
print(e)
time.sleep(30)
print("Bulid started: ", str(_id))
handled_logs = set()
while build_status == "inProgress":
try:
print("Waiting for log: " + str(_id))
logs = get_build_logs(_id)
except Exception as e:
print("Error fetching logs")
print(e)
time.sleep(30)
continue
for log in logs["value"]:
log_id = log["id"]
if log_id in handled_logs:
continue
handled_logs.add(log_id)
print("Fetching log: \n" + log["url"])
try:
log_content = get_log_content(log["url"])
print(log_content)
except Exception as e:
print("Error getting log content")
print(e)
sys.stdout.flush()
build_detail = get_build(_id)
build_status = build_detail["status"]
time.sleep(30)
build_result = build_detail["result"]
print("Bulid status: " + build_status)
print("Bulid result: " + build_result)
return build_status, build_result
if __name__ == "__main__":
# Convert the branch name for Azure DevOps
match = re.search(r"pull/(\d+)", TARGET_BRANCH)
if match is not None:
pr_num = match.group(1)
SOURCE_BRANCH = f"refs/pull/{pr_num}/head"
else:
SOURCE_BRANCH = f"refs/heads/{TARGET_BRANCH}"
MAX_RETRY = 2
retry = MAX_RETRY
while retry > 0:
build_id = submit_build(PIPELINE_ID, PROJECT_ID, SOURCE_BRANCH, TARGET_COMMIT)
build_status, build_result = wait_for_build(build_id)
if build_result != "succeeded":
retry = retry - 1
if retry > 0:
print("Retrying... remaining attempt: " + str(retry))
# Wait a bit before retrying
time.sleep((MAX_RETRY - retry) * 120)
continue
else:
print("No more chance to retry. Giving up.")
sys.exit(-1)
else:
break

View File

@ -14,6 +14,7 @@ self-hosted-runner:
- linux.12xlarge
- linux.24xlarge
- linux.24xlarge.ephemeral
- linux.24xlarge.amd
- linux.arm64.2xlarge
- linux.arm64.2xlarge.ephemeral
- linux.arm64.m7g.4xlarge
@ -49,6 +50,7 @@ self-hosted-runner:
# Organization-wide AMD-hosted runners
# MI2xx runners
- linux.rocm.gpu
- linux.rocm.gpu.mi250
- linux.rocm.gpu.2
- linux.rocm.gpu.4
# MI300 runners

View File

@ -9,7 +9,7 @@ inputs:
arch-for-build-env:
description: |
arch to pass to build environment.
This is currently different than the arch name we use elswhere, which
This is currently different than the arch name we use elsewhere, which
should be fixed.
required: true
github-secret:

View File

@ -157,4 +157,4 @@ runs:
echo "Is keep-going label set? ${{ steps.filter.outputs.keep-going }}"
echo
echo "Renabled issues? ${{ steps.filter.outputs.reenabled-issues }}"
echo "Reenabled issues? ${{ steps.filter.outputs.reenabled-issues }}"

View File

@ -153,7 +153,7 @@ runs:
github-token: ${{ inputs.GITHUB_TOKEN }}
- name: Check for keep-going label and re-enabled test issues
# This uses the filter-test-configs action because it conviniently
# This uses the filter-test-configs action because it conveniently
# checks for labels and re-enabled test issues. It does not actually do
# any filtering. All filtering is done in the build step.
id: keep-going

View File

@ -13,6 +13,12 @@ inputs:
github-token:
description: GitHub token
required: true
job-id:
description: Job ID
required: true
job-name:
description: Job name
required: true
outputs:
reuse:
@ -30,8 +36,11 @@ runs:
continue-on-error: true
env:
GITHUB_TOKEN: ${{ inputs.github-token }}
JOB_ID: ${{ inputs.job-id }}
JOB_NAME: ${{ inputs.job-name }}
run: |
set -x
python3 -m pip install boto3==1.35.42
python3 ${GITHUB_ACTION_PATH}/reuse_old_whl.py \
--build-environment "${{ inputs.build-environment }}" \
--run-id "${{ inputs.run-id }}" \

View File

@ -1,6 +1,7 @@
import argparse
import os
import subprocess
import sys
from functools import lru_cache
from pathlib import Path
from typing import Any, cast, Optional, Union
@ -8,6 +9,14 @@ from typing import Any, cast, Optional, Union
import requests
REPO_ROOT = Path(__file__).resolve().parent.parent.parent.parent
sys.path.insert(0, str(REPO_ROOT))
from tools.stats.upload_metrics import emit_metric
sys.path.remove(str(REPO_ROOT)) # Clean up sys.path after import
FORCE_REBUILD_LABEL = "ci-force-rebuild"
@ -123,17 +132,26 @@ def check_changed_files(sha: str) -> bool:
# Return true if all the changed files are in the list of allowed files to
# be changed to reuse the old whl
# Removing any files is not allowed since rysnc will not remove files
# Removing files in the torch folder is not allowed since rsync will not
# remove files
removed_files = (
subprocess.check_output(
["git", "diff", "--name-only", sha, "HEAD", "--diff-filter=D"],
[
"git",
"diff",
"--name-only",
sha,
"HEAD",
"--diff-filter=D",
"--no-renames",
],
text=True,
stderr=subprocess.DEVNULL,
)
.strip()
.split()
)
if removed_files:
if any(file.startswith("torch/") for file in removed_files):
print(
f"Removed files between {sha} and HEAD: {removed_files}, cannot reuse old whl"
)
@ -141,7 +159,7 @@ def check_changed_files(sha: str) -> bool:
changed_files = (
subprocess.check_output(
["git", "diff", "--name-only", sha, "HEAD"],
["git", "diff", "--name-only", sha, "HEAD", "--no-renames"],
text=True,
stderr=subprocess.DEVNULL,
)
@ -308,7 +326,7 @@ def parse_args() -> argparse.Namespace:
return parser.parse_args()
def can_reuse_whl(args: argparse.Namespace) -> bool:
def can_reuse_whl(args: argparse.Namespace) -> tuple[bool, str]:
if args.github_ref and any(
args.github_ref.startswith(x)
for x in [
@ -318,37 +336,50 @@ def can_reuse_whl(args: argparse.Namespace) -> bool:
]
):
print("Release branch, rebuild whl")
return False
return (False, "Release branch")
if not check_changed_files(get_merge_base()):
print("Cannot use old whl due to the changed files, rebuild whl")
return False
return (False, "Changed files not allowed")
if check_labels_for_pr():
print(f"Found {FORCE_REBUILD_LABEL} label on PR, rebuild whl")
return False
return (False, "Found FORCE_REBUILD_LABEL on PR")
if check_issue_open():
print("Issue #153759 is open, rebuild whl")
return False
return (False, "Issue #153759 is open")
workflow_id = get_workflow_id(args.run_id)
if workflow_id is None:
print("No workflow ID found, rebuild whl")
return False
return (False, "No workflow ID found")
if not find_old_whl(workflow_id, args.build_environment, get_merge_base()):
print("No old whl found, rebuild whl")
return (False, "No old whl found")
# TODO: go backwards from merge base to find more runs
return False
return True
return (True, "Found old whl")
if __name__ == "__main__":
args = parse_args()
if can_reuse_whl(args):
reuse_whl, reason = can_reuse_whl(args)
if reuse_whl:
print("Reusing old whl")
unzip_artifact_and_replace_files()
set_output()
emit_metric(
"reuse_old_whl",
{
"reuse_whl": reuse_whl,
"reason": reason,
"build_environment": args.build_environment,
"merge_base": get_merge_base(),
"head_sha": get_head_sha(),
},
)

View File

@ -33,14 +33,14 @@ runs:
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Start docker if docker deamon is not running
- name: Start docker if docker daemon is not running
shell: bash
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
run: |
if systemctl is-active --quiet docker; then
echo "Docker daemon is running...";
else
echo "Starting docker deamon..." && sudo systemctl start docker;
echo "Starting docker daemon..." && sudo systemctl start docker;
fi
- name: Log in to ECR

View File

@ -1 +1 @@
4cb7f57d31b0b288696f09b89e890e5fac092eed
4e94321c54617dd738a05bfedfc28bc0fa635b5c

View File

@ -1 +1 @@
9a517a95f620dc0960d1feec97b50ac6ea7f1854
55a75404c9b75cd5fd62ab5d4deafc8c506b3af2

1
.github/labeler.yml vendored
View File

@ -116,7 +116,6 @@
"release notes: inductor (aoti)":
- torch/_C/_aoti.pyi
- torch/_dynamo/repro/aoti.py
- torch/_export/serde/aoti_schema.py
- torch/_higher_order_ops/aoti_call_delegate.py
- torch/_inductor/codegen/aoti_runtime/**
- torch/_inductor/codegen/aoti_hipify_utils.py

View File

@ -11,6 +11,7 @@ ciflow_push_tags:
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
- ciflow/inductor-perf-test-nightly-x86-zen
- ciflow/inductor-cu126
- ciflow/linux-aarch64
- ciflow/mps

View File

@ -10,5 +10,5 @@ lintrunner==0.10.7
ninja==1.10.0.post1
nvidia-ml-py==11.525.84
pyyaml==6.0
requests==2.32.2
requests==2.32.4
rich==10.9.0

View File

@ -1,5 +1,5 @@
boto3==1.35.42
cmake==3.25.*
cmake==3.27.*
expecttest==0.3.0
fbscribelogger==0.1.7
filelock==3.6.0

View File

@ -78,7 +78,7 @@ for pkg in /$WHEELHOUSE_DIR/*triton*.whl; do
echo "Copied $filepath to $patchedpath"
done
# Go through all required shared objects and see if any of our other objects are dependants. If so, replace so.ver wth so
# Go through all required shared objects and see if any of our other objects are dependants. If so, replace so.ver with so
for ((i=0;i<${#deps[@]};++i)); do
echo "replacing "${deps_soname[i]} ${patched[i]}
replace_needed_sofiles $PREFIX/$ROCM_LIB ${deps_soname[i]} ${patched[i]}

View File

@ -21,8 +21,11 @@ def read_triton_pin(device: str = "cuda") -> str:
return f.read().strip()
def read_triton_version() -> str:
with open(REPO_DIR / ".ci" / "docker" / "triton_version.txt") as f:
def read_triton_version(device: str = "cuda") -> str:
triton_version_file = "triton_version.txt"
if device == "xpu":
triton_version_file = "triton_xpu_version.txt"
with open(REPO_DIR / ".ci" / "docker" / triton_version_file) as f:
return f.read().strip()
@ -91,7 +94,7 @@ def build_triton(
patch_init_py(
triton_pythondir / "triton" / "__init__.py",
version=f"{version}",
expected_version=None,
expected_version=read_triton_version(device),
)
if device == "rocm":
@ -137,15 +140,19 @@ def main() -> None:
parser.add_argument("--py-version", type=str)
parser.add_argument("--commit-hash", type=str)
parser.add_argument("--with-clang-ldd", action="store_true")
parser.add_argument("--triton-version", type=str, default=read_triton_version())
parser.add_argument("--triton-version", type=str, default=None)
args = parser.parse_args()
triton_version = read_triton_version(args.device)
if args.triton_version:
triton_version = args.triton_version
build_triton(
device=args.device,
commit_hash=(
args.commit_hash if args.commit_hash else read_triton_pin(args.device)
),
version=args.triton_version,
version=triton_version,
py_version=args.py_version,
release=args.release,
with_clang_ldd=args.with_clang_ldd,

View File

@ -38,7 +38,7 @@ CPU_AARCH64_ARCH = ["cpu-aarch64"]
CPU_S390X_ARCH = ["cpu-s390x"]
CUDA_AARCH64_ARCHES = ["12.8-aarch64"]
CUDA_AARCH64_ARCHES = ["12.9-aarch64"]
PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
@ -53,7 +53,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | "
@ -70,7 +70,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
@ -86,8 +86,8 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'"
@ -222,9 +222,6 @@ def generate_libtorch_matrix(
if os == "linux":
arches += CUDA_ARCHES
arches += ROCM_ARCHES
# will add in a separate PR for 12.9
if "12.9" in arches:
arches.remove("12.9")
elif os == "windows":
arches += CUDA_ARCHES
if "12.9" in arches:

View File

@ -152,7 +152,7 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["12.6", "12.8", "12.9"],
arches=["12.6", "12.8", "12.9", "6.4"],
python_versions=["3.9"],
),
branches="main",

View File

@ -64,7 +64,7 @@ def fetch_url(
)
exception_message = (
"Is github alright?",
f"Recieved status code '{err.code}' when attempting to retrieve {url}:\n",
f"Received status code '{err.code}' when attempting to retrieve {url}:\n",
f"{err.reason}\n\nheaders={err.headers}",
)
raise RuntimeError(exception_message) from err

View File

@ -211,7 +211,7 @@ class GitRepo:
self, from_branch: str, to_branch: str
) -> tuple[list[str], list[str]]:
"""
Returns list of commmits that are missing in each other branch since their merge base
Returns list of commits that are missing in each other branch since their merge base
Might be slow if merge base is between two branches is pretty far off
"""
from_ref = self.rev_parse(from_branch)

View File

@ -12,7 +12,7 @@ BASE=${BASE:-HEAD~1}
HEAD=${HEAD:-HEAD}
ancestor=$(git merge-base "${BASE}" "${HEAD}")
echo "INFO: Checking aginst the following stats"
echo "INFO: Checking against the following stats"
(
set -x
git diff --stat=10000 "$ancestor" "${HEAD}" | sed '$d' > "${TMPFILE}"

View File

@ -347,26 +347,26 @@ class TestConfigFilter(TestCase):
{
"job_name": "a-ci-job",
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", runner: "macos"}]}',
"descripion": "Replicate each periodic mode in a different config",
"description": "Replicate each periodic mode in a different config",
},
{
"job_name": "a-ci-cuda11.8-job",
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", runner: "macos"}]}',
"descripion": "Replicate each periodic mode in a different config for a CUDA job",
"description": "Replicate each periodic mode in a different config for a CUDA job",
},
{
"job_name": "a-ci-rocm-job",
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", runner: "macos"}]}',
"descripion": "Replicate each periodic mode in a different config for a ROCm job",
"description": "Replicate each periodic mode in a different config for a ROCm job",
},
{
"job_name": "",
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", runner: "macos"}]}',
"descripion": "Empty job name",
"description": "Empty job name",
},
{
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", runner: "macos"}]}',
"descripion": "Missing job name",
"description": "Missing job name",
},
]

View File

@ -265,7 +265,7 @@ class DummyGitRepo(GitRepo):
return ["FakeCommitSha"]
def commit_message(self, ref: str) -> str:
return "super awsome commit message"
return "super awesome commit message"
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@ -433,7 +433,7 @@ class TestTryMerge(TestCase):
)
def test_cancelled_gets_ignored(self, *args: Any) -> None:
"""Tests that cancelled workflow does not override existing successfull status"""
"""Tests that cancelled workflow does not override existing successful status"""
pr = GitHubPR("pytorch", "pytorch", 110367)
conclusions = pr.get_checkrun_conclusions()
lint_checks = [name for name in conclusions.keys() if "Lint" in name]

View File

@ -634,7 +634,7 @@ def _revlist_to_prs(
raise RuntimeError(
f"Found an unexpected number of PRs mentioned in commit {rev}: "
f"{len(all_matches)}. This is probably because you are using an "
"old verion of ghstack. Please update ghstack and resubmit "
"old version of ghstack. Please update ghstack and resubmit "
"your PRs"
)

View File

@ -41,7 +41,7 @@ if "%CUVER_NODOT%" == "129" (
if "%CUVER_NODOT%" == "128" (
set CUDA_ARCH_LIST=-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
)
if "%CUVER_NODOT:~0,2%" == "12" if NOT "%CUVER_NODOT%" == "128" (
if "%CUVER_NODOT%" == "126" (
set CUDA_ARCH_LIST=-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90
)

View File

@ -15,4 +15,4 @@ call conda run -n %PYTHON_PREFIX% pip install wheel pybind11 certifi cython cmak
dir "%VC_INSTALL_PATH%"
call "%VC_INSTALL_PATH%\VC\Auxiliary\Build\vcvarsall.bat" x64
call conda run -n %PYTHON_PREFIX% python .github/scripts/build_triton_wheel.py --device=%BUILD_DEVICE% %TRITON_VERSION% %RELEASE%
call conda run -n %PYTHON_PREFIX% python .github/scripts/build_triton_wheel.py --device=%BUILD_DEVICE% %RELEASE%

View File

@ -171,7 +171,7 @@ jobs:
- name: Teardown XPU
uses: ./.github/actions/teardown-xpu
{%- else %}
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config) }}
steps:

View File

@ -158,6 +158,13 @@ jobs:
role-session-name: gha-linux-build
aws-region: us-east-1
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
- name: Check if can use old whl build
id: use-old-whl
uses: ./.github/actions/reuse-old-whl
@ -166,6 +173,8 @@ jobs:
build-environment: ${{ inputs.build-environment }}
run-id: ${{ github.run_id }}
github-token: ${{ secrets.GITHUB_TOKEN }}
job-id: ${{ steps.get-job-id.outputs.job-id }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Calculate docker image
id: calculate-docker-image
@ -181,7 +190,7 @@ jobs:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
tag=${ECR_DOCKER_IMAGE##*:}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
@ -194,13 +203,6 @@ jobs:
id: parse-ref
run: .github/scripts/parse_ref.py
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
# Apply the filter logic to the build step too if the test-config label is already there
- name: Select all requested test configurations (if the test matrix is available)
id: filter

View File

@ -131,7 +131,7 @@ jobs:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
tag=${ECR_DOCKER_IMAGE##*:}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
@ -207,7 +207,7 @@ jobs:
run: .github/scripts/parse_ref.py
- name: Check for keep-going label and re-enabled test issues
# This uses the filter-test-configs action because it conviniently
# This uses the filter-test-configs action because it conveniently
# checks for labels and re-enabled test issues. It does not actually do
# any filtering. All filtering is done in the build step.
id: keep-going
@ -386,6 +386,7 @@ jobs:
- name: Upload the benchmark results
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@main
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
benchmark-results-dir: test/test-reports
dry-run: false

View File

@ -153,7 +153,7 @@ jobs:
run: .github/scripts/parse_ref.py
- name: Check for keep-going label and re-enabled test issues
# This uses the filter-test-configs action because it conviniently
# This uses the filter-test-configs action because it conveniently
# checks for labels and re-enabled test issues. It does not actually do
# any filtering. All filtering is done in the build step.
id: keep-going

View File

@ -150,7 +150,7 @@ jobs:
run: .github/scripts/parse_ref.py
- name: Check for keep-going label and re-enabled test issues
# This uses the filter-test-configs action because it conviniently
# This uses the filter-test-configs action because it conveniently
# checks for labels and re-enabled test issues. It does not actually do
# any filtering. All filtering is done in the build step.
id: keep-going

View File

@ -7,7 +7,7 @@ on:
required: false
type: string
description: |
List of experiments for this workfow. If not defined, all default experiments are included.
List of experiments for this workflow. If not defined, all default experiments are included.
opt_out_experiments:
required: false
type: string

View File

@ -158,7 +158,7 @@ jobs:
uses: ./.github/actions/download-td-artifacts
- name: Check for keep-going label and re-enabled test issues
# This uses the filter-test-configs action because it conviniently
# This uses the filter-test-configs action because it conveniently
# checks for labels and re-enabled test issues. It does not actually do
# any filtering. All filtering is done in the build step.
id: keep-going

View File

@ -105,7 +105,7 @@ jobs:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
tag=${ECR_DOCKER_IMAGE##*:}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
@ -147,7 +147,7 @@ jobs:
run: .github/scripts/parse_ref.py
- name: Check for keep-going label and re-enabled test issues
# This uses the filter-test-configs action because it conviniently
# This uses the filter-test-configs action because it conveniently
# checks for labels and re-enabled test issues. It does not actually do
# any filtering. All filtering is done in the build step.
id: keep-going

View File

@ -29,7 +29,7 @@ concurrency:
jobs:
build-linux-magma-rocm:
if: github.repository_owner == 'pytorch'
runs-on: linux.12xlarge
runs-on: linux.2xlarge
permissions:
id-token: write
strategy:

View File

@ -53,7 +53,7 @@ jobs:
docker tag "${CREATED_FULL_DOCKER_IMAGE_NAME}" "${DOCKER_IMAGE_NAME_PREFIX}-${GIT_COMMIT_SHA}"
docker tag "${CREATED_FULL_DOCKER_IMAGE_NAME}" "${DOCKER_IMAGE_NAME_PREFIX}-${CI_FOLDER_SHA}"
# Prety sure Github will mask tokens and I'm not sure if it will even be
# Pretty sure Github will mask tokens and I'm not sure if it will even be
# printed due to pipe, but just in case
set +x
if [[ "${WITH_PUSH:-false}" == "true" ]]; then

View File

@ -156,9 +156,8 @@ jobs:
fi
if [[ "${BUILD_DEVICE}" == xpu ]]; then
TRITON_VERSION=$(cat .ci/docker/triton_xpu_version.txt)
docker exec -t "${container_name}" bash -c "dnf install -y gcc-toolset-13-gcc-c++"
docker exec -t "${container_name}" bash -c "source /opt/rh/gcc-toolset-13/enable && ${PYTHON_EXECUTABLE} /pytorch/.github/scripts/build_triton_wheel.py --device=$BUILD_DEVICE --triton-version=$TRITON_VERSION $RELEASE"
docker exec -t "${container_name}" bash -c "source /opt/rh/gcc-toolset-13/enable && ${PYTHON_EXECUTABLE} /pytorch/.github/scripts/build_triton_wheel.py --device=$BUILD_DEVICE $RELEASE"
else
docker exec -t "${container_name}" bash -c "${PYTHON_EXECUTABLE} /pytorch/.github/scripts/build_triton_wheel.py --device=$BUILD_DEVICE $RELEASE $WITH_CLANG_LDD"
fi
@ -255,11 +254,6 @@ jobs:
if [[ "${IS_RELEASE_TAG}" == true ]]; then
export RELEASE="--release"
fi
export TRITON_VERSION=""
if [[ "${{ matrix.device }}" == "xpu" ]]; then
triton_version=$(cat .ci/docker/ci_commit_pins/triton-xpu.txt)
export TRITON_VERSION="--triton-version=${triton_version}"
fi
.github/scripts/windows/build_triton.bat
mkdir -p "${RUNNER_TEMP}/artifacts/"
mv ./*.whl "${RUNNER_TEMP}/artifacts/"

View File

@ -156,7 +156,7 @@ jobs:
docker push ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}"
# Please note, here we ned to pin specific verison of CUDA as with latest label
# Please note, here we need to pin specific version of CUDA as with latest label
if [[ ${CUDA_VERSION_SHORT} == "${STABLE_CUDA_VERSION}" ]]; then
docker tag ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}" \
ghcr.io/pytorch/pytorch-nightly:latest

View File

@ -115,7 +115,7 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_9-cuda-aarch64-12_8-build:
manywheel-py3_9-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
@ -124,41 +124,41 @@ jobs:
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cuda-aarch64-12_8
build_name: manywheel-py3_9-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda-aarch64-12_8-upload: # Uploading
manywheel-py3_9-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_9-cuda-aarch64-12_8-build
needs: manywheel-py3_9-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
use_split_build: False
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda-aarch64-12_8
build_name: manywheel-py3_9-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
@ -231,7 +231,7 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda-aarch64-12_8-build:
manywheel-py3_10-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
@ -240,41 +240,41 @@ jobs:
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
use_split_build: False
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_8
build_name: manywheel-py3_10-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda-aarch64-12_8-upload: # Uploading
manywheel-py3_10-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_10-cuda-aarch64-12_8-build
needs: manywheel-py3_10-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
use_split_build: False
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda-aarch64-12_8
build_name: manywheel-py3_10-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
@ -347,7 +347,7 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda-aarch64-12_8-build:
manywheel-py3_11-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
@ -356,41 +356,41 @@ jobs:
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
use_split_build: False
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_8
build_name: manywheel-py3_11-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda-aarch64-12_8-upload: # Uploading
manywheel-py3_11-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda-aarch64-12_8-build
needs: manywheel-py3_11-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda-aarch64-12_8
build_name: manywheel-py3_11-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
@ -463,7 +463,7 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda-aarch64-12_8-build:
manywheel-py3_12-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
@ -472,41 +472,41 @@ jobs:
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
use_split_build: False
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_8
build_name: manywheel-py3_12-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda-aarch64-12_8-upload: # Uploading
manywheel-py3_12-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-cuda-aarch64-12_8-build
needs: manywheel-py3_12-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
use_split_build: False
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda-aarch64-12_8
build_name: manywheel-py3_12-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
@ -579,7 +579,7 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda-aarch64-12_8-build:
manywheel-py3_13-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
@ -588,41 +588,41 @@ jobs:
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
use_split_build: False
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_8
build_name: manywheel-py3_13-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda-aarch64-12_8-upload: # Uploading
manywheel-py3_13-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13-cuda-aarch64-12_8-build
needs: manywheel-py3_13-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
use_split_build: False
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cuda-aarch64-12_8
build_name: manywheel-py3_13-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
@ -695,7 +695,7 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda-aarch64-12_8-build:
manywheel-py3_13t-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
@ -704,41 +704,41 @@ jobs:
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_8
build_name: manywheel-py3_13t-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda-aarch64-12_8-upload: # Uploading
manywheel-py3_13t-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda-aarch64-12_8-build
needs: manywheel-py3_13t-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda-aarch64-12_8
build_name: manywheel-py3_13t-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml

View File

@ -248,6 +248,74 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda12_9-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: libtorch-cuda12_9-shared-with-deps-release
build_environment: linux-binary-libtorch
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-cuda12_9-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cuda12_9-shared-with-deps-release-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-cuda12_9-shared-with-deps-release
build_environment: linux-binary-libtorch
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8 and 12.9 build need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-cuda12_9-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_9-shared-with-deps-release-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-cuda12_9-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_3-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -274,7 +342,7 @@ jobs:
needs:
- libtorch-rocm6_3-shared-with-deps-release-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
@ -388,7 +456,7 @@ jobs:
needs:
- libtorch-rocm6_4-shared-with-deps-release-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch

View File

@ -61,7 +61,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_6-test: # Testing
@ -108,7 +108,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_8-test: # Testing
@ -155,7 +155,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_9-test: # Testing
@ -182,3 +182,95 @@ jobs:
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8 and 12.9 build need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-rocm6_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: 6.4
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-rocm6_4
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-rocm6_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-rocm6_4-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: 6.4
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
use_split_build: False
DESIRED_PYTHON: "3.9"
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: manywheel-py3_9-rocm6_4
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: manylinux2_28-builder
custom-tag-prefix: rocm6.4
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm

View File

@ -131,7 +131,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_6-test: # Testing
@ -200,7 +200,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_8-test: # Testing
@ -269,7 +269,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_9-test: # Testing
@ -345,7 +345,7 @@ jobs:
needs:
- manywheel-py3_9-rocm6_3-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
@ -459,7 +459,7 @@ jobs:
needs:
- manywheel-py3_9-rocm6_4-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
@ -744,7 +744,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_6-test: # Testing
@ -813,7 +813,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_8-test: # Testing
@ -882,7 +882,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_9-test: # Testing
@ -958,7 +958,7 @@ jobs:
needs:
- manywheel-py3_10-rocm6_3-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
@ -1072,7 +1072,7 @@ jobs:
needs:
- manywheel-py3_10-rocm6_4-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
@ -1357,7 +1357,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_6-test: # Testing
@ -1494,7 +1494,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_8-test: # Testing
@ -1563,7 +1563,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_9-test: # Testing
@ -1639,7 +1639,7 @@ jobs:
needs:
- manywheel-py3_11-rocm6_3-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
@ -1753,7 +1753,7 @@ jobs:
needs:
- manywheel-py3_11-rocm6_4-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
@ -2038,7 +2038,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_6-test: # Testing
@ -2107,7 +2107,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_8-test: # Testing
@ -2176,7 +2176,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_9-test: # Testing
@ -2252,7 +2252,7 @@ jobs:
needs:
- manywheel-py3_12-rocm6_3-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
@ -2366,7 +2366,7 @@ jobs:
needs:
- manywheel-py3_12-rocm6_4-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
@ -2651,7 +2651,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_6-test: # Testing
@ -2720,7 +2720,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_8-test: # Testing
@ -2789,7 +2789,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_9-test: # Testing
@ -2865,7 +2865,7 @@ jobs:
needs:
- manywheel-py3_13-rocm6_3-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
@ -2979,7 +2979,7 @@ jobs:
needs:
- manywheel-py3_13-rocm6_4-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
@ -3264,7 +3264,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_6-test: # Testing
@ -3333,7 +3333,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_8-test: # Testing
@ -3402,7 +3402,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_9-test: # Testing
@ -3478,7 +3478,7 @@ jobs:
needs:
- manywheel-py3_13t-rocm6_3-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
@ -3592,7 +3592,7 @@ jobs:
needs:
- manywheel-py3_13t-rocm6_4-build
- get-label-type
runs-on: linux.rocm.gpu
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch

View File

@ -25,15 +25,15 @@ jobs:
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-cuda12_6-py3_10-gcc11-sm90-build:
name: linux-focal-cuda12.6-py3.10-gcc11-sm90
linux-jammy-cuda12_8-py3_10-gcc11-sm90-build:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm90
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: "linux.12xlarge"
build-environment: linux-focal-cuda12.6-py3.10-gcc11-sm90
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '9.0'
test-matrix: |
{ include: [
@ -41,13 +41,13 @@ jobs:
]}
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc11-sm90-test:
name: linux-focal-cuda12.6-py3.10-gcc11-sm90
linux-jammy-cuda12_8-py3_10-gcc11-sm90-test:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm90
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_6-py3_10-gcc11-sm90-build
- linux-jammy-cuda12_8-py3_10-gcc11-sm90-build
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc11-sm90
docker-image: ${{ needs.linux-focal-cuda12_6-py3_10-gcc11-sm90-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_10-gcc11-sm90-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build.outputs.test-matrix }}
secrets: inherit

View File

@ -2,7 +2,7 @@ name: inductor-perf-nightly-h100
on:
schedule:
- cron: 15 0,8,16 * * 1-6
- cron: 15 0,4,8,12,16,20 * * 1-6
- cron: 0 7 * * 0
# NB: GitHub has an upper limit of 10 inputs here, so before we can sort it
# out, let try to run torchao cudagraphs_low_precision as part of cudagraphs
@ -94,18 +94,22 @@ jobs:
{ config: "inductor_huggingface_perf_cuda_h100", shard: 3, num_shards: 5, runner: "linux.aws.h100" },
{ config: "inductor_huggingface_perf_cuda_h100", shard: 4, num_shards: 5, runner: "linux.aws.h100" },
{ config: "inductor_huggingface_perf_cuda_h100", shard: 5, num_shards: 5, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 1, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 2, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 3, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 4, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 5, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 6, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 1, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 2, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 3, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 4, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 5, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 6, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 1, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 2, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 3, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 4, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 5, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 6, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 7, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 1, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 2, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 3, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 4, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 5, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 6, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 7, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 8, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 9, num_shards: 9, runner: "linux.aws.h100" },
]}
selected-test-configs: ${{ inputs.benchmark_configs }}
secrets: inherit
@ -114,7 +118,7 @@ jobs:
name: cuda12.8-py3.10-gcc9-sm90
uses: ./.github/workflows/_linux-test.yml
needs: build
if: github.event.schedule == '15 0,8,16 * * 1-6'
if: github.event.schedule == '15 0,4,8,12,16,20 * * 1-6'
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm90
dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-cppwrapper-true-aotinductor-true-freezing_cudagraphs-true-cudagraphs_low_precision-true

View File

@ -0,0 +1,129 @@
name: inductor-perf-nightly-x86-zen
on:
push:
tags:
- ciflow/inductor-perf-test-nightly-x86-zen/*
schedule:
# - cron: 0 7 * * 1-6
# - cron: 0 7 * * 0
# Does not perform max_autotune on CPU, so skip the weekly run setup
- cron: 0 7 * * *
# NB: GitHub has an upper limit of 10 inputs here
workflow_dispatch:
inputs:
training:
# CPU for training is not typical, but leave the option open here
description: Run training (off by default)?
required: false
type: boolean
default: false
inference:
description: Run inference (on by default)?
required: false
type: boolean
default: true
default:
description: Run inductor_default?
required: false
type: boolean
default: true
dynamic:
description: Run inductor_dynamic_shapes?
required: false
type: boolean
default: false
cppwrapper:
description: Run inductor_cpp_wrapper?
required: false
type: boolean
default: false
aotinductor:
description: Run aot_inductor for inference?
required: false
type: boolean
default: false
benchmark_configs:
description: The list of configs used the benchmark
required: false
type: string
default: inductor_huggingface_perf_zen_cpu_x86,inductor_timm_perf_zen_cpu_x86,inductor_torchbench_perf_zen_cpu_x86
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
linux-jammy-zen-cpu-py3_9-gcc11-inductor-build:
name: linux-jammy-zen-cpu-py3.9-gcc11-inductor
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.9-gcc11-build
docker-image-name: ci-image:pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_zen_cpu_x86", shard: 1, num_shards: 3, runner: "linux.24xlarge.amd" },
{ config: "inductor_huggingface_perf_zen_cpu_x86", shard: 2, num_shards: 3, runner: "linux.24xlarge.amd" },
{ config: "inductor_huggingface_perf_zen_cpu_x86", shard: 3, num_shards: 3, runner: "linux.24xlarge.amd" },
{ config: "inductor_timm_perf_zen_cpu_x86", shard: 1, num_shards: 5, runner: "linux.24xlarge.amd" },
{ config: "inductor_timm_perf_zen_cpu_x86", shard: 2, num_shards: 5, runner: "linux.24xlarge.amd" },
{ config: "inductor_timm_perf_zen_cpu_x86", shard: 3, num_shards: 5, runner: "linux.24xlarge.amd" },
{ config: "inductor_timm_perf_zen_cpu_x86", shard: 4, num_shards: 5, runner: "linux.24xlarge.amd" },
{ config: "inductor_timm_perf_zen_cpu_x86", shard: 5, num_shards: 5, runner: "linux.24xlarge.amd" },
{ config: "inductor_torchbench_perf_zen_cpu_x86", shard: 1, num_shards: 4, runner: "linux.24xlarge.amd" },
{ config: "inductor_torchbench_perf_zen_cpu_x86", shard: 2, num_shards: 4, runner: "linux.24xlarge.amd" },
{ config: "inductor_torchbench_perf_zen_cpu_x86", shard: 3, num_shards: 4, runner: "linux.24xlarge.amd" },
{ config: "inductor_torchbench_perf_zen_cpu_x86", shard: 4, num_shards: 4, runner: "linux.24xlarge.amd" },
]}
selected-test-configs: ${{ inputs.benchmark_configs }}
secrets: inherit
linux-jammy-zen-cpu-py3_9-gcc11-inductor-test-nightly:
name: linux-jammy-zen-cpu-py3.9-gcc11-inductor
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-zen-cpu-py3_9-gcc11-inductor-build
if: github.event.schedule == '0 7 * * *'
with:
build-environment: linux-jammy-py3.9-gcc11-build
dashboard-tag: training-false-inference-true-default-true-dynamic-true-cppwrapper-true-aotinductor-true
docker-image: ${{ needs.linux-jammy-zen-cpu-py3_9-gcc11-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-zen-cpu-py3_9-gcc11-inductor-build.outputs.test-matrix }}
timeout-minutes: 720
# disable monitor in perf tests
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4
secrets: inherit
linux-jammy-zen-cpu-py3_9-gcc11-inductor-test:
name: linux-jammy-zen-cpu-py3.9-gcc11-inductor
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-zen-cpu-py3_9-gcc11-inductor-build
if: github.event_name == 'workflow_dispatch'
with:
build-environment: linux-jammy-py3.9-gcc11-build
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-cppwrapper-${{ inputs.cppwrapper }}-aotinductor-${{ inputs.aotinductor }}
docker-image: ${{ needs.linux-jammy-zen-cpu-py3_9-gcc11-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-zen-cpu-py3_9-gcc11-inductor-build.outputs.test-matrix }}
timeout-minutes: 720
# disable monitor in perf tests
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4
secrets: inherit

View File

@ -1,6 +1,6 @@
# Workflow: Inductor Unit Test
# 1. runs unit tests for inductor.
# 2. perfoms daily memory leak checks and reruns of disabled tests, scheduled at `29 8 * * *`.
# 2. performs daily memory leak checks and reruns of disabled tests, scheduled at `29 8 * * *`.
name: inductor-unittest
on:

View File

@ -34,13 +34,9 @@ jobs:
runner_prefix: ${{ needs.get-label-type.outputs.label-type }}
build-environment: linux-jammy-aarch64-py3.10
docker-image-name: ci-image:pytorch-linux-jammy-aarch64-py3.10-gcc11
runner: linux.arm64.2xlarge
runner: linux.arm64.m7g.4xlarge
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge" },
{ config: "default", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.m7g.4xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.m7g.4xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.m7g.4xlarge" },

View File

@ -120,14 +120,14 @@ jobs:
]}
secrets: inherit
linux-jammy-py3_10-clang15-asan-build:
name: linux-jammy-py3.10-clang15-asan
linux-jammy-py3_10-clang18-asan-build:
name: linux-jammy-py3.10-clang18-asan
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang15-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang15-asan
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
@ -141,16 +141,16 @@ jobs:
secrets: inherit
linux-jammy-py3_10-clang15-asan-test:
name: linux-jammy-py3.10-clang15-asan
linux-jammy-py3_10-clang18-asan-test:
name: linux-jammy-py3.10-clang18-asan
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-py3_10-clang15-asan-build
- linux-jammy-py3_10-clang18-asan-build
- target-determination
with:
build-environment: linux-jammy-py3.10-clang15-asan
docker-image: ${{ needs.linux-jammy-py3_10-clang15-asan-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3_10-clang15-asan-build.outputs.test-matrix }}
build-environment: linux-jammy-py3.10-clang18-asan
docker-image: ${{ needs.linux-jammy-py3_10-clang18-asan-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3_10-clang18-asan-build.outputs.test-matrix }}
sync-tag: asan-test
secrets: inherit

View File

@ -133,14 +133,14 @@ jobs:
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-py3_10-clang15-asan-build:
name: linux-jammy-py3.10-clang15-asan
linux-jammy-py3_10-clang18-asan-build:
name: linux-jammy-py3.10-clang18-asan
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang15-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang15-asan
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan
test-matrix: |
{ include: [
{ config: "slow", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
@ -150,15 +150,15 @@ jobs:
sync-tag: asan-build
secrets: inherit
linux-jammy-py3_10-clang15-asan-test:
name: linux-jammy-py3.10-clang15-asan
linux-jammy-py3_10-clang18-asan-test:
name: linux-jammy-py3.10-clang18-asan
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-py3_10-clang15-asan-build
- linux-jammy-py3_10-clang18-asan-build
- target-determination
with:
build-environment: linux-jammy-py3.10-clang15-asan
docker-image: ${{ needs.linux-jammy-py3_10-clang15-asan-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3_10-clang15-asan-build.outputs.test-matrix }}
build-environment: linux-jammy-py3.10-clang18-asan
docker-image: ${{ needs.linux-jammy-py3_10-clang18-asan-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3_10-clang18-asan-build.outputs.test-matrix }}
sync-tag: asan-test
secrets: inherit

View File

@ -46,7 +46,7 @@ jobs:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
tag=${ECR_DOCKER_IMAGE##*:}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image

View File

@ -102,7 +102,7 @@ jobs:
s3-prefix: merges/${{ github.repository }}/${{ github.event.client_payload.pr_num }}/${{ github.event.client_payload.comment_id }}/${{ github.run_id }}
path: merge_record.json
# We want newer merge commands to supercede old ones
# We want newer merge commands to supersede old ones
concurrency:
group: try-merge-${{ github.event.client_payload.pr_num }}
cancel-in-progress: true

View File

@ -154,21 +154,21 @@ init_command = [
'numpy==1.26.4 ; python_version >= "3.9" and python_version <= "3.11"',
'numpy==2.1.0 ; python_version >= "3.12"',
'expecttest==0.3.0',
'mypy==1.15.0',
'mypy==1.16.0',
'sympy==1.13.3',
'types-requests==2.27.25',
'types-PyYAML==6.0.7',
'types-pyyaml==6.0.1',
'types-tabulate==0.8.8',
'types-protobuf==5.29.1.20250403',
'types-pkg-resources==0.1.3',
'types-Jinja2==2.11.9',
'types-setuptools==79.0.0.20250422',
'types-jinja2==2.11.9',
'types-colorama==0.4.6',
'filelock==3.13.1',
'junitparser==2.1.1',
'rich==10.9.0',
'pyyaml==6.0.1',
'optree==0.13.0',
'dataclasses_json==0.6.7',
'dataclasses-json==0.6.7',
'pandas==2.2.3',
]
@ -1134,6 +1134,45 @@ init_command = [
'PyYAML==6.0.1',
]
[[linter]]
code = 'CODESPELL'
command = [
'python3',
'tools/linter/adapters/codespell_linter.py',
'--',
'@{{PATHSFILE}}'
]
include_patterns = [
'**',
]
exclude_patterns = [
# We don't care too much about files in this directory, don't enforce
# spelling on them
'caffe2/**',
'fb/**',
'**/fb/**',
'third_party/**',
'test/dynamo/cpython/**',
'torch/_vendor/**',
'torch/_inductor/fx_passes/serialized_patterns/**',
'torch/_inductor/autoheuristic/artifacts/**',
# These files are all grandfathered in, feel free to remove from this list
# as necessary
'aten/**',
'docs/**',
'functorch/**',
'scripts/**',
'test/**',
'torch/**',
]
init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'codespell[toml]==2.4.1',
]
is_formatter = true
# usort + ruff-format
[[linter]]
code = 'PYFMT'
@ -1267,10 +1306,6 @@ exclude_patterns = [
'test/test_unary_ufuncs.py',
'test/test_vulkan.py',
'torch/_awaits/__init__.py',
'torch/_custom_op/__init__.py',
'torch/_custom_op/autograd.py',
'torch/_custom_op/functional.py',
'torch/_custom_op/impl.py',
'torch/_export/__init__.py',
'torch/_export/constraints.py',
'torch/_export/db/__init__.py',
@ -1308,17 +1343,6 @@ exclude_patterns = [
'torch/_export/db/examples/type_reflection_method.py',
'torch/_export/db/gen_example.py',
'torch/_export/db/logging.py',
'torch/_export/error.py',
'torch/_export/exported_program.py',
'torch/_export/pass_base.py',
'torch/_export/pass_infra/__init__.py',
'torch/_export/pass_infra/node_metadata.py',
'torch/_export/pass_infra/proxy_value.py',
'torch/_export/passes/__init__.py',
'torch/_export/passes/add_runtime_assertions_for_constraints_pass.py',
'torch/_export/passes/const_prop_pass.py',
'torch/_export/passes/functionalize_side_effectful_ops_pass.py',
'torch/_export/passes/replace_sym_size_ops_pass.py',
'torch/testing/_internal/__init__.py',
'torch/testing/_internal/autocast_test_lists.py',
'torch/testing/_internal/autograd_function_db.py',
@ -1358,19 +1382,6 @@ exclude_patterns = [
'torch/testing/_internal/test_module/__init__.py',
'torch/testing/_internal/test_module/future_div.py',
'torch/testing/_internal/test_module/no_future_div.py',
'torch/utils/_contextlib.py',
'torch/utils/_cpp_extension_versioner.py',
'torch/utils/_crash_handler.py',
'torch/utils/_device.py',
'torch/utils/_foreach_utils.py',
'torch/utils/_freeze.py',
'torch/utils/_mode_utils.py',
'torch/utils/_python_dispatch.py',
'torch/utils/_stats.py',
'torch/utils/_traceback.py',
'torch/utils/_zip.py',
'torch/utils/backcompat/__init__.py',
'torch/utils/backend_registration.py',
'torch/utils/benchmark/__init__.py',
'torch/utils/benchmark/examples/__init__.py',
'torch/utils/benchmark/examples/compare.py',
@ -1443,8 +1454,8 @@ init_command = [
'--no-black-binary',
'black==23.12.1',
'usort==1.0.8.post1',
'isort==5.13.2',
'ruff==0.11.10', # sync with RUFF
'isort==6.0.1',
'ruff==0.11.13', # sync with RUFF
]
is_formatter = true
@ -1535,7 +1546,7 @@ init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'ruff==0.11.10', # sync with PYFMT
'ruff==0.11.13', # sync with PYFMT
]
is_formatter = true

View File

@ -1,5 +1,11 @@
{
"recommendations": [
"ms-python.python",
]
"recommendations": [
"ms-python.python",
"charliermarsh.ruff",
"ms-python.flake8",
"ms-python.mypy-type-checker",
"ms-vscode.cmake-tools",
"EditorConfig.EditorConfig",
"streetsidesoftware.code-spell-checker",
]
}

View File

@ -1,15 +1,53 @@
{
"[python]": {
"editor.tabSize": 4
},
"files.associations": {
".clang-format": "yaml",
".clang-tidy": "yaml",
".flake8": "ini",
".coveragerc": "ini",
"*.py.in": "python",
"*.pyi.in": "python"
"*.pyi.in": "python",
"*requirements*.txt": "pip-requirements",
"*requirements*.in": "pip-requirements",
"*.cpp.in": "cpp",
"*.h.in": "cpp",
"*.cmake.in": "cmake",
"Makefile.*": "makefile",
"*.Makefile": "makefile",
"BUCK": "starlark",
"BUCK.*": "starlark"
},
"files.eol": "\n",
"files.insertFinalNewline": true,
"files.trimFinalNewlines": true,
"files.trimTrailingWhitespace": true,
"python.linting.enabled": true,
"python.linting.flake8Enabled": true
"cmake.preferredGenerators": [
"Ninja",
"Unix Makefiles"
],
"cmake.configureEnvironment": {
"CMAKE_EXPORT_COMPILE_COMMANDS": "ON"
},
"cmake.sourceDirectory": "${workspaceFolder}",
"cmake.buildDirectory": "${workspaceFolder}/build",
"cmake.configureArgs": [
"-DPython_EXECUTABLE=${workspaceFolder}/venv/bin/python",
"-DPython_ROOT_DIR=${workspaceFolder}/venv"
],
"[python]": {
"editor.tabSize": 4,
"editor.defaultFormatter": "charliermarsh.ruff"
},
"python.defaultInterpreterPath": "${workspaceFolder}/venv/bin/python",
"python.analysis.inlayHints.functionReturnTypes": true,
"flake8.importStrategy": "fromEnvironment",
"flake8.args": [
"--append-config=${workspaceFolder}/.flake8"
],
"ruff.importStrategy": "fromEnvironment",
"ruff.lineLength": 88,
"ruff.organizeImports": false,
"ruff.configurationPreference": "filesystemFirst",
"mypy-type-checker.importStrategy": "fromEnvironment",
"mypy-type-checker.preferDaemon": true,
"mypy-type-checker.reportingScope": "workspace"
}

View File

@ -500,7 +500,7 @@ filegroup(
# To achieve finer granularity and make debug easier, caffe2 is split into three libraries:
# ATen, caffe2 and caffe2_for_aten_headers. ATen lib group up source codes under
# aten/ directory and caffe2 contains most files under `caffe2/` directory. Since the
# ATen lib and the caffe2 lib would depend on each other, `caffe2_for_aten_headers` is splitted
# ATen lib and the caffe2 lib would depend on each other, `caffe2_for_aten_headers` is split
# out from `caffe2` to avoid dependency cycle.
cc_library(
name = "caffe2_for_aten_headers",

View File

@ -1,4 +1,4 @@
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
cmake_minimum_required(VERSION 3.27 FATAL_ERROR)
# cmake_policy(SET CMP0022 NEW) cmake_policy(SET CMP0023 NEW)
# Use compiler ID "AppleClang" instead of "Clang" for XCode. Not setting this
@ -6,6 +6,7 @@ cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
# one is detected as "AppleClang".
cmake_policy(SET CMP0010 NEW)
cmake_policy(SET CMP0025 NEW)
cmake_policy(SET CMP0126 OLD)
# Enables CMake to set LTO on compilers other than Intel.
cmake_policy(SET CMP0069 NEW)
@ -16,6 +17,8 @@ cmake_policy(SET CMP0069 NEW)
# we do this (and we don't if cmake is old), but it's nice when it's possible,
# and it's possible on our Windows configs.
cmake_policy(SET CMP0092 NEW)
# Don't remove the FindCUDA module
cmake_policy(SET CMP0146 OLD)
# Prohibit in-source builds
if(${CMAKE_SOURCE_DIR} STREQUAL ${CMAKE_BINARY_DIR})
@ -559,6 +562,10 @@ if(MSVC)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler /Zc:__cplusplus")
set(CMAKE_NINJA_CMCLDEPS_RC OFF)
if(MSVC_Z7_OVERRIDE)
# CMake set debug flags to use /Z7
set(CMAKE_MSVC_DEBUG_INFORMATION_FORMAT Embedded)
endif()
foreach(
flag_var
CMAKE_C_FLAGS
@ -571,12 +578,6 @@ if(MSVC)
CMAKE_CXX_FLAGS_RELEASE
CMAKE_CXX_FLAGS_MINSIZEREL
CMAKE_CXX_FLAGS_RELWITHDEBINFO)
# Replace /Zi and /ZI with /Z7
if(MSVC_Z7_OVERRIDE)
if(${flag_var} MATCHES "/Z[iI]")
string(REGEX REPLACE "/Z[iI]" "/Z7" ${flag_var} "${${flag_var}}")
endif(${flag_var} MATCHES "/Z[iI]")
endif(MSVC_Z7_OVERRIDE)
if(${CAFFE2_USE_MSVC_STATIC_RUNTIME})
if(${flag_var} MATCHES "/MD")
@ -699,7 +700,7 @@ endif()
if(USE_KLEIDIAI AND CMAKE_C_COMPILER_VERSION)
if(CMAKE_C_COMPILER_VERSION VERSION_LESS 11)
set(USE_KLEIDIAI OFF)
message(WARNING "Disabling KleidiAI: Requires atleast GCC 11 or Clang 11")
message(WARNING "Disabling KleidiAI: Requires at least GCC 11 or Clang 11")
endif()
endif()
@ -984,6 +985,9 @@ endif()
# ---[ Build flags Re-include to override append_cxx_flag_if_supported from
# third_party/FBGEMM
include(cmake/public/utils.cmake)
if(USE_COLORIZE_OUTPUT)
set(CMAKE_COLOR_DIAGNOSTICS ON)
endif()
if(NOT MSVC)
string(APPEND CMAKE_CXX_FLAGS " -O2 -fPIC")
@ -1059,19 +1063,6 @@ if(NOT MSVC)
CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Qunused-arguments" CMAKE_CXX_FLAGS)
if(${USE_COLORIZE_OUTPUT})
# Why compiler checks are necessary even when `try_compile` is used Because
# of the bug in ccache that can incorrectly identify `-fcolor-diagnostics`
# As supported by GCC, see https://github.com/ccache/ccache/issues/740 (for
# older ccache) and https://github.com/ccache/ccache/issues/1275 (for newer
# ones)
if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")
append_cxx_flag_if_supported("-fdiagnostics-color=always" CMAKE_CXX_FLAGS)
else()
append_cxx_flag_if_supported("-fcolor-diagnostics" CMAKE_CXX_FLAGS)
endif()
endif()
append_cxx_flag_if_supported("-faligned-new" CMAKE_CXX_FLAGS)
if(WERROR)
@ -1257,7 +1248,7 @@ endif()
add_subdirectory(c10)
add_subdirectory(caffe2)
# ---[ CMake related files Uninistall option.
# ---[ CMake related files Uninstall option.
if(NOT TARGET caffe2_uninstall)
configure_file(
${CMAKE_CURRENT_SOURCE_DIR}/cmake/cmake_uninstall.cmake.in

View File

@ -7,7 +7,7 @@
# Each line is a file pattern followed by one or more owners.
# For module labels => owners mapping, please see https://github.com/pytorch/pytorch/issues/24422.
/torch/utils/cpp_extension.py @fmassa @soumith @ezyang
/torch/utils/cpp_extension.py @fmassa @ezyang @malfet
# Not there to strictly require the approval, but to be tagged as a reviewer
# on the PRs to push them into a high priority inbox.

View File

@ -228,6 +228,8 @@ dependencies as well as the nightly binaries into the repo directory.
details.
* [cuda](aten/src/ATen/native/cuda) - CUDA implementations of
operators.
* [mps](aten/src/ATen/native/mps) - MPS implementations of
operators for Apple's Metal GPU family.
* [sparse](aten/src/ATen/native/sparse) - CPU and CUDA
implementations of COO sparse tensor operations
* [mkl](aten/src/ATen/native/mkl) [mkldnn](aten/src/ATen/native/mkldnn)
@ -343,13 +345,7 @@ command runs tests such as `TestNN.test_BCELoss` and
### Local linting
Install all prerequisites by running
```bash
make setup-lint
```
You can now run the same linting steps that are used in CI locally via `make`:
You can run the same linting steps that are used in CI locally via `make`:
```bash
make lint

View File

@ -70,7 +70,7 @@ RUN /opt/conda/bin/conda install -y python=${PYTHON_VERSION}
ARG TARGETPLATFORM
# INSTALL_CHANNEL whl - release, whl/nightly - nightly, whle/test - test channels
# INSTALL_CHANNEL whl - release, whl/nightly - nightly, whl/test - test channels
RUN case ${TARGETPLATFORM} in \
"linux/arm64") pip install --extra-index-url https://download.pytorch.org/whl/cpu/ torch torchvision torchaudio ;; \
*) pip install --index-url https://download.pytorch.org/${INSTALL_CHANNEL}/${CUDA_PATH#.}/ torch torchvision torchaudio ;; \

View File

@ -57,19 +57,35 @@ setup-env-cuda:
setup-env-rocm:
$(MAKE) setup-env PYTHON="$(PYTHON)" NIGHTLY_TOOL_OPTS="$(NIGHTLY_TOOL_OPTS) --rocm"
.PHONY: setup-lint
setup-lint:
.lintbin/.lintrunner.sha256: requirements.txt pyproject.toml .lintrunner.toml
@echo "Setting up lintrunner..."
$(PIP) install lintrunner
lintrunner init
@echo "Generating .lintrunner.sha256..."
@mkdir -p .lintbin
@sha256sum requirements.txt pyproject.toml .lintrunner.toml > .lintbin/.lintrunner.sha256
.PHONY: setup-lint
setup-lint: .lintbin/.lintrunner.sha256
.PHONY: lazy-setup-lint
lazy-setup-lint: .lintbin/.lintrunner.sha256
@if [ ! -x "$(shell command -v lintrunner)" ]; then \
$(MAKE) setup-lint; \
fi
.PHONY: lint
lint:
lintrunner
lint: lazy-setup-lint
lintrunner --all-files
.PHONY: quicklint
quicklint:
quicklint: lazy-setup-lint
lintrunner
.PHONY: quickfix
quickfix: lazy-setup-lint
lintrunner --apply-patches
# Deprecated target aliases
.PHONY: setup_env setup_env_cuda setup_env_rocm setup_lint
setup_env: setup-env

View File

@ -384,14 +384,14 @@ with such a step.
On Linux
```bash
export CMAKE_PREFIX_PATH="${CONDA_PREFIX:-'$(dirname $(which conda))/../'}:${CMAKE_PREFIX_PATH}"
python setup.py build --cmake-only
CMAKE_ONLY=1 python setup.py build
ccmake build # or cmake-gui build
```
On macOS
```bash
export CMAKE_PREFIX_PATH="${CONDA_PREFIX:-'$(dirname $(which conda))/../'}:${CMAKE_PREFIX_PATH}"
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py build --cmake-only
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ CMAKE_ONLY=1 python setup.py build
ccmake build # or cmake-gui build
```
@ -557,7 +557,7 @@ To learn more about making a contribution to Pytorch, please see our [Contributi
PyTorch is a community-driven project with several skillful engineers and researchers contributing to it.
PyTorch is currently maintained by [Soumith Chintala](http://soumith.ch), [Gregory Chanan](https://github.com/gchanan), [Dmytro Dzhulgakov](https://github.com/dzhulgakov), [Edward Yang](https://github.com/ezyang), and [Nikita Shulga](https://github.com/malfet) with major contributions coming from hundreds of talented individuals in various forms and means.
A non-exhaustive but growing list needs to mention: [Trevor Killeen](https://github.com/killeent), [Sasank Chilamkurthy](https://github.com/chsasank), [Sergey Zagoruyko](https://github.com/szagoruyko), [Adam Lerer](https://github.com/adamlerer), [Francisco Massa](https://github.com/fmassa), [Alykhan Tejani](https://github.com/alykhantejani), [Luca Antiga](https://github.com/lantiga), [Alban Desmaison](https://github.com/albanD), [Andreas Koepf](https://github.com/andreaskoepf), [James Bradbury](https://github.com/jekbradbury), [Zeming Lin](https://github.com/ebetica), [Yuandong Tian](https://github.com/yuandong-tian), [Guillaume Lample](https://github.com/glample), [Marat Dukhan](https://github.com/Maratyszcza), [Natalia Gimelshein](https://github.com/ngimel), [Christian Sarofeen](https://github.com/csarofeen), [Martin Raison](https://github.com/martinraison), [Edward Yang](https://github.com/ezyang), [Zachary Devito](https://github.com/zdevito).
A non-exhaustive but growing list needs to mention: [Trevor Killeen](https://github.com/killeent), [Sasank Chilamkurthy](https://github.com/chsasank), [Sergey Zagoruyko](https://github.com/szagoruyko), [Adam Lerer](https://github.com/adamlerer), [Francisco Massa](https://github.com/fmassa), [Alykhan Tejani](https://github.com/alykhantejani), [Luca Antiga](https://github.com/lantiga), [Alban Desmaison](https://github.com/albanD), [Andreas Koepf](https://github.com/andreaskoepf), [James Bradbury](https://github.com/jekbradbury), [Zeming Lin](https://github.com/ebetica), [Yuandong Tian](https://github.com/yuandong-tian), [Guillaume Lample](https://github.com/glample), [Marat Dukhan](https://github.com/Maratyszcza), [Natalia Gimelshein](https://github.com/ngimel), [Christian Sarofeen](https://github.com/csarofeen), [Martin Raison](https://github.com/martinraison), [Edward Yang](https://github.com/ezyang), [Zachary Devito](https://github.com/zdevito). <!-- codespell:ignore -->
Note: This project is unrelated to [hughperkins/pytorch](https://github.com/hughperkins/pytorch) with the same name. Hugh is a valuable contributor to the Torch community and has helped with many things Torch and PyTorch.

View File

@ -1,4 +1,4 @@
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
cmake_minimum_required(VERSION 3.27 FATAL_ERROR)
set(CMAKE_MODULE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/cmake ${CMAKE_MODULE_PATH})
if(NOT MSVC)
@ -169,14 +169,10 @@ file(GLOB native_transformers_hip_hip "native/transformers/hip/*.hip")
file(GLOB native_transformers_hip_cpp "native/transformers/hip/*.cpp")
file(GLOB native_quantized_cudnn_hip_cpp "native/quantized/cudnn/hip/*.cpp")
file(GLOB native_utils_cpp "native/utils/*.cpp")
# flash_attention sources
file(GLOB flash_attention_cuda_kernels_cu ${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc/flash_attn/src/*.cu)
# Flash attention C++ sources
file(GLOB flash_attention_cuda_cpp
"${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc/flash_attn/src/*.cpp"
"native/transformers/cuda/flash_attn/flash_api.cpp"
)
file(GLOB flash_attention_cuda_cpp ${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc/flash_attn/src/*.cpp)
file(GLOB native_flash_attn_api_cpp "native/transformers/cuda/flash_attn/flash_api.cpp")
# flash_attention hip sources
file(GLOB flash_attention_hip_hip "native/transformers/hip/flash_attn/*.hip")
@ -208,10 +204,29 @@ file(GLOB mem_eff_attention_cuda_cu "native/transformers/cuda/mem_eff_attention/
file(GLOB mem_eff_attention_cuda_kernels_cu "native/transformers/cuda/mem_eff_attention/kernels/*.cu")
file(GLOB mem_eff_attention_cuda_cpp "native/transformers/cuda/mem_eff_attention/*.cpp")
if(USE_CUDA AND (USE_FLASH_ATTENTION OR USE_MEM_EFF_ATTENTION))
add_library(flash_attention OBJECT EXCLUDE_FROM_ALL ${flash_attention_cuda_kernels_cu} ${flash_attention_cuda_cpp})
target_include_directories(flash_attention PUBLIC
${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc
${PROJECT_SOURCE_DIR}/third_party/flash-attention/include
${PROJECT_SOURCE_DIR}/third_party/cutlass/include
${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc/flash_attn/src
)
target_compile_definitions(flash_attention PRIVATE
# Copied from https://github.com/pytorch/pytorch/blob/a10024d7dea47c52469059a47efe376eb20adca0/caffe2/CMakeLists.txt#L1431
FLASH_NAMESPACE=pytorch_flash
FLASHATTENTION_DISABLE_ALIBI
FLASHATTENTION_DISABLE_SOFTCAP
UNFUSE_FMA
)
set_target_properties(flash_attention PROPERTIES POSITION_INDEPENDENT_CODE ON)
endif()
if(USE_FLASH_ATTENTION)
list(APPEND native_transformers_cuda_cu ${flash_attention_cuda_cu})
list(APPEND native_transformers_cuda_cu ${flash_attention_cuda_kernels_cu})
list(APPEND native_transformers_cuda_cpp ${flash_attention_cuda_cpp})
list(APPEND native_transformers_cuda_cpp ${native_flash_attn_api_cpp})
list(APPEND FLASH_ATTENTION_CUDA_SOURCES ${flash_attention_cuda_cu} ${flash_attention_cuda_kernels_cu})
list(APPEND ATen_ATTENTION_KERNEL_SRCS ${flash_attention_cuda_kernels_cu})

View File

@ -200,7 +200,7 @@ inline at::ScalarType scalar_type(at::ScalarType s) {
switch (_st) { \
__VA_ARGS__ \
default: \
TORCH_CHECK( \
TORCH_CHECK_NOT_IMPLEMENTED( \
false, \
'"', \
at_dispatch_name, \

View File

@ -35,7 +35,7 @@ MemOverlap has_internal_overlap(TensorImpl* t) {
// SymInts. Thus, if I have u0 size, we should assume that this has > 1
// elements (first expression), but if I have a u0 stride, I should NOT
// assume that it is not zero (second expression)
if (TORCH_GUARD_SIZE_OBLIVIOUS(sizes[i].sym_gt(1)) && strides[i] == 0) {
if (TORCH_GUARD_OR_FALSE(sizes[i].sym_gt(1)) && strides[i] == 0) {
return MemOverlap::Yes;
}
}

View File

@ -108,7 +108,7 @@ void SparseTensorImpl::set_indices_and_values_unsafe(const Tensor& indices, cons
AT_ASSERT(device() == values_.device());
AT_ASSERT(values_.device() == indices_.device());
coalesced_ = TORCH_GUARD_SIZE_OBLIVIOUS(sym_nnz().sym_lt(2));
coalesced_ = TORCH_GUARD_OR_FALSE(sym_nnz().sym_lt(2));
}

View File

@ -1388,7 +1388,7 @@ bool TensorIteratorBase::fast_set_up(const TensorIteratorConfig& config) {
case FastSetupType::NON_OVERLAPPING_DENSE:
{
// find the index of a defined tensor in operands_ start from input tensor
int i_defined; // NOLINT(cppcoreguidelines-init-variables)
int i_defined = -1;
for (i_defined = ntensors() - 1; i_defined >= 0; --i_defined) {
if (tensor(i_defined).defined()) break;
}

View File

@ -168,7 +168,9 @@ class IListRefTagImpl<IListRefTag::Boxed, at::OptionalTensorRef>
*/
static IListRefConstRef<at::OptionalTensorRef> iterator_get(
const typename list_type::const_iterator& it) {
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wdangling-reference")
const auto& ivalue = (*it).get();
C10_DIAGNOSTIC_POP()
if (!ivalue.isNone()) {
const auto& tensor = ivalue.toTensor();
return (tensor.defined()) ? tensor : at::OptionalTensorRef{};

View File

@ -152,8 +152,11 @@ struct TORCH_API DispatchKeyExtractor final {
// no safe toTensorRef method, alas)
ks = ks | ivalue.unsafeToTensorImpl()->key_set();
} else if (C10_UNLIKELY(ivalue.isTensorList())) {
for (const at::Tensor& tensor : ivalue.toTensorList()) {
ks = ks | tensor.key_set();
// NB: use toListRef as it doesn't induce refcount bumps
// (toTensorListRef is not a thing)
for (const auto& nv : ivalue.toListRef()) {
auto* tensor = nv.unsafeToTensorImpl();
ks = ks | tensor->key_set();
}
}
// Tensor?[] translates to a c10::List<IValue> so we need to peek inside

View File

@ -179,6 +179,18 @@ const std::vector<OperatorName> Dispatcher::getAllOpNames() {
});
}
const std::vector<OperatorName> Dispatcher::getAllOpNamesForDispatchKey(DispatchKey k) {
return operatorLookupTable_.read([&] (const ska::flat_hash_map<OperatorName, OperatorHandle>& operatorLookupTable) -> std::vector<OperatorName> {
std::vector<OperatorName> allOpNames;
for (const auto& op : operatorLookupTable) {
if (op.second.hasKernelForDispatchKey(k)) {
allOpNames.push_back(op.first);
}
}
return allOpNames;
});
}
// Postcondition: caller is responsible for disposing of registration when they
// are done
OperatorHandle Dispatcher::findOrRegisterName_(const OperatorName& op_name) {

View File

@ -165,6 +165,10 @@ class TORCH_API Dispatcher final {
// Returns a list of all operator names present in the operatorLookupTable_
const std::vector<OperatorName> getAllOpNames();
// Returns a list of all operator names present in the operatorLookupTable_
// for a given dispatch key
const std::vector<OperatorName> getAllOpNamesForDispatchKey(DispatchKey k);
// ------------------------------------------------------------------------
//
// Invoking operators

View File

@ -213,7 +213,8 @@ OperatorEntry::AnnotatedKernelContainerIterator OperatorEntry::registerKernel(
#endif
// Suppress the warning for Meta key as we are overriding C++ meta functions with python meta functions
// for some ops
if (dispatch_key != DispatchKey::Meta) {
// Also suppress the warning for MTIA, as MTIA achieves CPU fallback by overriding registration.
if (dispatch_key != DispatchKey::Meta && dispatch_key != DispatchKey::MTIA) {
TORCH_WARN_ONCE("Warning only once for all operators, other operators may also be overridden.\n",
" Overriding a previously registered kernel for the same operator and the same dispatch key\n",
" operator: ", (schema_.has_value() ? toString(schema_->schema) : toString(name_)), "\n",
@ -353,7 +354,7 @@ std::pair<const AnnotatedKernel&, const char*> OperatorEntry::computeDispatchTab
// CompositExplicitAutogradNonFunctional > CompositeExplicitAutograd > CompositeImplicitAutograd > Autograd
// Note [CompositeExplicitAutograd and CompositeImplicitAutograd]
// When there're registrations to both CompositeExplicitAutograd & CompositeImplicitAutograd & Autograd, from (2.2) we know CompositeExplicitAutograd
// and Autograd kernels will be picked up and CompositeImplicitAutograd is overriden.
// and Autograd kernels will be picked up and CompositeImplicitAutograd is overridden.
// This is fine and in practice CompositeExplicitAutograd and CompositeImplicitAutograd shouldn't co-exist for an op.
// TODO: Update alias key precedence after we add new alias keys AutogradDispatchCPUOrCUDA .

View File

@ -1787,8 +1787,7 @@ TEST(NewOperatorRegistrationTest, dispatchAutogradPrecedence) {
}
TEST(NewOperatorRegistrationTest, throwsWhenRegisterToBackendMapsToAutogradOther) {
// NOLINTNEXTLINE(cppcoreguidelines-init-variables)
bool fpga_called, math_called = false;
bool fpga_called = false, math_called = false;
auto m = MAKE_TORCH_LIBRARY(test);
m.def("fn", torch::dispatch(c10::DispatchKey::FPGA, [&](const Tensor& x) { fpga_called = true; return x; }));
m.impl("fn", c10::DispatchKey::CompositeImplicitAutograd, [&](const Tensor& x) { math_called = true; return x; });

Some files were not shown because too many files have changed in this diff Show More