Compare commits

..

501 Commits

Author SHA1 Message Date
f355312467 Pakcage binary weights so it can also be read in c++ 2025-10-06 19:30:40 -07:00
fd0704a8ab add interface for passing in constant names using FQN names 2025-10-06 18:57:05 -07:00
4a6abba0d9 [ROCm][CI] test_convolution.py uses miopen immediate mode (#164598)
This should help stabilize some flaky test behavior where miopen would pick different solutions for different parts of the same test and the test expects bitwise identical results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164598
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-06 17:48:50 +00:00
96181d6f76 [BE][cutlass backend] BE changes post cutlass_cppgen name change (#164589)
Differential Revision: D83809105

Handle reviews from https://github.com/pytorch/pytorch/pull/164159

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164589
Approved by: https://github.com/Skylion007
2025-10-06 17:22:08 +00:00
2164b66121 [export] Better state_dict and constant dedup in torch.export.save (#164196)
Summary:

Previously, weight deduplication was done by simply grouping tensors with their untyped storage and saving the first tensor in the group.

A more rigorous approach would be to find a complete tensor that covers the storage and store that tensor. This is particularly important for GPU weights because when saving to raw bytes, we move the weight to CPU first, and if the weight being saved is not a complete one, it will lose the storage information during the copy to CPU.

In this diff, we reuse code in `_package_weights.py` for better weights and constants deduplication in `torch.export.save`.

Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_weight_sharing_gpu

Differential Revision: D83523690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164196
Approved by: https://github.com/angelayi
2025-10-06 17:03:15 +00:00
bde18c445d [Max Autotune][B200] Relax absolute tolerance for MM+MM test (#164022)
Summary: Relax absolute tolerance from 1e-2 to 1e-1 for `test_non_contiguous_input_mm_plus_mm` in `test_max_autotune.py`.

Test Plan: `test_max_autotune.py`

Differential Revision: D83391942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164022
Approved by: https://github.com/eellison
2025-10-06 16:29:07 +00:00
f3e43ff2d7 [Max Autotune][B200] Fix decompose_k test failure (#164021)
Summary:
Fix decompose_k test failure (`test_max_autotune_decompose_k `) in `test_max_autotune.py` on B200s by setting `torch._inductor.config` patches for variables `comprehensive_padding` and `shape_padding`. Initial failure was `AssertionError: False is not true : Could not find a split in {3, 9, 2187, 81, 243, 729, 27} in # AOT ID: ['6_forward']`.

Refactor decompose_k test to follow patch semantics when setting all environment variables within a test.

Test Plan:
`test_max_autotune.py`:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -c fbcode.re_gpu_tests=False -- test_max_autotune_decompose_k
```

Differential Revision: D83390563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164021
Approved by: https://github.com/njriasan, https://github.com/mlazos, https://github.com/eellison
2025-10-06 16:28:23 +00:00
39d0c06ed0 [torchfuzz] check in some more xfail repros (#164619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164619
Approved by: https://github.com/ezyang
2025-10-06 16:20:44 +00:00
4ab847bbc7 Pyrefly suppressions 4/n (#164615)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: uncomment lines in the pyrefly.toml file
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/356645cf8cfe33123d9a27f23b30f7b1

after:

0 errors (2,753 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164615
Approved by: https://github.com/oulgen
2025-10-06 16:14:36 +00:00
4bd1505f84 [precompile][ez] Inline type definition for dynamo cache entry. (#164580)
Summary: as title. DynamoCaptureOutput in package.py is not actively used in other files. Inline it to reduce confusion.

Test Plan: CI

Differential Revision: D83846957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164580
Approved by: https://github.com/dolpm
2025-10-06 16:00:59 +00:00
1f9614cef8 [ROCm][CI] Change rocm periodic workflow label to linux.rocm.gpu.mi250.4 (#164616)
Testing done on this PR: https://github.com/pytorch/pytorch/pull/156491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164616
Approved by: https://github.com/jeffdaily, https://github.com/huydhn
2025-10-06 15:51:07 +00:00
35f66b83f8 respect aten planned overlap in inductor (#164569)
Now that we have a hop to add implicit deps - use those deps for comm/compute overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164569
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
ghstack dependencies: #164568
2025-10-06 15:47:55 +00:00
4a39820e5e Add hop for additional control dependencies (#164568)
Adds [control_deps](https://en.wikipedia.org/wiki/Control_dependency) higher-order operator to enforce explicit scheduling dependencies in FX graphs. This prevents unwanted operation reordering/fusion by giving nodes additional dependencies, which we also respect in inductor by adding weakdeps on the additional dependencies.

This can be generally useful (such as for ordering collectives) but in this case I am using it so that fusions do not interfere with aten planned comm-compute overlap.

There's definitely some similarity with the `with_effects` hop. Talked with @angelayi  - when @zou3519  is back we will figure out how we want to consolidate.

The implementation needs to be a subgraph (as opposed to `with_effects`) because inductor relies on `V.graph.current_node`. Changing the signature of the node with `with_effects`  breaks this, and additionally, also breaks striding constraints on the wrapped node - see this [TODO](aed66248a0/torch/fx/experimental/proxy_tensor.py (L1246-L1249)). By maintaining the node with its original calling structure in subgraph this all works.

Example transformation:

Before:
```
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg0_1, 1), kwargs = {})
%mm : [num_users=1] = call_function[target=torch.ops.aten.mm.default](args = (%arg1_1, %arg1_1), kwargs = {})
%mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 2), kwargs = {})
```
After:
```
add: "f32[256, 256]" = torch.ops.aten.add.Tensor(arg0_1, 1)
mm: "f32[256, 256]" = torch.ops.higher_order.control_deps((add,), subgraph_mm, arg1_1, arg1_1)
mul: "f32[256, 256]" = torch.ops.higher_order.control_deps((mm,), subgraph_mul, add)
```

The mm operation now explicitly depends on add completing first, and mul depends on mm, with original operations preserved in subgraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164568
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
2025-10-06 15:47:55 +00:00
600267ea56 Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446)
Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores

<img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" />

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #162296
2025-10-06 14:29:07 +00:00
f11ac803d7 Update slow tests (#164726)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164726
Approved by: https://github.com/pytorchbot
2025-10-06 12:57:29 +00:00
ea42517e45 [xla hash update] update the pinned xla hash (#164727)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164727
Approved by: https://github.com/pytorchbot
2025-10-06 11:54:10 +00:00
91c211fb8c AC should work with pre-dispatch IR (#164505)
Previously we had to rely on turning off export verifier because the AC body was torch IR instead of aten IR. This PR makes it so that we create an IR that is export compatible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164505
Approved by: https://github.com/ydwu4, https://github.com/xmfan
2025-10-06 11:05:22 +00:00
660e369a68 [FSDP2] check storage equal and consider data_ptr() == 0 (#164595)
resolve https://github.com/pytorch/pytorch/issues/164554

unit test
* `pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_cached_state_dict`
* `pytest -s test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta_device_1d_init`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164595
Approved by: https://github.com/fegin
2025-10-06 08:44:38 +00:00
2883b5ab77 [dynamo] Support torch.fx.traceback.annotate (#164678)
Builds on top of https://github.com/pytorch/pytorch/pull/163673 and https://github.com/pytorch/pytorch/pull/164174. This will be used in the followup PRs to apply regional inductor compilation.

The existing implementation let Dynamo trace into the `torch.fx.traceback.annotate`, but thats not what we want. We want Dynamo to essentially run the torch.fx.traceback.annotate function in eager, so that every Fx node created in Dynamo Fx graph has the custom meta node.

This does not work with graph breaks yet. But we can solve that problem, if needed, in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164678
Approved by: https://github.com/SherlockNoMad, https://github.com/jansel, https://github.com/xmfan
2025-10-06 02:59:24 +00:00
9fff8155c3 [2/N] Fix clang-tidy readability checks (#164652)
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-06 01:06:01 +00:00
331191ce4b Revert "[BE] Make PyObjectSlot use a global PyInterpreter (#162659)"
This reverts commit 29cbcbac4215e0d9070a1b7a07ddaec9a36bbd08.

Reverted https://github.com/pytorch/pytorch/pull/162659 on behalf of https://github.com/izaitsevfb due to reverted internally, see [D83214133](https://www.internalfb.com/diff/D83214133) ([comment](https://github.com/pytorch/pytorch/pull/162659#issuecomment-3369348172))
2025-10-05 21:39:57 +00:00
2c5ed6e7c0 Revert "[2/N] Fix clang-tidy readability checks (#164652)"
This reverts commit 3c5ca685d6f5b6f3971c0cd20a054aa355610419.

Reverted https://github.com/pytorch/pytorch/pull/164652 on behalf of https://github.com/izaitsevfb due to need to revert due to a conflict with revert of https://github.com/pytorch/pytorch/pull/162659 ([comment](https://github.com/pytorch/pytorch/pull/164652#issuecomment-3369346707))
2025-10-05 21:36:57 +00:00
5d7360bb03 Revert "Enable all SIM rules except disabled ones (#164645)"
This reverts commit 321e6026925f6b6e8a36e3a8b7c0295cd7541911.

Reverted https://github.com/pytorch/pytorch/pull/164645 on behalf of https://github.com/izaitsevfb due to causes lint failures ([comment](https://github.com/pytorch/pytorch/pull/164645#issuecomment-3369274351))
2025-10-05 19:32:21 +00:00
321e602692 Enable all SIM rules except disabled ones (#164645)
`SIM` rules are useful for simplifying boolean expressions and enhances code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645
Approved by: https://github.com/ezyang
2025-10-05 07:38:25 +00:00
3c5ca685d6 [2/N] Fix clang-tidy readability checks (#164652)
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-05 07:05:11 +00:00
5178d0a480 [Compile] Fix Compile Warning for Capture Id (#163898)
```bash
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG     CaptureId_t capture_id_ = -1;
DEBUG                               ^
DEBUG
DEBUG Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
DEBUG
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG     CaptureId_t capture_id_ = -1;
DEBUG                               ^
DEBUG
DEBUG Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
DEBUG
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG     CaptureId_t capture_id_ = -1;
DEBUG                               ^
```

Cuda won't use 0 as a capture id, so it is safe to initialize with 0, which also matches the initialization in `pytorch/aten/src/ATen/native/cudnn/RNN.cpp:2362`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163898
Approved by: https://github.com/houseroad
2025-10-05 06:51:33 +00:00
cf0a00d4f3 Enable ruff FURB161 rule (#164654)
This PR enables FURB161 in ruff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164654
Approved by: https://github.com/Skylion007
2025-10-04 23:26:28 +00:00
5ed4270440 remove more no longer needed torch._check_is_size calls 1 (#164630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164630
Approved by: https://github.com/Skylion007
ghstack dependencies: #164627
2025-10-04 22:06:04 +00:00
8c728e129d remove no longer needed torch._check_is_size calls from test_dynamic_shapes (#164627)
No longer needed in those tests to prevent DDE

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164627
Approved by: https://github.com/ezyang
2025-10-04 22:06:04 +00:00
9fc2c6446d remove guard_size_oblivious from is_contiguous python eager eval path. (#164622)
Summary: this should not be needed anymore we shall have explicit is_contiguous_or_false calls where appropriate already !

Test Plan: run existing tests.

Differential Revision: D83884977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164622
Approved by: https://github.com/bobrenjc93
2025-10-04 21:02:39 +00:00
409aece3f9 [dynamo, 3.14] prevent StackRef compilation in 3.14 Windows (#164400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164400
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-10-04 18:38:08 +00:00
b116c51330 torch.cond on DTensor triggers an internal assert, add xfail for this. (#164389)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164389
Approved by: https://github.com/albanD
2025-10-04 18:12:06 +00:00
2e1742dd63 Revert "Add device argument to torch.random.get_rng_state (#163034)"
This reverts commit 9580539e2f73d68e89544c713ff460bea3038701.

Reverted https://github.com/pytorch/pytorch/pull/163034 on behalf of https://github.com/cyyever due to It cased partially initialised torch module ([comment](https://github.com/pytorch/pytorch/pull/163034#issuecomment-3368349209))
2025-10-04 15:25:45 +00:00
f7ad6dbad6 Numpy zerotensor handling (#164487)
Fixes #89034

Updated tensor_to_numpy() function in tensor_numpy.cpp to handle ZeroTensors by throwing an error if force=False and returning an array full of zeros if force=True.

@ngimel, I just saw that you mentioned PyTorch is not too concerned with this issue but I had already worked on it so I figured I would push it anyways and see what you thought. Feel free to close the PR if you think it is not worth merging.

@albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164487
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-10-04 12:03:48 +00:00
f46bb04dcc Revert "Add pure view support in autograd Function (#164467)"
This reverts commit 10335ffb2cce26c99958d055f415a16c1d14bc35.

Reverted https://github.com/pytorch/pytorch/pull/164467 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164467#issuecomment-3368152304))
2025-10-04 11:42:46 +00:00
6f6a919366 Revert "Make custom op alias check consistent (#164576)"
This reverts commit e438db254602cf39ba536aed0590b4144c019ee8.

Reverted https://github.com/pytorch/pytorch/pull/164576 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164467#issuecomment-3368152304))
2025-10-04 11:42:45 +00:00
83d71dfb2f Fix mesh.get_local_rank when it is > 1d (#164473)
Previously, we would not take the arguments passed by get_local_rank into account. This means that we wouldn't be able to trace this call if we had a device_mesh > 1d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164473
Approved by: https://github.com/xmfan, https://github.com/Skylion007
2025-10-04 11:27:55 +00:00
5103ecc5d8 [1/N] Fix clang-tidy readability checks (#164561)
Check all `.cpp` files except `jit` files for readability thoroughly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164561
Approved by: https://github.com/Skylion007
2025-10-04 09:40:38 +00:00
9580539e2f Add device argument to torch.random.get_rng_state (#163034)
Fixes #162812

Adds support for either passing a device directly into get_rng_state, or passing in a string or int (which is then wrapped into a device inside, as in torch.cuda.get_rng_state).

I wasn't exactly sure where tests for this should go, please let me know. I used this script for testing:
```python
import torch

# note: when running with CUDA GPU, first three tests will give the same result,
# as will the last two

# test with no device specified
print(torch.get_rng_state())

# test with CPU
cpu_device = torch.device("cpu")
print(torch.get_rng_state(cpu_device))

# test with direct name
print(torch.get_rng_state("cpu"))

# test with CUDA
cuda_device = torch.device("cuda:0")
print(torch.get_rng_state(cuda_device))

# test with integer
print(torch.get_rng_state(0))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163034
Approved by: https://github.com/ezyang, https://github.com/cyyever
2025-10-04 06:48:39 +00:00
a11a66ef32 Remove CUDA 11 branches for sparse code (#164531)
This PR removes outdated CUDA version checks from sparse code in aten/src/ATen/cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164531
Approved by: https://github.com/eqy
2025-10-04 06:07:49 +00:00
6b768e1890 Support propagating custom meta field to backward graph nodes (#164174)
# Propagate custom meta data to backward

Support propagating the user annotation tags to backward graph, by extending the `copy_fwd_metadata_to_bw_nodes` utils (recommended by @xmfan , thanks!).

Example annotation API (added in https://github.com/pytorch/pytorch/pull/163673):

```
class M(torch.nn.Module):
    def forward(self, x):
        with fx_traceback.annotate({"pp_stage": 0}):
            with fx_traceback.annotate({"fdsp_bucket": 0}):
                x = x + 1
            x = x - 2
            with fx_traceback.annotate({"cuda_stream": 2, "fsdp_bucket": 1}):
                x = x * 2
        x = x / 3
        return x
```

Assumptions (some inherited from https://github.com/pytorch/pytorch/pull/126573):

- I am trusting the seq_nr mapping introduced to aot_autograd nodes in https://github.com/pytorch/pytorch/pull/103129
- I am also trusting that the forward is single threaded, since seq_nr is thread local.  If this isn't always true, we'll need to also plumb thread_id through the same machinery which is populating seq_nr.
- **(This is changed in this PR!) I assume all backward graph nodes has "is_backward" for 'partitioner_tag', and all other nodes are forward graph nodes**.  If we don't run export before `aot_export_join_with_descriptors`, then none of the nodes has "nn_module_stack" in node meta. If we do run export first, then we don't need this change.
- I copy "custom" node meta from forward to backward graph nodes.

Question:
- Is it a good idea to copy all "custom" node meta? Or should we create a dedicated key in custom node meta to be copied? @SherlockNoMad
- Do we expect people to run export before using `aot_export_join_with_descriptors`?
- Can we assume the following for graph produced by `aot_export_join_with_descriptors`? "all backward graph nodes has "is_backward" for 'partitioner_tag', and all other nodes are forward graph nodes". Maybe this is a question for @ezyang

```
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_
python test/export/test_export.py -k preserve_anno
python test/distributed/tensor/test_dtensor_export.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164174
Approved by: https://github.com/xmfan, https://github.com/SherlockNoMad
2025-10-04 05:03:32 +00:00
35c4130fd1 [2/N] Fix ruff warnings (#164460)
Apply ruff `SIM` rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164460
Approved by: https://github.com/ezyang
2025-10-04 03:40:32 +00:00
34042a9145 Change intra-graph offset dtype to uint64_t (#164515)
Even though `offset_intragraph_` only tracks RNG consumption within a single graph replay, we have observed that the 32bit storage for these offsets is easy to overshoot, especially for cases with big CUDA graph captures including kernels that are generating a large amount of random numbers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164515
Approved by: https://github.com/eee4017, https://github.com/eqy
2025-10-04 03:39:09 +00:00
Ken
9d1ab4f4bb [CI] Limit Numba CUDA-13 patch to CUDA environments only (#164607)
The patch introduced in https://github.com/pytorch/pytorch/pull/163111 caused issues in ROCm environments. This change guards the patching logic to CUDA environments only, thus ameliorating test failures in ROCm environments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164607
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-04 02:39:07 +00:00
3e0826c9d7 Update disabling fast-path for strict-export inside MultiheadAttention (#164544)
For some reason, executorch needs the slow path. But the original flag doesn't work for new export because we inline torch modules even before getting into make_fx. We still have to keep the old flag because lot of code assumes this exist.... grr

Differential Revision: [D83810733](https://our.internmc.facebook.com/intern/diff/D83810733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164544
Approved by: https://github.com/anijain2305, https://github.com/mikaylagawarecki
2025-10-04 02:20:55 +00:00
86c789849e [fr] Re-order mismatch check in fr analysis script (#164606)
In reality we found the current mismatch order does not match the actual error distribution, so we reorder it a bit as following:
1. We do collective type check first
2. Then size check (excluding all2all)
3. dtype check
4. state check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164606
Approved by: https://github.com/VieEeEw
2025-10-04 01:16:15 +00:00
f3afbcf340 [ONNX] Bump tested onnxruntime to 1.23.0 and onnxscript to 0.5.2 (#164440)
Performs tests on the latest ONNX environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164440
Approved by: https://github.com/justinchuby, https://github.com/albanD
2025-10-04 01:10:47 +00:00
40b25578e4 [Inductor] deterministic mode (#163589)
Add a deterministic mode to skip the on device benchmarking that we know should affect numeric. This include
- pad-mm
- dynamic rblock scaling
- template autotuning
- coordinate descent tuning for reduction
- reduction config autotuning in CachingAutotuner.  For reduction both RBLOCK, num_warps should affect numeric. XBLOCK does not. We can still autotune XBLOCK for reductions.
- benchmarking for computation communication reordering pass

The mode definitely has perf hit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163589
Approved by: https://github.com/v0i0
2025-10-04 01:05:08 +00:00
412c6d28ec [ROCm][CI] additional dynamo benchmarks for inductor-periodic (#164279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164279
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-04 00:55:17 +00:00
7d570129e0 Fix custom autograd Function memory leak when saving mutated view (#164407)
Fixes https://github.com/pytorch/pytorch/issues/160317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164407
Approved by: https://github.com/albanD
2025-10-04 00:47:12 +00:00
97ca21106d move fw|bw compiler args in aot joint with descriptors (#164584)
Summary: Minor refactor where we push some args in the aot joint with descriptors workflow that are not used in export stage to the compile stage where they are actually used.

Test Plan: existing tests should pass

Differential Revision: D83850316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164584
Approved by: https://github.com/tugsbayasgalan
2025-10-04 00:24:46 +00:00
27234792ad Fix refine_ranges corner case (#164075)
address https://github.com/pytorch/pytorch/issues/161360

u0>0 should update the range of u0 to start from [1, ..] this fix it. it was not doing that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164075
Approved by: https://github.com/ColinPeppler
2025-10-03 23:30:46 +00:00
b6b7a44dec Fix common typos and misspellings (#164413)
Summary:
This commit fixes numerous typos and misspellings found throughout the codebase. The fixes improve code readability and documentation consistency across C++, Python, CUDA, and documentation files.

## Typos Fixed

| Before | After | Occurrences |
|--------|-------|-------------|
| occured | occurred | 14 |
| accross | across | 9 |
| lenght/lenghts | length/lengths | 8 |
| unneccessary | unnecessary | 5 |
| Peform | Perform | 4 |
| furture | future | 3 |
| paritioned | partitioned | 2 |
| desireable | desirable | 2 |
| registerations | registrations | 2 |
| seperated | separated | 2 |
| intialized | initialized | 2 |
| capatibility | compatibility | 2 |
| peformed | performed | 2 |
| Exmple | Example | 2 |
| comma_seperated | comma_separated | 2 |
| cumsuming | consuming | 2 |
| neccessary | necessary | 1 |
| ParamterMetadataTable | ParameterMetadataTable | 1 |
| matached | matched | 1 |
| conaitner | container | 1 |
| reivew | review | 1 |
| prioriry | priority | 1 |
| Alocated | Allocated | 1 |
| opportunixtically | opportunistically | 1 |
| peformance | performance | 1 |
| equavalent | equivalent | 1 |
| asssumed | assumed | 1 |
| valdiation | validation | 1 |
| apprear | appear | 1 |
| consectuve | consecutive | 1 |
| dependending | depending | 1 |
| copnversion | conversion | 1 |
| weigted | weighted | 1 |
| repreesenting | representing | 1 |
| finialize | finalize | 1 |
| unintialized | uninitialized | 1 |
| conbined | combined | 1 |
| tesnor | tensor | 1 |
| desugared | discarded | 1 |
| behaviour | behavior | 1 |
| paramerizaitons | parametrizations | 1 |
| compute_output_lenghths_kernel | compute_output_lengths_kernel | 1 |

Test Plan: N/A - mostly comments - waiting on CI

Differential Revision: D83695665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164413
Approved by: https://github.com/eqy, https://github.com/larryliu0820
2025-10-03 23:19:41 +00:00
3ddf2018d0 Revert "Support setting grad_dtype on leaf tensors (#162815)"
This reverts commit dca73982c53e9f99f96246b5d9ed9bab83c7423f.

Reverted https://github.com/pytorch/pytorch/pull/162815 on behalf of https://github.com/yangw-dev due to break internal test D83850533, see more details below ([comment](https://github.com/pytorch/pytorch/pull/162815#issuecomment-3367498501))
2025-10-03 23:14:28 +00:00
fac6f20ae3 [CI] Add another win shard (#164605)
Since its timing out 0b4f2b46d9/1

the first shard is disproportionately long because of cpp tests, I'm trying to figure that out but for now we can do this or increase the timeout
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164605
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-10-03 22:51:09 +00:00
1894082000 UT/Examples for resharding checkpoint save/loads for distributed tensors with uneven shards. (#160533)
1\  DTensor abstraction on its own, does not support arbitrary length shards in its distributed tensors representation. It supports a single uneven shard, bit it has to be the last shard in the sharding dimension.

2\ However, DCP supports an API called checkpointable. This API allows you to define your custom shardable tensor structure. I have given a UT example ( look for CheckpointableDistTensor). Therefore, one option is to use CheckpointableDistTensor to save/load uneven shards.

3\ While exploring this path, I also noticed that torch.rec module also encountered a similar problem while working with DTensor. They workaround it by implementing Checkpointable API in DTensor and introducing an auxillary structure called LocalShardsWrapper. This is the second option we can use to unblock data loader resharding work.

In summary;
Use LocalShardWrapper + DTensor as the first option to unblock.
Second preference is to use new implementation of Checkpointable API. ( similar to CheckpointbaleDistTensor I have introduced in this example).

Differential Revision: D80182564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160533
Approved by: https://github.com/saumishr
2025-10-03 22:15:02 +00:00
5a66ff4915 [dynamo, 3.14] fix _detect_and_normalize_assert_statement for 3.14 (#164005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164005
Approved by: https://github.com/anijain2305, https://github.com/atalman
2025-10-03 22:07:54 +00:00
abadea70f3 [inductor] thread hint_override in more kernel args (#164494)
ensure hint_override is threaded in benchmarking args

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164494
Approved by: https://github.com/bobrenjc93
2025-10-03 22:07:12 +00:00
f414aa8e0d Add pyrefly suppressions (3/n) (#164588)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: uncomment lines in the pyrefly.toml file
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/bb31574ac8a59893c9cf52189e67bb2d

after:

 0 errors (1,970 ignored)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164588
Approved by: https://github.com/oulgen
2025-10-03 22:03:03 +00:00
e438db2546 Make custom op alias check consistent (#164576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164576
Approved by: https://github.com/soulitzer
ghstack dependencies: #164467
2025-10-03 21:42:11 +00:00
10335ffb2c Add pure view support in autograd Function (#164467)
Fix https://github.com/pytorch/pytorch/issues/73604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164467
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2025-10-03 21:42:11 +00:00
f006aee601 Speed up FP precision lookup (#164044)
This commit simplifies the precision lookup and setting logic
by reducing the number of branches and using a custom hash
function. Fixes #161822. The issue described in #163709 still
persists. This is meant as a short term fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164044
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-03 21:35:20 +00:00
8d53d788fe lint: add .pyi to changed files on .pyi.in changes (#164603)
We were observing issues where the lint on trunk vs. PRs would be different
due to missing .pyi files. This change adds the .pyi files to the changed files
list when .pyi.in files are changed.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164603
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/Skylion007
2025-10-03 21:30:54 +00:00
0b4f2b46d9 Revert "[inductor] require shape in TritonCSEVariable (#162275)"
This reverts commit f465ea6752c91498de63eb57439a74f4836e568a.

Reverted https://github.com/pytorch/pytorch/pull/162275 on behalf of https://github.com/yangw-dev due to break interal test, see more details in next comment ([comment](https://github.com/pytorch/pytorch/pull/162275#issuecomment-3367213941))
2025-10-03 21:07:00 +00:00
960c4b9937 [inductor] Enable triton kernels with unbacked inputs (#164509)
Summary:
We need to pass in fallback value to avoid converting symbols to int

original failure log in onefeed Slimper MB - P1973406565
`raise TypeError("Cannot convert symbols to int")`

Test Plan:
if not passing in fallback value -
https://www.internalfb.com/intern/everpaste/?handle=GGeAoh_M11kEGOECAFELOaq8ooRCbswMAAAz
`raise TypeError("Cannot convert symbols to int")`

```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints -- test_triton_kernel_with_unbacked_symint_fallback --print-passing-details --env TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 --env TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(u0, 0)"
```
Buck UI: https://www.internalfb.com/buck2/4d27cd49-770b-40de-8c65-9ee04c5dd687
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149324695031
Network: Up: 0B  Down: 16MiB  (reSessionID-8e8b07a2-e31c-402d-bf6a-ebb92253e654)
Executing actions. Remaining     0/6                                                              5.0s exec time total
Command: test.     Finished 2 cache (100% hit)                                                    5.0s exec time cached (100%)
Time elapsed: 33.8s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D83684260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164509
Approved by: https://github.com/ColinPeppler
2025-10-03 21:05:18 +00:00
1f8ee5da11 [TorchGen] Remove unused variables and function imports (#164538)
This PR removes unused code in torchgen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164538
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-10-03 20:49:36 +00:00
da49a57d34 [ROCm] Enabled JIT UTs on ROCm (#164582)
This PR is to enable the following tests rocm.

test/test_jit.py::TestBackends::test_save_load
test/test_jit.py::TestBackends::test_execution
test/test_jit.py::TestBackends::test_errors
test/test_jit.py::TestCUDA::test_current_stream

Verified that the tests pass on AMD gfx90a and gfx942 arch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164582
Approved by: https://github.com/jeffdaily
2025-10-03 20:16:41 +00:00
8ec8c14ace Revert "[CUDA] Add experimental green context support for SM carveout (#159104)"
This reverts commit 3c59351c6ea2fc29d346903e28e95c5f4d0ccdbb.

Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/clee2000 due to failed lint, pyfmt not caught pyi file, I think they need special handling since theyre not in the changed files list? ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3367077208))
2025-10-03 20:15:56 +00:00
2d50678dcc Fix -Wno-duplicate-decl-specifier is valid for C/ObjC but not for C++ (#164552)
Fixes #99715
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164552
Approved by: https://github.com/Skylion007
2025-10-03 20:12:49 +00:00
3ca09d65f1 [ROCm] Enable several distributed UTs (#164390)
Increase the tolerance for the following UTs as there was a slight mismatch seen on MI200.
    - test_data_parallel.py:test_strided_grad_layout
    - test_c10d_nccl.py:test_grad_layout_1devicemodule_1replicaperprocess

Skip for MI200:
    - test_fully_shard_training.py:test_2d_mlp_with_nd_mesh
    - test_2d_composability.py:test_train_parity_2d_mlp
    - test_fully_shard_overlap.py:test_fully_shard_training_overlap

Fixes #159489
Fixes #159488
Fixes #152700
Fixes #125555
Fixes #134139

Working as is on both MI200 and MI300:
Fixes #125991
Fixes #125918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164390
Approved by: https://github.com/jeffdaily
2025-10-03 19:52:51 +00:00
1bb68271b7 Stop building nativert in OSS (#164463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164463
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-10-03 19:41:15 +00:00
9eb89a4ad5 Add missing TypeIs to torch/_inductor/ir.py (#164489)
This should be a TypeIs here

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164489
Approved by: https://github.com/mlazos
2025-10-03 19:34:20 +00:00
15d726005d Enable several unit tests on ROCm (#163087)
Code change enables:
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float16
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float32
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float64
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_gelu_cuda_float16
test_linalg::TestLinalgCUDA::test_eigh_svd_illcondition_matrix_input_should_not_crash_cuda_float32
test_linalg::TestLinalgCUDA::test_eigh_svd_illcondition_matrix_input_should_not_crash_cuda_float64
test_ops::TestCommonCUDA::test_complex_half_reference_testing_as_strided_scatter_cuda_complex32

Fixes https://github.com/pytorch/pytorch/issues/134687
Fixes https://github.com/pytorch/pytorch/issues/78205

Closing github issues:
inductor/test_gpu_cpp_wrapper unit tests:
Fixes https://github.com/pytorch/pytorch/issues/157084

test_nn unit tests:
Fixes https://github.com/pytorch/pytorch/issues/157167
Fixes https://github.com/pytorch/pytorch/issues/157119
Fixes https://github.com/pytorch/pytorch/issues/157118
Fixes https://github.com/pytorch/pytorch/issues/157115
Fixes https://github.com/pytorch/pytorch/issues/157081
Fixes https://github.com/pytorch/pytorch/issues/155216
Fixes https://github.com/pytorch/pytorch/issues/157259
Fixes https://github.com/pytorch/pytorch/issues/157166
Fixes https://github.com/pytorch/pytorch/issues/157165
Fixes https://github.com/pytorch/pytorch/issues/157164
Fixes https://github.com/pytorch/pytorch/issues/157117
Fixes https://github.com/pytorch/pytorch/issues/157116
Fixes https://github.com/pytorch/pytorch/issues/157114
Fixes https://github.com/pytorch/pytorch/issues/157113
Fixes https://github.com/pytorch/pytorch/issues/157082
Fixes https://github.com/pytorch/pytorch/issues/157080
Fixes https://github.com/pytorch/pytorch/issues/157079
Fixes https://github.com/pytorch/pytorch/issues/157078

test_linalg unit tests:
Fixes https://github.com/pytorch/pytorch/issues/157427
Fixes https://github.com/pytorch/pytorch/issues/157414
Fixes https://github.com/pytorch/pytorch/issues/157369
Fixes https://github.com/pytorch/pytorch/issues/157349
Fixes https://github.com/pytorch/pytorch/issues/157348
Fixes https://github.com/pytorch/pytorch/issues/157337
Fixes https://github.com/pytorch/pytorch/issues/157336
Fixes https://github.com/pytorch/pytorch/issues/157297
Fixes https://github.com/pytorch/pytorch/issues/157281
Fixes https://github.com/pytorch/pytorch/issues/157260
Fixes https://github.com/pytorch/pytorch/issues/157171
Fixes https://github.com/pytorch/pytorch/issues/157169
Fixes https://github.com/pytorch/pytorch/issues/157168
Fixes https://github.com/pytorch/pytorch/issues/157125
Fixes https://github.com/pytorch/pytorch/issues/157124
Fixes https://github.com/pytorch/pytorch/issues/157123
Fixes https://github.com/pytorch/pytorch/issues/157089
Fixes https://github.com/pytorch/pytorch/issues/157088
Fixes https://github.com/pytorch/pytorch/issues/157087
Fixes https://github.com/pytorch/pytorch/issues/157068
Fixes https://github.com/pytorch/pytorch/issues/157067
Fixes https://github.com/pytorch/pytorch/issues/157066
Fixes https://github.com/pytorch/pytorch/issues/157047
Fixes https://github.com/pytorch/pytorch/issues/157046
Fixes https://github.com/pytorch/pytorch/issues/157045
Fixes https://github.com/pytorch/pytorch/issues/157044
Fixes https://github.com/pytorch/pytorch/issues/156997
Fixes https://github.com/pytorch/pytorch/issues/156996
Fixes https://github.com/pytorch/pytorch/issues/156995
Fixes https://github.com/pytorch/pytorch/issues/156994
Fixes https://github.com/pytorch/pytorch/issues/156993
Fixes https://github.com/pytorch/pytorch/issues/156991
Fixes https://github.com/pytorch/pytorch/issues/156990
Fixes https://github.com/pytorch/pytorch/issues/156989
Fixes https://github.com/pytorch/pytorch/issues/105118
Fixes https://github.com/pytorch/pytorch/issues/157415
Fixes https://github.com/pytorch/pytorch/issues/157282
Fixes https://github.com/pytorch/pytorch/issues/157261
Fixes https://github.com/pytorch/pytorch/issues/157170
Fixes https://github.com/pytorch/pytorch/issues/157126

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163087
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-10-03 19:30:59 +00:00
16f9bef642 [precompile] Fix guard serialization loading bugs. (#164490)
Summary: Added a set of fixes triggered by fm training job. Overall the theme here is that we should get rid of saved objects as much as possible when they are not used in guard reconstruction. Sometimes for objects that cannot be saved (like local functions) we still try our best to save their closures.

Test Plan:
test_guard_serialization.py
test_lazy_awatiable.py

Differential Revision: D83766926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164490
Approved by: https://github.com/jamesjwu
2025-10-03 19:20:07 +00:00
3c59351c6e [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-03 18:59:12 +00:00
7eb1eb4313 ci: Removing ROCm tests from trunk. (#164585)
Had a conversation with the AMD team today and I think we are all in
agreement that the current state of queueing for AMD is beyond where
we'd like to be for there to be blocking CI for ROCm.

Moving the representative testing jobs for this into the ciflow/rocm
workflow.

I'd love for these to be back in trunk if we can get to a state where
our queueing metrics are below an hour for ROCm infrastructure.

Dashboards:
* ROCm Queueing (>60mins) ([link](https://hud.pytorch.org/queue_time_analysis?dateRange=30&startDate=2025-09-03T16%3A06%3A45.025Z&endDate=2025-10-03T16%3A06%3A45.025Z&granularity=week&chartType=bar&repos=pytorch%2Fpytorch&category=machine_type&machineTypes=linux.rocm.gpu.2&machineTypes=linux.rocm.gpu.4&machineTypes=linux.rocm.gpu.mi250&machineTypes=linux.rocm.gpu.gfx942.1&machineTypes=linux.rocm.gpu.mi250.4&machineTypes=linux.rocm.gpu.gfx942.4&machineTypes=linux.rocm.gpu.mi355.2&machineTypes=linux.rocm.gpu.gfx942.4.test&machineTypes=linux.rocm.gpu.mi250.1&machineTypes=linux.rocm.gpu.gfx942.1.test&machineTypes=linux.rocm.gpu.gfx90a.1&machineTypes=linux.rocm.gpu.gfx90a.4&items=linux.rocm.gpu.2&items=linux.rocm.gpu.4&items=linux.rocm.gpu.mi250&items=linux.rocm.gpu.gfx942.1&items=linux.rocm.gpu.mi250.4&items=linux.rocm.gpu.gfx942.4&items=linux.rocm.gpu.mi355.2&items=linux.rocm.gpu.gfx942.4.test&items=linux.rocm.gpu.mi250.1&items=linux.rocm.gpu.gfx942.1.test&items=linux.rocm.gpu.gfx90a.1&items=linux.rocm.gpu.gfx90a.4))
* NVIDIA queueing (<5mins) ([link](https://hud.pytorch.org/queue_time_analysis?dateRange=30&startDate=2025-09-03T16%3A05%3A08.000Z&endDate=2025-10-03T16%3A05%3A08.000Z&granularity=week&chartType=bar&repos=pytorch%2Fpytorch&category=machine_type&machineTypes=lf.linux.g4dn.4xlarge.nvidia.gpu&machineTypes=linux.g4dn.12xlarge.nvidia.gpu&machineTypes=linux.g4dn.metal.nvidia.gpu&machineTypes=linux.g5.4xlarge.nvidia.gpu&machineTypes=lf.linux.g4dn.12xlarge.nvidia.gpu&machineTypes=lf.linux.g5.12xlarge.nvidia.gpu&machineTypes=lf.linux.g5.4xlarge.nvidia.gpu&machineTypes=lf.linux.g6.4xlarge.experimental.nvidia.gpu&machineTypes=linux.g6.4xlarge.experimental.nvidia.gpu&machineTypes=linux.4xlarge.nvidia.gpu&machineTypes=linux.g5.12xlarge.nvidia.gpu&machineTypes=linux.g4dn.4xlarge.nvidia.gpu&machineTypes=lf.linux.4xlarge.nvidia.gpu&machineTypes=linux.g6.12xlarge.nvidia.gpu&items=lf.linux.g4dn.4xlarge.nvidia.gpu&items=linux.g4dn.12xlarge.nvidia.gpu&items=linux.g4dn.metal.nvidia.gpu&items=linux.g5.4xlarge.nvidia.gpu&items=lf.linux.g4dn.12xlarge.nvidia.gpu&items=lf.linux.g5.12xlarge.nvidia.gpu&items=lf.linux.g5.4xlarge.nvidia.gpu&items=lf.linux.g6.4xlarge.experimental.nvidia.gpu&items=linux.g6.4xlarge.experimental.nvidia.gpu&items=linux.4xlarge.nvidia.gpu&items=linux.g5.12xlarge.nvidia.gpu&items=linux.g4dn.4xlarge.nvidia.gpu&items=lf.linux.4xlarge.nvidia.gpu&items=linux.g6.12xlarge.nvidia.gpu))

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164585
Approved by: https://github.com/malfet, https://github.com/yangw-dev, https://github.com/atalman, https://github.com/jeffdaily
2025-10-03 18:19:24 +00:00
f39789cdab [PyTorch Pinned Allocator] Add support of reserved pinned memory segment to avoid slow paths (#164501)
Summary:
This diff adds the feature of allocating a large pinned memory segment upfront based on the provided config. This large segment is then used to serve all the small pinned memory requests to avoid expensive device level APIs (slow paths).

Example:

PYTORCH_CUDA_ALLOC_CONF=pinned_reserve_segment_size_mb:2048

This reserves a 2GB pinned memory segment for the process and then all incoming small requests are just served from this segment and no cudaHostAlloc/cudaHostRegister apis are being called.

Differential Revision: D83779074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164501
Approved by: https://github.com/yangw-dev
2025-10-03 18:11:27 +00:00
3d9d41c801 Remove old workaround in launch_logcumsumexp_cuda_kernel (#164567)
Remove workaround for CUDA 11.4 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164567
Approved by: https://github.com/Aidyn-A, https://github.com/Skylion007
2025-10-03 18:07:02 +00:00
5b0b4cda4a [dtensor] avoid shape recompilations on DTensorSpec (#163820)
skips DTensorSpec.sizes/strides in metadata guard checks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163820
Approved by: https://github.com/azahed98
2025-10-03 17:18:18 +00:00
2a11ce2c78 Support calling torch.compile inside non-strict export (#164171)
So this fixes at least two issues:
1) When we are invoking inductor backend, we apply pre-grad passes which try to find correct fake mode to use. In the nested case, we will run into clash when there is closure variable in the inductor region because non-strict would have fakified this variable before hand and inner torch.compile would have created a new fresh fake mode. This is not a problem in regular torch.compile because inner torch.compile gets ignored. I don't know if we are supposed to inherit fake mode from parent context in this case. But we can avoid this problem if we just default to eager backend which is fine in this case because the point of export is to capture aten operators. Going to inductor would mean we will lose inner torch.compile ops.
2) There is custom torch function modes in export that track number of torch fns executed and inner compile itself doesn't work because of guard failure as this mode state gets changed. I noticed torch.cond fixes this problem by carefully stashing the torch function mode and defer it in the backend. So the correct thing to do here is just re-use torch.cond implementation unconditionally.

So the things i did for fixing above were:
1) Always default to eager backend when compile is invoked inside export. I needed to make how torch.cond sets up the fresh tracing env into an util that can be shared.
2) The previous eager backend for torch.cond was wrong because the context managers didn't actually persist until the backend is invoked.
3) torch.cond used only disable TorchFunctionMetadata tf mode and stash it for later, but in fact, we should do both TorchFunctionMetadata and PreDispatchTorchFunctionMode.

With above fixes, we are able to export flex attention in export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164171
Approved by: https://github.com/ydwu4
2025-10-03 16:31:07 +00:00
3288fbf374 Change default device to current acclerator (#164399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164399
Approved by: https://github.com/albanD
2025-10-03 16:15:09 +00:00
fa5306b4f5 Support partial _DynamoCacheEntries when not all backends available (#163521)
Differential Revision: [D82735769](https://our.internmc.facebook.com/intern/diff/D82735769/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163521
Approved by: https://github.com/zhxchen17
2025-10-03 16:14:32 +00:00
5656d45c8f forward fix #164481 (#164578)
PR #164481 added unit test test_scaled_mm_preserves_strides in test/inductor/test_fp8.py. It was missing the adjustment for ROCm's F8 types on MI300.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164578
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-03 15:44:34 +00:00
e40fe634b1 Pin conda version for Docker builds (#164575)
Mitigates https://github.com/pytorch/pytorch/issues/164574
Remove unused CUDA_CHANNEL var - this was used before when we had  pytorch install via conda.

Please note: CUDA 13.0 failures are expected since the CI tries to build against prod and CUDA 13.0 is not available in prod yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164575
Approved by: https://github.com/malfet, https://github.com/Camyll
2025-10-03 15:01:35 +00:00
3db2164341 [torchfuzz] add norm operators (#164514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164514
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434
2025-10-03 14:44:19 +00:00
5bb8f04d3e [torchfuzz] add nn functional ops (#164434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164434
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432
2025-10-03 14:44:19 +00:00
5743d731c1 Use torch.testing.test_close instead of torch.testing.test_allclose (#164539)
Because torch.testing.test_allclose is deprecated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164539
Approved by: https://github.com/mlazos
2025-10-03 14:39:10 +00:00
aed66248a0 [vllm hash update] update the pinned vllm hash (#164319)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164319
Approved by: https://github.com/pytorchbot

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-10-03 12:30:33 +00:00
6c3c9414eb config for dcache + unit tests (#164512)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

Differential Revision: D83714687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164512
Approved by: https://github.com/jananisriram
2025-10-03 10:52:59 +00:00
eccf561326 Move call to output generated code in inductor (#161615)
This PR moves the call to copy the generated code from `/tmp/...` so that it is still called if attempting to compile the generated code fails. In both cases now, the generated code will be copied across to `torch_compile_debug/run_.../torchinductor/output_code.py` which makes debugging bad generated code easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161615
Approved by: https://github.com/eellison
2025-10-03 10:23:22 +00:00
ddf8de28c2 Add Rocm to Operator Microbenchmark CI (#164173)
This pull request adds support for running operator microbenchmarks on ROCm (AMD GPU) environments in the CI workflow. The main changes involve introducing new build and test jobs for ROCm in the `.github/workflows/operator_microbenchmark.yml` file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164173
Approved by: https://github.com/huydhn
2025-10-03 07:35:32 +00:00
7617b113ad [torchfuzz] Support EagerVsFullGraphDynamicCompileWithNumericsCheck (#164432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164432
Approved by: https://github.com/pianpwk
2025-10-03 06:42:20 +00:00
2a760dc51e [DeviceMesh] Simplifying internal bookkeeping with CuTe layout (#163213)
We want to refactor the internal bookkeeping of DeviceMesh so that:
Simply the bookkeeping logics and make it generic enough so that it is easy to support new transformations like flatten noncontiguous dim, reshape and unflatten. (We leveraged the CuTe layout). This new layout also let us handle non-contiguous slicing, flatten, transpose possible.

Concretely, in this PR, we do the following:
1. Use the `_MeshLayout` to handle all index operations rather use a map to record mesh dims.
2. Removed `flatten_name_to_root_dims`, because now we can directly get layout from a flattened device mesh.
3. Replaced `_get_slice_mesh_dims` with `_get_slice_mesh_layout`.
4. Use the newly added function `check_overlap` to check layout overlap.
5. Use a new function `to_remapping_tensor` to use layout ranks as indices when the mesh tensor is not representable as CuTe. The reason is that layout acts as a backend of mesh tensor bookkeeping (indexing indices), it needs to be used as indices for remap back to the mesh tensor for new DeviceMesh generation and backend init. For example, in the case of 2K to 4K, the underlying layout is (2K, 1) but the actual value of the mesh tensor is [2K, 2K+1, ....,]. While flattening, slicing, we need to remap the layout back to the new mesh tensor so it maps the actual device allocation. For example, in the 2K to 4K case, if the shape is (1K, 1K) with dim_names ("dp", "tp"). Then when slicing "tp", the mesh tensor should be (2K, 2K+1, ..., 3K-1) or (3K, 3K+1, ... 4K-1). not the global ranks generated from the layout. (1K, 1).

Verified that loss curve is very close for DeepSeekV3 on torchtitan, note that exact same match is challenging because even if we run the baseline twice, the loss curve does not exactly match.

<img width="1113" height="490" alt="image" src="https://github.com/user-attachments/assets/7877b5a4-337e-4ad8-b878-2378f4f0f38d" />

The PR looks big indeed but we don't change any existing behavior of DeviceMesh, so it is a pure refactor.

With this refactoring we also enabled the slicing and flatten of non-contiguous dims of a device mesh which is hard to implement without cute layout.

This is a continue of https://github.com/pytorch/pytorch/pull/161106 (original one got messed with EasyCLA)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163213
Approved by: https://github.com/lw, https://github.com/fegin
2025-10-03 05:51:28 +00:00
6c209bfc5c [cutlass-4][take 2] upgrade to cutlass 4.2.1 (#164159)
Test Plan: Sandcastle

Differential Revision: D83492704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164159
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-10-03 03:47:59 +00:00
1051c1de5c Add pyrefly suppressions 2/n (#164513)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

---
step 1: uncomment lines in the `pyrefly.toml` file
before: https://gist.github.com/maggiemoss/911b4d0bc88bf8cf3ab91f67184e9d46

after:
```
 INFO Checking project configured at `/Users/maggiemoss/python_projects/pytorch/pyrefly.toml`
 INFO 0 errors (1,152 ignored)
 ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164513
Approved by: https://github.com/oulgen
2025-10-03 02:46:13 +00:00
d1cbb74fb1 multimem reduce (#164517)
Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517
Approved by: https://github.com/ngimel
2025-10-03 02:41:10 +00:00
91c4db76cb fix flex attention eager: dont round down scores to low-precision (closes #163588) (#163986)
Fixes: https://github.com/pytorch/pytorch/issues/163588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163986
Approved by: https://github.com/drisspg, https://github.com/mlazos
2025-10-03 01:09:59 +00:00
4691fe6070 remove unnecessary registration (#164481)
scaled_mm already had `needs_exact_strides` in its op registration. also added a test showing these strides are being respected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164481
Approved by: https://github.com/drisspg, https://github.com/mlazos
2025-10-03 01:03:12 +00:00
ef50c6e3e3 [MPS] Add backward pass for embedding_bag (#163931)
Fixes #162270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163931
Approved by: https://github.com/malfet
2025-10-03 00:48:38 +00:00
86474ce996 Update mask dtype (#164472)
Differential Revision: [D83781684](https://our.internmc.facebook.com/intern/diff/D83781684)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164472
Approved by: https://github.com/bdhirsh
2025-10-03 00:19:36 +00:00
18e18488e8 [6/N] Apply ruff UP035 rule (#164438)
Continued code migration to enable ruff UP035. Most changes are about moving `Callable` from typing to from collections.abc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164438
Approved by: https://github.com/ezyang
2025-10-03 00:15:32 +00:00
f7082e92b3 [cuBLAS] update cuBLAS determinism docs, remove workspace requirement checks (#161749)
Since CUDA 11.x (need to update the docs for this, current PR is saying 12.2 which is incorrect) we've been allocating cuBLAS workspaces explicitly per handle/stream combination https://github.com/pytorch/pytorch/pull/85447

According to the cuBLAS documentation, this appears to be sufficient for determinism without any explicit workspace requirements to e.g., `:4096:8` or `:16:8` as was previously expressed in PyTorch docs https://docs.nvidia.com/cuda/cublas/#results-reproducibility

Planning to add an explicit determinism test as well...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161749
Approved by: https://github.com/ngimel
2025-10-03 00:09:47 +00:00
95a053284c Fix vllm build issue (#164361)
Fixes #ISSUE_NUMBER
unstable https://github.com/pytorch/pytorch/issues/164362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164361
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-10-02 23:34:21 +00:00
c7e30ae4dd MX: Remove redundant PLATFORM_SUPPORTS_MX_GEMM constant (#164320)
Deleted duplicate definition of PLATFORM_SUPPORTS_MX_GEMM, was introduced in https://github.com/pytorch/pytorch/pull/162209
Also, adjusted BLOCK_SIZE and fp4_scaling_dtype in test_matmul_cuda.py to enable test_blockwise_nvfp4_compile on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164320
Approved by: https://github.com/jeffdaily
2025-10-02 23:30:56 +00:00
dca73982c5 Support setting grad_dtype on leaf tensors (#162815)
`grad_dtype` is a new attribute on Tensor to control gradient dtype:
- Access/setting is leaf-only.
- grad_dtype is respected when (1) when assigning to .grad, and (2) in the engine after the previous node produces incoming gradients for AccumulateGrad. (See table below for details)
- Not setting grad_dtype preserves the current behavior. Accessing it returns `t.dtype`
- `grad_dtype` cannot be set when there is already a `.grad` present and the dtypes conflict.

| `grad_dtype` setting | Setting `.grad` manually | Incoming gradient from autograd engine |
|-----------------------|--------------------------|-----------------------------------------|
| **Default (tensor’s dtype)** | `.grad` must match tensor’s dtype | Engine casts incoming grad to tensor’s dtype |
| **Set to specific dtype** | `.grad` must match that dtype | Engine casts incoming grad to the specified dtype |
| **Set to `None`** | `.grad` may be any dtype | Engine does not cast; accepts incoming grad dtype as-is |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162815
Approved by: https://github.com/albanD
2025-10-02 23:09:07 +00:00
43848b71d9 Improved support for autotuning in wrapper_fxir (#164132)
Summary:
- correct dtype propagation
- allow more more options to be passed to compiler

Test Plan: in follow up change

Differential Revision: D83367909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164132
Approved by: https://github.com/jansel
2025-10-02 22:54:22 +00:00
15c8bdcc5e Fix FloorDiv should not generate non integer rationals (due to sympy bug) (#164398)
FloorDiv eval have this optimization
```
  # Expands (x + y) // b into x // b + y // b.
  # This only works if floor is an identity, i.e. x / b is an integer.
 ```

 Before this PR this optimization would generate a result in an expression like this. Duo to a bug in sympy.
 ```
Mul(Rational(1, 22), Add(Mul(Integer(24), Symbol('s37', integer=True, positive=True)), Integer(672)), FloorDiv(Mul(Symbol('s14', integer=True, positive=True), Symbol('s46', integer=True, positive=True)), Integer(2016)))
 ```

 This is because in sympy an expression can have .is_integer =True yet have 1/22 in it!
 This PR ensure we do not generate that by simply opting out if this optimization if we end
 up with quotient that have such rational.

  Fix
  https://github.com/pytorch/pytorch/issues/164385,
  https://github.com/pytorch/pytorch/issues/154996
  https://github.com/pytorch/pytorch/issues/153375
  https://github.com/pytorch/pytorch/issues/164063
and internal user issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164398
Approved by: https://github.com/jansel, https://github.com/isuruf
2025-10-02 22:51:03 +00:00
22e219d996 Revert "[DeviceMesh] Simplifying internal bookkeeping with CuTe layout (#163213)"
This reverts commit b0985144b59db8fb20964829b5e0a9d2f9a3f0d6.

Reverted https://github.com/pytorch/pytorch/pull/163213 on behalf of https://github.com/yangw-dev due to caused internal test failure ([comment](https://github.com/pytorch/pytorch/pull/163213#issuecomment-3363414435))
2025-10-02 22:22:26 +00:00
bdc0a421d7 Stop parsing command line arguments every time common_utils is imported. (#156703)
Last PR in the series to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs:

https://github.com/pytorch/pytorch/pull/154612
https://github.com/pytorch/pytorch/pull/154628
https://github.com/pytorch/pytorch/pull/154715
https://github.com/pytorch/pytorch/pull/154716
https://github.com/pytorch/pytorch/pull/154725
https://github.com/pytorch/pytorch/pull/154728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156703
Approved by: https://github.com/clee2000
2025-10-02 22:22:04 +00:00
ece5e0f01b Fake process group Direct construction error (#163665)
Fixes #162129. Added validation in _rank_not_in_group() to check if ```FakeProcessGroup``` is properly initialized before use, raising a clear error message if ```torch.distributed.init_process_group(backend='fake')``` hasn't been called first.
This prevents silent failures and ensures proper dispatch system integration for all distributed operations.

Added test case test_fake_process_group_direct_usage_error() that validates the error is raised for ```all_reduce``` and ```all_to_all_single``` operations.

Please let me know if additional distributed operators should be tested or if any other updates are needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163665
Approved by: https://github.com/ezyang
2025-10-02 22:19:26 +00:00
a34797e031 Revert "Add provenance to inductor IR nodes created after graph.run (#164255)"
This reverts commit b9e73e639e36f3aa628752161711e68878231b30.

Reverted https://github.com/pytorch/pytorch/pull/164255 on behalf of https://github.com/jeffdaily due to broke rocm; inductor/test_provenance_tracing.py::TestProvenanceTracingStackTraces::test_deferred_triton_kernels [GH job link](https://github.com/pytorch/pytorch/actions/runs/18200790301/job/51821738132) [HUD commit link](b9e73e639e) ([comment](https://github.com/pytorch/pytorch/pull/164255#issuecomment-3363360088))
2025-10-02 22:01:41 +00:00
f465ea6752 [inductor] require shape in TritonCSEVariable (#162275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162275
Approved by: https://github.com/mlazos
ghstack dependencies: #164158
2025-10-02 21:52:09 +00:00
a8edccfbf4 [inductor] fix TestTemplateRender in select_algorithm (#164158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164158
Approved by: https://github.com/mlazos
2025-10-02 21:52:09 +00:00
6389658ec6 Fix type hints in PrepareModuleInput and PrepareModuleInputOutput (#164482)
Fixes #161646

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164482
Approved by: https://github.com/Skylion007
2025-10-02 21:40:43 +00:00
cc71ab86a6 [DTensor] raise error if the local_tensor argument passed to DTensor.from_local is a DTensor (#164496)
**Summary**
Raise error when the `local_tensor` argument passed to `DTensor.from_local` is
a DTensor, this prevents users from accidentally calling `from_local` over a DTensor
object.

The error message is organized in this way:
```
the local_tensor argument only accepts torch.Tensor but got <class 'torch.distributed.tensor.DTensor'> value.
```

**Test**
`pytest test/distributed/tensor/test_dtensor.py -k test_from_local`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164496
Approved by: https://github.com/ezyang
2025-10-02 21:25:01 +00:00
2a7c486750 Revert "Speed up FP precision lookup (#164044)"
This reverts commit 723ba213932bb1eca90109e003250ebb0da45eb1.

Reverted https://github.com/pytorch/pytorch/pull/164044 on behalf of https://github.com/yangw-dev due to broke internal build In file included from xplat/caffe2/aten/src/ATen/DeviceAccelerator.cpp:1: xplat/caffe2/aten/src/ATen/Context.h:502:38: error: shift count >= width of type [-Werror,-Wshift-count-overflow] 502 | return std::hash<size_t>{}((k1 << 32) | k2); ([comment](https://github.com/pytorch/pytorch/pull/164044#issuecomment-3363016702))
2025-10-02 21:00:44 +00:00
5f18f240de Add initial suppressions for pyrefly (#164177)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
`python3 scripts/lintrunner.py`
`pyrefly check`

---

Pyrefly check before: https://gist.github.com/maggiemoss/3a0aa0b6cdda0e449cd5743d5fce2c60
After:

```
 INFO Checking project configured at `/Users/maggiemoss/python_projects/pytorch/pyrefly.toml`
 INFO 0 errors (1,063 ignored)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164177
Approved by: https://github.com/Lucaskabela
2025-10-02 20:57:41 +00:00
6b7970192f [ROCm][CI] fix test_cudnn_convolution_relu_cuda (#164466)
Fixes #162816.
Test was comparing output of conv vs fused conv but inputs were different memory formats. Also fix test_cudnn_convolution_add_relu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164466
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-02 20:36:54 +00:00
115af42e9d Fix readibility checks in TIDY and apply them (#164475)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164475
Approved by: https://github.com/albanD, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-10-02 20:34:49 +00:00
5f775bdfb7 Fix THP_PyObject_VirtualFree return type (#163763)
# Motivation
`void THP_PyObject_VirtualFree` should have no return value; otherwise, it would raise a build warning
```bash
C:\Users\guangyey\pytorch\torch\csrc\dynamo\cpython_defs.c(264): warning C4098: 'THP_PyObject_VirtualFree': 'void' function returning a value
```
# Additional Context
Refer to
c4f21d7c7c/Include/cpython/objimpl.h (L59-L68)
PyObjectArenaAllocator::free is defined with `void` return type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163763
Approved by: https://github.com/albanD, https://github.com/williamwen42
2025-10-02 20:21:53 +00:00
8c54101933 add tensor subclass printing support in fx/graph.py (#164403)
it was previously quite misleading since it looks like the inputs to the
dynamo graph are plain tensors when in reality they are tensor subclasses

before
```
class GraphModule(torch.nn.Module):
    def forward(self, L_input_batch_inputs_: "i64[2, 512][512, 1]cuda:0", L_self_parameters_weight_: "f32[202048, 256][256, 1]cuda:0"):
```

after
```
    class GraphModule(torch.nn.Module):
        def forward(self, L_input_batch_inputs_: "DTensor(i64[2, 512][512, 1]cuda:0)", L_self_parameters_weight_: "DTensor(f32[202048, 256][256, 1]cuda:0)"):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164403
Approved by: https://github.com/ezyang
2025-10-02 20:06:12 +00:00
c45d56dd00 typo corrected in ivalue.cpp's comment (#164485)
Fixes #164483

typo corrected in ivalue.cpp's comment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164485
Approved by: https://github.com/Skylion007
2025-10-02 20:01:17 +00:00
33b17bc619 Remove old CUDA version checks (#164199)
Remove some version check code for CUDA <12.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164199
Approved by: https://github.com/ezyang
2025-10-02 19:55:47 +00:00
22b1710252 Use posix_fallocate() to reserve disk space for shared memory (#161910)
Shared memory is allocated by creating a file in /dev/shm (by default) that can run out of space. Pytorch reserves the file size by calling ftruncate() that creates a sparse file, so it succeeds even if sufficient disk space is not available.

This could lead to a situation when a shared memory region is successfully created but a subsequent access to a shared memory page results in SIGBUS due to the disk being full.

Using posix_fallocate() instead of ftruncate() eliminates this problem because the former syscall always allocates space and it returns an error if the disk is full.

Related to https://github.com/pytorch/pytorch/issues/5040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161910
Approved by: https://github.com/mikaylagawarecki
2025-10-02 19:12:57 +00:00
4661200125 [RELAND v2] Close some sources of fake tensors (#164372)
Changelog:

1. When we run into an operation we didn't proxy, we end up emitting fake constants. We error under a config and we disable the config for some internal users. The reason we want to error is this signals a coverage problem we need to address but at the same time, we don't wnat to be disruptive to already working flows.

2. Previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes silent incorrectness issue of exporting esm and qwen in non-strict and some torchbench models like levit_128 and demucs.

3. Previous logic didn't work on the cases where we mutate a container attribute as the previous approach used to pytree over old and new attributes resulting in length mismatch. We gracefully handle this now.

Differential Revision: [D83673054](https://our.internmc.facebook.com/intern/diff/D83673054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164372
Approved by: https://github.com/avikchaudhuri
2025-10-02 18:58:52 +00:00
6a31f42da4 Fix NestedTensor max/min operations for integer dtypes. (#162273)
Fixes: https://github.com/pytorch/pytorch/issues/162049

### Summary

max_dim and min_dim functions incorrectly used torch.finfo()
for all dtypes, causing TypeError for integer tensors.

### Changes

- Use torch.iinfo() for integer dtypes instead of torch.finfo().
- Add CPU test: `test_jagged_max_min_dtypes` covering `int8, int16, int32, int64, uint8, float16, bfloat16, float32 and float64`

### Testing

Before Fix:

`python -m pytest test/test_nestedtensor.py -k "test_jagged_max_min_dtypes" -v`

Output:

```
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_bfloat16 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float16 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float32 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float64 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int16 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0005s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int32 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0005s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int64 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0004s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int8 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0004s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_uint8 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
```

After Fix:

`python -m pytest test/test_nestedtensor.py -k "test_jagged_max_min_dtypes" -v`

Output:

```
Running 9 items in this shard

test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_bfloat16 PASSED [0.0086s]                                                                                                                   [ 11%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float16 PASSED [0.0011s]                                                                                                                    [ 22%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float32 PASSED [0.0011s]                                                                                                                    [ 33%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float64 PASSED [0.0011s]                                                                                                                    [ 44%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int16 PASSED [0.0009s]                                                                                                                      [ 55%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int32 PASSED [0.0010s]                                                                                                                      [ 66%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int64 PASSED [0.0010s]                                                                                                                      [ 77%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int8 PASSED [0.0010s]                                                                                                                       [ 88%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_uint8 PASSED [0.0011s]                                                                                                                       [100%]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162273
Approved by: https://github.com/Skylion007, https://github.com/jbschlosser
2025-10-02 18:46:27 +00:00
c6a6c80a73 Add Aidyn-A to CUDA codeowners (#164436)
Adding myself to "CUDA and CUDA math libraries" section.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164436
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
2025-10-02 18:34:10 +00:00
bf717ce346 [AOTI win] Add ABI stable method for updating constant buffer (#163819)
Add `struct AOTInductorConstantMapEntry` to represent the constant map in AOTI Model. We cannot use `std::unordered_map` for cross-compilation, because it is not ABI stable.

it will be tested when we test `update_user_managed_constant_buffer` for windows cross-compilation

Example usage:

```
        // Load constants. Create random constants here.
        auto* fc1_w = new slim::SlimTensor(slim::empty({16, 10}, c10::kFloat, c10::Device(c10::kCUDA, 0)));
        fc1_w->fill_(1.0);

.....

        // Build pairs
        std::vector<AOTInductorConstantPair> constants{
            {"fc1_weight", fc1_w},
            {"fc1_bias",   fc1_b},
            {"fc2_weight", fc2_w},
            {"fc2_bias",   fc2_b},
        };

        // Call runtime (pass raw pointer + size)
        update_user_managed_constant_buffer_abi(
            container_handle,
            constants.data(),
            constants.size(),
            /*use_inactive=*/false,
            /*validate_full_update=*/true);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163819
Approved by: https://github.com/desertfire
2025-10-02 18:31:00 +00:00
f6f7676756 Revert "C++-accessible Placements via pybind11 (#163030)"
This reverts commit 3e03deab6f3c268c85c8efd9546e28cdda0fa4cc.

Reverted https://github.com/pytorch/pytorch/pull/163030 on behalf of https://github.com/swolchok due to doesn't pass pyre ([comment](https://github.com/pytorch/pytorch/pull/163030#issuecomment-3362450379))
2025-10-02 18:25:24 +00:00
e6d4b26776 Update torch.rst (#164408)
Corrected grammatical mistake

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164408
Approved by: https://github.com/mikaylagawarecki
2025-10-02 18:12:47 +00:00
6bb021c125 Revert "Use TMA loads always for Triton grouped MM kernel (#164256)"
This reverts commit b1033789fea2bc82901eafed498a5252985b80e9.

Reverted https://github.com/pytorch/pytorch/pull/164256 on behalf of https://github.com/yangw-dev due to  failed internal test: (pytorch.tritonbench.test.test_gpu.main.TestTritonbenchGpu) Error Details: torch._inductor.exc.InductorError: LoweringException: NoValidChoicesError: No choices to select. Provided reason: All choices failed to compile for backend. please consider adding ATEN into max_autotune_gemm_backends config (defined in torch/_inductor/config.py) to allow at least one choice.  ([comment](https://github.com/pytorch/pytorch/pull/164256#issuecomment-3362359624))
2025-10-02 17:55:37 +00:00
b9e73e639e Add provenance to inductor IR nodes created after graph.run (#164255)
Summary:
as title

- Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes.

- Extract `assign_origin_node` function

Differential Revision: D82871244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164255
Approved by: https://github.com/mlazos
2025-10-02 17:32:46 +00:00
0319556a35 Revert "[vision hash update] update the pinned vision hash (#154694)"
This reverts commit bcafea5c92ca2ee1b0dc8f6d8b62ecabb6f40228.

Reverted https://github.com/pytorch/pytorch/pull/154694 on behalf of https://github.com/yangw-dev due to break the unittest for inductor with improved, update benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv, see failure example https://github.com/pytorch/pytorch/actions/runs/18185852421/job/51776537817 ([comment](https://github.com/pytorch/pytorch/pull/154694#issuecomment-3362285901))
2025-10-02 17:32:04 +00:00
f4cf75688f Add CUDA release architecture matrix (#164471)
We should surface the CUDA architecture matrix to make things more transparent. I believe this can later become its own page where we will publish supported matrix for each release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164471
Approved by: https://github.com/Camyll
2025-10-02 16:59:48 +00:00
39189592fd Revert "Stop parsing command line arguments every time common_utils is imported. (#156703)"
This reverts commit ac7b4e7fe4d233dcd7f6343d42b4fa3d64bce548.

Reverted https://github.com/pytorch/pytorch/pull/156703 on behalf of https://github.com/clee2000 due to failing internally D80206253, see above comment for details ([comment](https://github.com/pytorch/pytorch/pull/156703#issuecomment-3362156908))
2025-10-02 16:54:22 +00:00
235b995ce1 Make sure Windows CUDA 12.8 build follow same arches as Linux builds (#164470)
I believe ``set TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0`` is the one thats actually used. Hence remove 6.1  to align the support with Linux support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164470
Approved by: https://github.com/tinglvv, https://github.com/nWEIdia, https://github.com/Camyll
2025-10-02 16:34:42 +00:00
ac7b4e7fe4 Stop parsing command line arguments every time common_utils is imported. (#156703)
Last PR in the series to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs:

https://github.com/pytorch/pytorch/pull/154612
https://github.com/pytorch/pytorch/pull/154628
https://github.com/pytorch/pytorch/pull/154715
https://github.com/pytorch/pytorch/pull/154716
https://github.com/pytorch/pytorch/pull/154725
https://github.com/pytorch/pytorch/pull/154728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156703
Approved by: https://github.com/clee2000
2025-10-02 15:48:47 +00:00
c6329524d8 Revert "Add magic TORCH_MAKE_PYBIND_ENUM_FASTER macro (#163527)"
This reverts commit 50c0550f5a5b1e35885d892081a7d5115d8b4489.

Reverted https://github.com/pytorch/pytorch/pull/163527 on behalf of https://github.com/swolchok due to breaking import torch in debug builds, see #164297 ([comment](https://github.com/pytorch/pytorch/pull/163527#issuecomment-3361919142))
2025-10-02 15:42:42 +00:00
b0985144b5 [DeviceMesh] Simplifying internal bookkeeping with CuTe layout (#163213)
We want to refactor the internal bookkeeping of DeviceMesh so that:
Simply the bookkeeping logics and make it generic enough so that it is easy to support new transformations like flatten noncontiguous dim, reshape and unflatten. (We leveraged the CuTe layout). This new layout also let us handle non-contiguous slicing, flatten, transpose possible.

Concretely, in this PR, we do the following:
1. Use the `_MeshLayout` to handle all index operations rather use a map to record mesh dims.
2. Removed `flatten_name_to_root_dims`, because now we can directly get layout from a flattened device mesh.
3. Replaced `_get_slice_mesh_dims` with `_get_slice_mesh_layout`.
4. Use the newly added function `check_overlap` to check layout overlap.
5. Use a new function `to_remapping_tensor` to use layout ranks as indices when the mesh tensor is not representable as CuTe. The reason is that layout acts as a backend of mesh tensor bookkeeping (indexing indices), it needs to be used as indices for remap back to the mesh tensor for new DeviceMesh generation and backend init. For example, in the case of 2K to 4K, the underlying layout is (2K, 1) but the actual value of the mesh tensor is [2K, 2K+1, ....,]. While flattening, slicing, we need to remap the layout back to the new mesh tensor so it maps the actual device allocation. For example, in the 2K to 4K case, if the shape is (1K, 1K) with dim_names ("dp", "tp"). Then when slicing "tp", the mesh tensor should be (2K, 2K+1, ..., 3K-1) or (3K, 3K+1, ... 4K-1). not the global ranks generated from the layout. (1K, 1).

Verified that loss curve is very close for DeepSeekV3 on torchtitan, note that exact same match is challenging because even if we run the baseline twice, the loss curve does not exactly match.

<img width="1113" height="490" alt="image" src="https://github.com/user-attachments/assets/7877b5a4-337e-4ad8-b878-2378f4f0f38d" />

The PR looks big indeed but we don't change any existing behavior of DeviceMesh, so it is a pure refactor.

With this refactoring we also enabled the slicing and flatten of non-contiguous dims of a device mesh which is hard to implement without cute layout.

This is a continue of https://github.com/pytorch/pytorch/pull/161106 (original one got messed with EasyCLA)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163213
Approved by: https://github.com/lw, https://github.com/fegin
2025-10-02 15:42:03 +00:00
7cfecd76b2 Revert "Improve repeat op to a single copy (#163842)"
This reverts commit 590224f83c8d575b52c6bc40a984132fa593256e.

Reverted https://github.com/pytorch/pytorch/pull/163842 on behalf of https://github.com/yangw-dev due to internal test failed: RuntimeError: false INTERNAL ASSERT FAILED at aten/src/ATen/quantized/Quantizer.cpp:441, . cannot call qscheme on UnknownQuantizer please reach out folks who have internal access for furthur debugging. ([comment](https://github.com/pytorch/pytorch/pull/163842#issuecomment-3361746041))
2025-10-02 15:22:19 +00:00
bac0f289a3 Add methods to access data and unpack_hook on SavedVariable (#164358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164358
Approved by: https://github.com/albanD
2025-10-02 13:05:16 +00:00
39c340ec9e Add failing bitwise equivalence UT for aot_eager on rms_norm (#164280)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164280
Approved by: https://github.com/albanD
2025-10-02 09:05:28 +00:00
cfd46d13e6 Fix SAC + Flex issue (#164421)
# Summary

This happends when flex_attention is not tagged with the ` CheckpointPolicy.MUST_SAVE` policy. This causes the lse to be unrealized. I think in general this probably not the best policy but we shoudn't error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164421
Approved by: https://github.com/Skylion007
2025-10-02 09:02:17 +00:00
0e5773b7fa [dynamo][export] Do not graph break on torch.autograd._profiler_enabled for export (#164418)
Actually we would like to not graph break even in the case of Dynamo. But there is a weird-unsolved bug with Kineto + Dynamo when there are distributed jobs that lead to NCCL timeouts. This bug is a rare edege case, but we have not been able to root cause it yet.

But for export, we do not anticipate JIT tracing in distributed job training and therefore this PR is safe for export.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164418
Approved by: https://github.com/StrongerXi, https://github.com/williamwen42
2025-10-02 09:00:00 +00:00
2c2e1268b7 [inductor] Handle patterns where input/output nodes are the same (#163994)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163994
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-10-02 08:37:55 +00:00
00f0365b95 [torchfuzz] add test suite of fuzzer repros that we xfail (#164430)
i'll add the rest of the repros once in a follow up PR once we agree on a good test harness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164430
Approved by: https://github.com/ezyang
2025-10-02 08:05:11 +00:00
6bb586eafd [PyTorch / Sigrid GPU] Fixes in pinned stats collection and add new ODS pinned memory stats (#164412)
We do some fixes in pinned memory allocation stats collection and better differentiate between active vs allocated bytes.
Reviewed By: bbus, sayitmemory

Differential Revision: D83162346

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164412
Approved by: https://github.com/mradmila
2025-10-02 08:04:05 +00:00
9697a7ce9e Better path handling for nightly setup tool (#164215)
Resolves https://github.com/pytorch/pytorch/issues/164010#issuecomment-3349283789, cc @filipviz

Previously, the `checkout` subcommand would reuse the `venv`, while the `pull` subcommand would remove and recreate a fresh new `venv` (without prompting before deleting).

This PR:

- Keep and reuse the existing `venv` by default (both `pull` and `checkout`).
- Add a new `--fresh` option to delete and recreate a fresh new `venv`.
- Prompt the user for confirmation (add a new `--yes` option) before deleting the existing prefix path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164215
Approved by: https://github.com/ezyang, https://github.com/malfet
ghstack dependencies: #162324, #164214
2025-10-02 07:59:17 +00:00
27eb36debb DebugMode add ignore_compile_internals (#164205)
Fixes #164143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164205
Approved by: https://github.com/albanD
2025-10-02 07:39:54 +00:00
a43c4c3972 [5/N] Apply ruff UP035 rule (#164423)
Continued code migration to enable ruff `UP035`. Most changes are about moving `Callable` from `typing` to `from collections.abc`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164423
Approved by: https://github.com/ezyang
2025-10-02 07:31:11 +00:00
bcafea5c92 [vision hash update] update the pinned vision hash (#154694)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154694
Approved by: https://github.com/pytorchbot

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-10-02 07:02:40 +00:00
3924f784ba unbacked reshape_copy (#164336)
address https://github.com/pytorch/pytorch/issues/162110
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164336
Approved by: https://github.com/ColinPeppler
2025-10-02 06:50:48 +00:00
93e833de0f [inductor] separate preamble from main work in compile_fx (#164169)
A couple minor things to clean up the structure of `compile_fx` before we hit pre grad passes:
1. After patching config and recursively calling `compile_fx`, we don't need the patches any more. We make the subsequent logic call a `_maybe_wrap_and_compile_fx_main` (both when cpp wrapper exists and doesn't).
2. There's some recursive wrapping that happens on inputs and outputs before hitting pre grad passes, which are now also separated out before calling a `_compile_fx_main`, where actual work finally happens.

These also happen to fix a couple of TODOs in the old code.

Differential Revision: D83500704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164169
Approved by: https://github.com/zhxchen17
2025-10-02 05:44:31 +00:00
14791ea947 [inductor] teach bisector to look at pre_grad passes (#164250)
Bisector was not aware of pre-grad passes. Now that pre-grad passes use their own graph transformer observer subsystem, it is possible to disable these passes in the bisector.

Differential Revision: D83573614

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164250
Approved by: https://github.com/eellison, https://github.com/mlazos
2025-10-02 05:42:18 +00:00
702f6e703b [MTIA] Enable deserialization for FP8 checkpoint loading (#163559)
Summary: It looks like loading FP8 checkpoints goes through that path which wasn't enabled for MTIA beforehand, whereas loading BF16 checkpoints didn't.

Differential Revision: D82997140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163559
Approved by: https://github.com/mikaylagawarecki
2025-10-02 04:18:46 +00:00
39b31a6bfd [torchfuzz] keep track of operator stats (#164334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164334
Approved by: https://github.com/pianpwk
ghstack dependencies: #164034, #164209, #164211, #164210, #164397, #164284
2025-10-02 03:48:07 +00:00
0fbe3f19c7 [torchfuzz] add matmuls (#164284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164284
Approved by: https://github.com/pianpwk
ghstack dependencies: #164034, #164209, #164211, #164210, #164397
2025-10-02 03:33:10 +00:00
144378615a [torchfuzz] make fuzzer deterministic (#164397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164397
Approved by: https://github.com/pianpwk
ghstack dependencies: #164034, #164209, #164211, #164210
2025-10-02 03:10:30 +00:00
5dbae1eae2 Fix unbacked replacement where LHS is purely backed expr and RHS is unbacked expr (#164013)
## Scenario
- If there's a `torch._check(backed_expr == unbacked_symbol)`
- then we should replace unbacked_symbol for backed_expr
- currently, we don't do that when generating inputs for autotune_at_compile_time

## Error traceback
```
$ python test/inductor/test_aot_inductor.py -k test_size_with_unbacked_add_expr_transitive
  ...
  File "/data/users/colinpeppler/pytorch/torch/_inductor/compile_fx.py", line 1696, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/compile_fx.py", line 1187, in codegen_and_compile
    dynamo_utils.preserve_rng_state(),
  File "/home/colinpeppler/.conda/envs/pytorch/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/data/users/colinpeppler/pytorch/torch/_dynamo/utils.py", line 2236, in preserve_rng_state
    torch.cuda.set_rng_state(cuda_rng_state)  # type: ignore[possibly-undefined]
  File "/data/users/colinpeppler/pytorch/torch/cuda/random.py", line 79, in set_rng_state
    _lazy_call(cb)
  File "/data/users/colinpeppler/pytorch/torch/cuda/__init__.py", line 341, in _lazy_call
    callable()
  File "/data/users/colinpeppler/pytorch/torch/cuda/random.py", line 77, in cb
    default_generator.set_state(new_state)
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
```

## Bad autotuning input generation
```
# assume unbacked_symint_fallback = 16
# we generate too small of an input (16)
buf11 = generate_example_value((16, 256), (256, 1), 'cuda:0', torch.float32, 0, (16, 256))
triton_poi_fused_ones_1.run(buf11, 4096, stream=stream0)

stream0 = get_raw_stream(0)
buf12 = generate_example_value((16, 256), (256, 1), 'cuda:0', torch.float32, 0, (16, 256))
buf13 = generate_example_value((16, 256), (256, 1), 'cuda:0', torch.float32, 0, (16, 256))
add_kernel_1.run(buf11, buf12, buf13, 4096, 16, 1, 1, stream=stream0)
del buf11, buf12

stream0 = get_raw_stream(0)
buf15 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
triton_poi_fused_add_mul_2.run(buf2, buf13, buf15, 2688000, stream=stream0)
```

## Good autotuning input generation
```
# notice we generate with the proper size now (10500)
buf11 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
triton_poi_fused_ones_1.run(buf11, 2688000, stream=stream0)

stream0 = get_raw_stream(0)
buf12 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
buf13 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
add_kernel_1.run(buf11, buf12, buf13, 2688000, 10500, 1, 1, stream=stream0)
del buf11, buf12

stream0 = get_raw_stream(0)
buf15 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
triton_poi_fused_add_mul_2.run(buf2, buf13, buf15, 2688000, stream=stream0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164013
Approved by: https://github.com/cp2923, https://github.com/laithsakka
2025-10-02 02:40:54 +00:00
3e03deab6f C++-accessible Placements via pybind11 (#163030)
This makes Placement data representation available in C++ via pybind11.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163030
Approved by: https://github.com/ezyang
2025-10-02 02:38:23 +00:00
349e9e922d [cutass backend] remove cutlass presets (#164380)
Differential Revision: [D83674898](https://our.internmc.facebook.com/intern/diff/D83674898/)

Changes made by claude code (need to remove test too)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164380
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-10-02 01:26:00 +00:00
8b29c59844 [CI][CUDA] Fix distributed tests for b200 (#164345)
This PR fixes the tests that were encountered in #159323.
Namely it fixes #162746 and #162745.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164345
Approved by: https://github.com/eqy
2025-10-02 01:13:49 +00:00
53860ef4e1 Better error handling in torch/csrc/jit/codegen/* (#163948)
Refactor error handling by using TORCH_CHECK for improved clarity in constants and scope management in torch/csrc/jit/codegen/*

Fixes some parts of ISSUE #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163948
Approved by: https://github.com/cyyever, https://github.com/FFFrog, https://github.com/albanD
2025-10-02 01:10:09 +00:00
723ba21393 Speed up FP precision lookup (#164044)
This commit simplifies the precision lookup and setting logic
by reducing the number of branches and using a custom hash
function. Fixes #161822. The issue described in #163709 still
persists. This is meant as a short term fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164044
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-02 00:59:19 +00:00
a10207e61b Revert "[DCP] Decrease checkpoint background process Gloo pg init timeout (#162760)"
This reverts commit 0925c644edafbb6a8ff42fef5f3bd48b6042fad3.

Reverted https://github.com/pytorch/pytorch/pull/162760 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162760#issuecomment-3358630631))
2025-10-02 00:44:44 +00:00
ffda8e5ddf [inductor] log kernel autotuning result to a csv (#164191)
Example output: https://gist.github.com/shunting314/2d646c6b6cd9a79fff7a35ffee82baed
```
for each model:
  for each triton kernel:
     for each triton config:
        the csv contains a line for the latency and pointer to find the kernel module in the file system
```

Would use this to try to come up with heuristics to pick a single config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164191
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-10-02 00:25:34 +00:00
1a5d023a5b Add B200 to Operator Microbenchmark CI (#164288)
Add B200 to operator microbenchmarks nightly run
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164288
Approved by: https://github.com/huydhn
2025-10-01 23:56:34 +00:00
566ea4e86a Work Around exposing statically linked libstdc++ CXX11 ABI strong symbols (#163980)
Work Around for: https://github.com/pytorch/pytorch/issues/133437

Test plan:
1. Build whl in CI
2. Download
3. Run ``nm -D libtorch_cpu.so | grep "recursive_directory_iterator"``

Test with check_binary_symbols.py:

Success:
```
num_cxx11_symbols: 2326
num_pre_cxx11_symbols: 0
lib: /home/ec2-user/github/variant-repack/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
num_statically_linked_symbols (T): 0
```

Fail when using "W" instead of "T" as type calling ``cxx11_statically_linked_symbols = grep_symbols(
        lib, STATICALLY_LINKED_CXX11_ABI, symbol_type="W"
    )`` :
```
num_cxx11_symbols: 2326
num_pre_cxx11_symbols: 0
lib: /home/ec2-user/github/variant-repack/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
num_statically_linked_symbols (T): 20
Traceback (most recent call last):
  File "/home/ec2-user/github/variant-repack/test/pytorch/.ci/pytorch/smoke_test/check_binary_symbolsc.py", line 130, in <module>
    main()
  File "/home/ec2-user/github/variant-repack/test/pytorch/.ci/pytorch/smoke_test/check_binary_symbolsc.py", line 126, in main
    check_lib_statically_linked_libstdc_cxx_abi_symbols(libtorch_cpu_path)
  File "/home/ec2-user/github/variant-repack/test/pytorch/.ci/pytorch/smoke_test/check_binary_symbolsc.py", line 95, in check_lib_statically_linked_libstdc_cxx_abi_symbols
    raise RuntimeError(
RuntimeError: Found statically linked libstdc++ symbols (recursive_directory_iterator), but there shouldn't be any, see: ['std::filesystem::__cxx11::recursive_directory_iterator::recursion_pending() const', 'std::filesystem::__cxx11::recursive_directory_iterator::depth() const', 'std::filesystem::__cxx11::recursive_directory_iterator::options() const', 'std::filesystem::__cxx11::recursive_directory_iterator::operator*() const', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::operator bool() const', 'std::filesystem::__cxx11::recursive_directory_iterator::disable_recursion_pending()', 'std::filesystem::__cxx11::recursive_directory_iterator::pop(std::error_code&)', 'std::filesystem::__cxx11::recursive_directory_iterator::pop()', 'std::filesystem::__cxx11::recursive_directory_iterator::increment(std::error_code&)', 'std::filesystem::__cxx11::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::__cxx11::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::__cxx11::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::__cxx11::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::__cxx11::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::__cxx11::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::__cxx11::recursive_directory_iterator::operator=(std::filesystem::__cxx11::recursive_directory_iterator&&)', 'std::filesystem::__cxx11::recursive_directory_iterator::operator=(std::filesystem::__cxx11::recursive_directory_iterator const&)', 'std::filesystem::__cxx11::recursive_directory_iterator::operator++()', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>&&)', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr()', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>&&)', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr()']
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163980
Approved by: https://github.com/isuruf, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-10-01 23:17:30 +00:00
9065364995 Add xfailing test case for inplace mutation of local DTensor (#164355)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164355
Approved by: https://github.com/albanD
2025-10-01 23:16:26 +00:00
6eb8d9671b Enable torch.nn.functional.batch_norm in test_export_opinfo (#164261)
Summary:
There are actually 2 `nn.functional.batch_norm` in op_db. See https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_methods_invocations.py#L16797-L16831

So previously the test failed at `assert len(ops)==1`

Test Plan: python test/export/test_export_opinfo.py TestExportOnFakeCudaCUDA

Differential Revision: D83581427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164261
Approved by: https://github.com/SherlockNoMad
2025-10-01 21:56:08 +00:00
b5c4f46bb9 Add functions to setup PrivateUse1 as a python backend device. (#157859)
Fixes #156052 and #156444.

This PR setup the privateuseone key in Python to be used as a python backend for pytorch.
Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a subclass to with that device to hold arbitrary python data as "device data" and use `torch.library` to register ops that takes that Tensor.

Changes done in this PR:

1. Register an vanilla Device Guard: I extended NoOpDeviceGuard to have allow device index of 0 and to not raise errors when event related functions are accessed. If I don't do those, when calling backward I would get errors. (CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.
2. Tensor subclass allows not having `__torch_dispatch__` if the device is not CUDA or CPU. The comment of the check suggests it was to avoid segfault when calling into ops that expects a storage. Here we have a different device so will not call into those ops.
3. python function that invokes the other incantations to setup the privateusekey backend.

This took inspiration of https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157859
Approved by: https://github.com/albanD
2025-10-01 21:32:59 +00:00
773c6762b8 [CD][CUDA13][NCCL] Fix nccl version typo for cu13 (#164383)
https://pypi.org/project/nvidia-nccl-cu13/#history does not have 2.27.5 but 2.27.7+.
Companion PR: https://github.com/pytorch/pytorch/pull/164352

Fixes a potential binary breakage due to non-existence of referenced NCCL cu13 version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164383
Approved by: https://github.com/tinglvv, https://github.com/Skylion007, https://github.com/atalman
2025-10-01 21:32:25 +00:00
7320f44cdc Skip windows unittest in fbcode (#164363)
Summary: as title

Test Plan:
```
buck run fbcode//caffe2/test/inductor:aot_inductor_windows
```

Differential Revision: D83664801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164363
Approved by: https://github.com/angelayi
2025-10-01 20:18:19 +00:00
e5c0e6b5e3 [testing] Better short job name during upload additional stats (#164287)
I think we usually we leave the ` / test` in for clarity
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164287
Approved by: https://github.com/atalman, https://github.com/malfet
2025-10-01 19:56:20 +00:00
7304b9e7d2 [ROCm] fix carveout feature (#164303)
Fixes #164271.

Carveout had been applied with an opposite bitmask. Besides being incorrect, this lead to flaky unit test behavior due to carveout being too high.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164303
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-01 19:25:41 +00:00
315ffdc1e4 [4/N] Apply ruff UP035 rule to python code (#164206)
Follows #164104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164206
Approved by: https://github.com/albanD
2025-10-01 19:05:53 +00:00
8c590cab9d [inductor] add a runtime assert for triton shapes (#164242)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164242
Approved by: https://github.com/eellison, https://github.com/mlazos
ghstack dependencies: #164241
2025-10-01 18:55:33 +00:00
9357c31b53 [inductor] Fix constant shape for float constants (#164241)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164241
Approved by: https://github.com/mlazos
2025-10-01 18:55:33 +00:00
f63d16c6a9 Make viable/strict updatable again (#164374)
To allow viable/strict to move forward, after https://github.com/pytorch/pytorch/pull/164260 was landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164374
Approved by: https://github.com/seemethere
2025-10-01 18:09:07 +00:00
8dfc8efffd [export] Preserve nn_module_stack for aliased nn modules (#164311)
Preparing for install_free_tensors flag.

Thanks to @tugsbayasgalan in coming up with the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164311
Approved by: https://github.com/tugsbayasgalan
2025-10-01 18:04:33 +00:00
3ffaab3bc8 [Replicate][Pipeline Parallelism] integration of new replicate function with pipeline parallelism (#164031)
**Summary:** In order to test numerics for replicate + pp, stage.py needs to be able to call replicate's backward manually as pipeline parallelism doesn't have this feature.

**Test Case**
1.  pytest test/distributed/_composable/test_composability/test_pp_composability.py -k test_replicate_pp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164031
Approved by: https://github.com/weifengpy, https://github.com/H-Huang
ghstack dependencies: #163897
2025-10-01 18:01:16 +00:00
ebd0707578 [SymmMem] Add get_nbi the nonblocking version (#163540)
```Py
@triton.jit
def foo(dest, src):
    nvshmem.get_nbi(dest, src, 100, 0)
    # Some independent computation which overlaps with the get operation
    ...
    # Wait for completion of the get operation
    nvshmem.quiet()
```

Allows us to overlap comm and compute in the same kernel, instead of two kernels + signals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163540
Approved by: https://github.com/ngimel, https://github.com/fegin
2025-10-01 17:50:24 +00:00
76ddbc2bbb Add option to FakeProcessGroup to raise error if comms are invoked. (#162841)
The current behavior is to do "nothing", which means you will corrupt
data.  If you're doing something similar to LocalTensor, where you're
overriding the behavior of collectives to do something numerically,
this can be unwelcome behavior.  If you can error when this happens
it can help prevent silent numerical incorrectness.

Authored with claude code.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162841
Approved by: https://github.com/dcci
2025-10-01 17:48:19 +00:00
69c5c08a01 Revert "[dynamo, 3.14] fix _detect_and_normalize_assert_statement for 3.14 (#164005)"
This reverts commit 5ed4672477c71492a2f41ac0395dd0630446d6a5.

Reverted https://github.com/pytorch/pytorch/pull/164005 on behalf of https://github.com/williamwen42 due to broke some tests e.g. https://github.com/meta-pytorch/autoparallel/actions/runs/18167350261/job/51719783636?pr=179 ([comment](https://github.com/pytorch/pytorch/pull/164005#issuecomment-3357433475))
2025-10-01 17:47:22 +00:00
3dab36bdb4 [FSDP][Replicate] created ReplicateModule and changed replicate to use it instead of FSDPModule (#163897)
**Summary:** In order to minimize the code copied from FSDP to make replicate work, I made all replicated modules FSDPModule. While this was sufficient originally, there are changes to codebase like below that require us to differentiate between a FSDPModule and a ReplicateModule so that we can access replicate_state or fsdp_state: https://www.internalfb.com/code/fbsource/[a9a8e5102052]/fbcode/caffe2/torch/distributed/pipelining/stage.py?lines=629-666.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163897
Approved by: https://github.com/weifengpy
2025-10-01 17:30:10 +00:00
1288c6d8bb Enable keep-going for trunk tags (#164307)
Tags like `trunk/{sha}` are used to re-run signals by [autorevert project](https://github.com/pytorch/test-infra/blob/main/aws/lambda/pytorch-auto-revert/README.md).

We need to have `keep-going` enabled for those reruns, so that they surface all test failures, not just the first one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164307
Approved by: https://github.com/clee2000
2025-10-01 17:21:43 +00:00
80ed522910 [export] support unbacked stack (#163867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163867
Approved by: https://github.com/laithsakka
2025-10-01 16:48:46 +00:00
f7ab8a2710 [1/N] Fix ruff warnings (#164333)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164333
Approved by: https://github.com/albanD
2025-10-01 16:48:32 +00:00
e419dc6d08 [PP] Customize pipeline's submod name (#164037)
Changing PP submodules' name from `submod_i` to `submod_pp_i` to distinguish from the submodule created by HOP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164037
Approved by: https://github.com/H-Huang
ghstack dependencies: #164045, #164035
2025-10-01 16:29:19 +00:00
5f868ca110 [fx] Allow customization of submod name in split graph (#164035)
Fixes #164030: HOP and pipelining both name things submod_i
by adding an optional argument `partition_affix` to `split_module` API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164035
Approved by: https://github.com/ezyang
ghstack dependencies: #164045
2025-10-01 16:26:14 +00:00
20edc5b26a Revert "Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446)"
This reverts commit 22c5e8c17c7551c9dd2855589ae774c1e147343a.

Reverted https://github.com/pytorch/pytorch/pull/162446 on behalf of https://github.com/PaulZhang12 due to perf regression in https://github.com/pytorch/pytorch/issues/164301#issuecomment-3354028620 ([comment](https://github.com/pytorch/pytorch/pull/162446#issuecomment-3357164274))
2025-10-01 16:23:03 +00:00
59a86cb137 Revert "[fx] Allow customization of submod name in split graph (#164035)"
This reverts commit 615da7b95ef22ec0fa07f296dcb103d7d5aeda34.

Reverted https://github.com/pytorch/pytorch/pull/164035 on behalf of https://github.com/yangw-dev due to internal build failed Buck build failed for this target, and is likely caused by your changes. ([comment](https://github.com/pytorch/pytorch/pull/164035#issuecomment-3357113348))
2025-10-01 16:09:50 +00:00
36a37b81cd Revert "[PP] Customize pipeline's submod name (#164037)"
This reverts commit 704cd771f6a63abf9498934aeb7f3079ab9e2232.

Reverted https://github.com/pytorch/pytorch/pull/164037 on behalf of https://github.com/yangw-dev due to internal build failed Buck build failed for this target, and is likely caused by your changes. ([comment](https://github.com/pytorch/pytorch/pull/164035#issuecomment-3357113348))
2025-10-01 16:09:50 +00:00
2610746375 Revert nccl upgrade back to 2.27.5 (#164352)
Revert https://github.com/pytorch/pytorch/pull/162351 as it breaks H100
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164352
Approved by: https://github.com/atalman, https://github.com/malfet
2025-10-01 15:27:40 +00:00
b1033789fe Use TMA loads always for Triton grouped MM kernel (#164256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164256
Approved by: https://github.com/ngimel
ghstack dependencies: #163895
2025-10-01 15:24:51 +00:00
07d896fa48 Revert "CUDACachingHostAllocatorImpl skip event query during capture (#164001)"
This reverts commit 4cf29004749714670fee9e7e3776778faf5ced25.

Reverted https://github.com/pytorch/pytorch/pull/164001 on behalf of https://github.com/yangw-dev due to failed internal error with multiple errors found: Not equal to tolerance rtol=0.1, atol=0.1.. ([comment](https://github.com/pytorch/pytorch/pull/164001#issuecomment-3356894787))
2025-10-01 15:11:21 +00:00
31681bcacc [PyTorch] Pull ARM's box-cox (#164152)
Summary:
ARM has provided with an SVE128 box-cox implementation.

It uses the same underlying algorithm as the previous version, but it has better log and exp implementations.
These supplied mathematical functions have switches to adjust the precision/speed trade-off.

We've noted a slight precision improvement, while also about a 5% peroformance increase

Before:

ZeroLambda1                                                61.66ns    16.22M
NonZeroLambda1                                            125.73ns     7.95M
NonZeroLambdaManyColumns                                    1.84ms    542.11
NonZeroLambdaEigenColumnar                                262.31us     3.81K
NonZeroLambdaEigenRowMajor                                275.17us     3.63K
NonZeroLambdaWithPyTorchColumnar                           97.43us    10.26K
NonZeroLambdaWithPyTorchRowMajor                           90.82us    11.01K
NonZeroLambdaWithPyTorchRowMajorFullBatch                  96.96us    10.31K
NonZeroLambdaBatch                                        151.84us     6.59K

After:

ZeroLambda1                                                57.85ns    17.29M
NonZeroLambda1                                            118.85ns     8.41M
NonZeroLambdaManyColumns                                    1.82ms    548.16
NonZeroLambdaEigenColumnar                                261.67us     3.82K
NonZeroLambdaEigenRowMajor                                274.53us     3.64K
NonZeroLambdaWithPyTorchColumnar                           89.12us    11.22K
NonZeroLambdaWithPyTorchRowMajor                           83.49us    11.98K
NonZeroLambdaWithPyTorchRowMajorFullBatch                  88.79us    11.26K
NonZeroLambdaBatch                                        144.74us     6.91K

Test Plan:
Correctness:

buck2 test @//mode/opt //koski/functions_contrib/df4ai/tests:batch_box_cox_test

Performance:

buck2 run @//mode/opt //koski/functions_contrib/df4ai/benchmark:boxcox_benchmark

Differential Revision:
D83485704

Privacy Context Container: L1196524

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164152
Approved by: https://github.com/ezyang
2025-10-01 15:00:03 +00:00
e901866dd7 Add a RECORD_FUNCTION for Python fallback so it shows in profile (#160573)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160573
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2025-10-01 14:10:44 +00:00
70d1043bdf Fix non-TMA loads in grouped MM Triton kernel (#163895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163895
Approved by: https://github.com/lezcano
2025-10-01 12:21:13 +00:00
69fa26d9b4 Triton 3.5.x pin update (#164268)
Updates triton pin to latest: https://github.com/triton-lang/triton/commits/release/3.5.x/

This updates contains 2 cherry-pick to remove Python 3.9 from list of supported python versions:
https://github.com/triton-lang/triton/pull/8288
https://github.com/triton-lang/triton/pull/8287
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164268
Approved by: https://github.com/aakhundov
2025-10-01 11:41:50 +00:00
d9c80ef97d Build and Install Arm Compute Library in manylinux docker image (#159737)
----

This PR will be part of a series of PR's that aims to remove `.ci/aarch64_linux` folder entirely, such that Aarch64 manylinux build happens as part of `.ci/manywheel/build.sh`, the same as other platforms.

In this PR:

- We prebuild + install Arm Compute Library in the manylinux docker image ( at /acl ), instead of a build time for every pytorch build.  Also updated jammy install path to be /acl too.
- We can therefore remove build_ArmComputeLibrary functions from the ci build scripts.
- There is also some refactoring of install_openblas.sh and install_acl.sh to align them together ( similar formatting, similar variable names, same place for version number update )
- We had 2 places to define openblas version, this has been reduced to 1 now ( install_openblas.sh ).
- ACL_VERSION and OPENBLAS_VERSION are now able to be overriden at build.sh level for developers, but there is only 1 version of each hardcoded for ci.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159737
Approved by: https://github.com/seemethere, https://github.com/aditew01
2025-10-01 11:33:51 +00:00
ac1bc51608 [dynamo] do not pop from framelocals dict in Python 3.10 (#164316)
Followup to https://github.com/pytorch/pytorch/pull/164038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164316
Approved by: https://github.com/anijain2305
2025-10-01 10:20:46 +00:00
ed90040d33 Releases multicast object before releasing mapped buffers in CUDASymmetricMemory (#163750)
Fixes: https://github.com/pytorch/pytorch/issues/162429. In B200, cuMulticastUnbind can error if the mapped buffers are free'd before the multicast object is free'd. The only documentation I could find is here: e11d7f77c1/src/transport/nvls.cc (L113).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163750
Approved by: https://github.com/ngimel, https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/cyyever
ghstack dependencies: #163575
2025-10-01 09:07:48 +00:00
4dab208d97 Adds Issue#153109 as a test for CUDAPluggableAllocator (#163575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163575
Approved by: https://github.com/ngimel
2025-10-01 09:07:48 +00:00
9fd53a2bdc Register MTIA kernel for all_all_out (#164293)
Reviewed By: srsuryadev

Differential Revision: D83517879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164293
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-10-01 09:05:08 +00:00
17ab99463a [Easy] Add notes for setting up dev venv with specific Python version (#164214)
Resolves https://github.com/pytorch/pytorch/issues/164010#issuecomment-3340751377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164214
Approved by: https://github.com/ezyang
ghstack dependencies: #162324
2025-10-01 08:25:13 +00:00
eca6ac2293 [BE][Easy] update CUDA and ROCm sources in nightly tool (#162324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162324
Approved by: https://github.com/ezyang
2025-10-01 08:25:13 +00:00
12d4cb0122 Suppress FutureWarnings in torch.distributed.algorithms.ddp_comm_hooks (#163939)
Fixes #163938

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163939
Approved by: https://github.com/cyyever, https://github.com/kwen2501
2025-10-01 07:51:12 +00:00
590224f83c Improve repeat op to a single copy (#163842)
In #163455 , the `reshape` was not a pure view op.

The `permute` before it created an non-contiguous tensor, which would trigger a data copy during the reshape.

This PR improved the implementation by remove the `urtensor` intermediate tensor completely.
By simply expanding the `xtensor` would achieve the `repeat` effect.

Before this PR, there were two data copies (in `urtensor.copy_` and `urtensor.reshape`).
Now, there is only one data copy in the `.copy_()`.
Reshape would not copy data because it is on a contiguous tensor.

One more note is that we do want at one copy because we want to duplicate the elements for the repeats.
User can inplace modify single elements without afffecting others.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163842
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-10-01 06:27:53 +00:00
cc8b14d09a [2/N] Simplify "in" operation for containers of a single item (#164323)
These issues are detected by ruff [FURB171](https://docs.astral.sh/ruff/rules/single-item-membership-test/#single-item-membership-test-furb171).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164323
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-10-01 05:39:11 +00:00
96c3b9e275 [dynamo] Use strings instead of modules for fqn info tracking (#164272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164272
Approved by: https://github.com/Skylion007, https://github.com/williamwen42, https://github.com/mlazos
2025-10-01 04:22:57 +00:00
9ddfc59b9b [BE] Delete stale non-ephemeral runners workarounds (#164285)
As all Win runners are ephemeral, no need to cleanup leftover processes
or uninstall PyTorch at the end of the test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164285
Approved by: https://github.com/Skylion007
2025-10-01 03:47:36 +00:00
6d4dfa0878 [CI] Push viable/strict/${time} tags (#164183)
Every time viable strict is updated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164183
Approved by: https://github.com/seemethere
2025-10-01 03:41:10 +00:00
11ccb95ccb [PyTorch Pinned Allocator] Pinned memory stats and perf fixes around allocating blocks (#163777)
Summary: This diff adds bucket stats for pinned memory and also a perf fix to not check for sizes when background thread is enabled

Differential Revision: D83162186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163777
Approved by: https://github.com/bbus
2025-10-01 03:28:58 +00:00
bd0907dc4c [BE][CI] Unify requirments (#163396)
Both Linux, Windows and MacOS CI workflows should use `.ci/docker/requirements-ci.txt`
TODOS:
 - Investigate why `choco install cmake` is needed to successfully detect MKL
 - Move `psutil` installation from specific scripts into requirements-ci.txt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163396
Approved by: https://github.com/Skylion007
2025-10-01 03:28:48 +00:00
8bb71c07c4 Skip symmetric memory tests calling _scaled_mm on CCC < 8.9 (#164251)
This avoids them failing on e.g. A100 GPUs with
> RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164251
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-10-01 03:26:21 +00:00
fa90090735 Use dataclass features in two classes (#164221)
This PR completes two TODO items by using features of `dataclass`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164221
Approved by: https://github.com/Skylion007, https://github.com/mlazos

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-10-01 03:20:39 +00:00
591997490a [BE][Easy]: Add prims common TypeGuard (#164263)
Slightly improves typing by adding a TypeGuard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164263
Approved by: https://github.com/albanD
2025-10-01 03:13:10 +00:00
531f3bf5e1 Adding check for square matrix for input tensor in matrix_exp backwar… (#163357)
…d op.

Fixes #146796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163357
Approved by: https://github.com/lezcano
2025-10-01 03:12:30 +00:00
2a5ce2feb4 Add algorithm in header (#164295)
Fixes #163307. Added ```#include <algorithm>``` to vulkan QueryPool for the std::for_each call

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164295
Approved by: https://github.com/Skylion007
2025-10-01 03:09:50 +00:00
3787a5a60e [export] Explicitly passing requires_grad to nn.Parameter() in deserialization (#164290)
Summary: `nn.Parameter()` by default has `requires_grad=True` and would cause issues when there are non-float parameters.

Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_non_float_weight

Differential Revision: D83598796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164290
Approved by: https://github.com/angelayi
2025-10-01 02:55:20 +00:00
c66d18d24d [dynamo][sac] Support functools partial context_fn for sac (#164308)
Fixes https://github.com/pytorch/pytorch/issues/164300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164308
Approved by: https://github.com/Lucaskabela, https://github.com/soulitzer
2025-10-01 02:47:55 +00:00
e0f118585f skip non memory deps in memory estimator (#164294)
Differential Revision: [D83601030](https://our.internmc.facebook.com/intern/diff/D83601030)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164294
Approved by: https://github.com/mlazos
2025-10-01 02:44:58 +00:00
10a005e87f [torchfuzz] add layout operators (#164210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164210
Approved by: https://github.com/pianpwk
ghstack dependencies: #164034, #164209, #164211
2025-10-01 02:33:19 +00:00
1f3995cdc8 [torchfuzz] raise if Operator abstract method is not implemented (#164211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164211
Approved by: https://github.com/pianpwk
ghstack dependencies: #164034, #164209
2025-10-01 02:33:19 +00:00
abfcce58a4 [torchfuzz] remove erroneous can_produce check (#164209)
can_produce is an abstract method that always return false
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164209
Approved by: https://github.com/pianpwk
ghstack dependencies: #164034
2025-10-01 02:33:19 +00:00
5b1c39f5a1 Add smoke tests to verify that stable ABI FA3 wheel runs w/ newer torch (#163782)
Passing CI: https://github.com/pytorch/pytorch/actions/runs/18141589975/job/51635340255?pr=163782

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163782
Approved by: https://github.com/huydhn, https://github.com/mikaylagawarecki
2025-10-01 02:30:38 +00:00
8df3f2fa98 Revert new-test part of #163829 (#164259)
Summary:

New test sizes for `test_scaled_mm_vs_emulated_block_wise` all fail with

```
RuntimeError: Invalid scaling configuration
```

Disable these new tests for now (the remaining test is a parametrized
version of the original test case)

Test Plan:

`pytest test/test_scaled_matmul_cuda.py`

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164259
Approved by: https://github.com/jananisriram
ghstack dependencies: #164266
2025-10-01 02:23:21 +00:00
7a9119948e Split scaled-mm tests into separate file (#164266)
Summary:

* Split scaled-mm-specific tests into `test/test_scaled_matmul.py`

Test Plan:

```
pytest test/test_matmul_cuda.py
pytest test/test_scaled_matmul_cuda.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164266
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-10-01 02:23:21 +00:00
28c1d2f81b [aoti] AOTI mingw cross compilation (#163188)
To run this, you need to install `mingw64-gcc-c++` and download windows cuda library toolkit.

See design doc and demo instructions in https://docs.google.com/document/d/1iDaChqA5nNKkBFTzsdkmoomvQlXHbnlb1Z4yEp7xaJA/edit?tab=t.0

If cross_platform_target is windows, we do the following:

- do not link to `sleef`. This can be improved in the future if we need it. Currently I avoid it because that requires extra setup on the linux side
- Use `mingw64-gcc-c++` to compile
- Use `WINDOWS_CUDA_HOME` instead of `CUDA_HOME` when linking to cuda

```
 python test/inductor/test_aot_inductor_windows.py -k so
 ```

 Other changes:
 - de-couples compile_standalone config and dynamic link flag
 - create a new aot_inductor_mode config module, which is used to control configs in aot_inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163188
Approved by: https://github.com/desertfire
2025-10-01 02:22:06 +00:00
c4bbc6433e [PyTorch CCA] Add an API to get expandable segment sizes (#163771)
Summary: This diffs add an API to query expandable segment size for each stream so that we can use this info to warmup the segment in advance, so we dont incur any performance penalty during steady state inference for new CUDA memory allocations.

Differential Revision: D76447308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163771
Approved by: https://github.com/bbus
2025-10-01 02:16:58 +00:00
ad7e3c93b1 [ROCm][CD] librocroller.so missing from ROCm 7 wheel (#164244)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164244
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-01 00:02:34 +00:00
7f3dc45300 Migrate DeviceType to torch/headeronly (#163999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163999
Approved by: https://github.com/mikaylagawarecki
2025-09-30 23:13:27 +00:00
ff715366aa [vllm hash update] update the pinned vllm hash (#164190)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164190
Approved by: https://github.com/pytorchbot
2025-09-30 22:43:49 +00:00
60a4961ff4 [DTensor] Allow redistribute to Partial if src matches (#164253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164253
Approved by: https://github.com/zpcore
2025-09-30 22:42:49 +00:00
bec6541d84 [CUDA][CUDAGraph] Reduce capture overhead in CUDA Graph memory reuse (#162186)
Previous work #158352 delivered CUDAGraph memory footprint reduction with no replay-time impact, but capture time regressed (up to 20× slower) due to repeated full-graph traversals. See previous benchmark results [here](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3215947565)

This PR removes capture/reply overhead while preserving the memory savings:

1. **Terminals as free markers**
   We stop inserting empty nodes and instead record the current stream terminals as free markers. This avoids mutating the user’s graph and keeps semantics unchanged.

2. **Incremental, cached reachability**
   We add a **per-graph reuse context** that caches reverse-traversal state:

   * `graph_reuse_context[graph].visited[stream]` tracks nodes already seen from that stream’s terminal frontier.
   * On each allocation during capture, we resume traversal from the latest terminals and only visit unseen nodes.
   * A block is freed when all its recorded markers are in the visited set of its allocation stream—i.e., all markers are proven predecessors of future work.

See [the performance results here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRPvdd9Xa8W87ixbiA0da_qvOhrUAjUpFz0G-_j-MsDnoeRyhEa4_ut_W3rqcg1VVZVFJ-gucwov-3b/pubhtml?gid=1468302443&single=true), we sweep synthetic multi-stream CUDA Graphs built by `capture_benchmark.py` (same as before, we generate random interleaving of alloc/free/join with given probabilities, see [gist here](https://gist.github.com/eee4017/e2092d215b1d4bd46534148939af39e3)), and we compare median capture/replay times and memory. On an NVIDIA H100 PCIe across 24 configs, the optimization preserves reserved memory reduction at ~24–98%, leaves allocated memory unchanged, and brings capture time back to baseline (range 0.96–1.04× vs. baseline) with replay time unchanged (range 0.97–1.11×).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162186
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-09-30 22:28:46 +00:00
1f1de20ba9 [c10d][BE][ez] Update tensor ptr inside nccl.cpp (#164276)
This is mostly a cosmetic change which replace the deprecating `data_ptr` API with mutable or const one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164276
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501
2025-09-30 22:05:12 +00:00
2810977d3a [FSDP][Replicate] tests replicate type casting behavior and edge cases in mixed precision (#162861)
**Summary:** Ensures that replicate can handle the same type casting behavior and edge cases that fully shard can when mixed precision is used

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_float16_on_one_submodule
2. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_submodules_with_external_inputs
3. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_norm_modules_bf16
4. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_norm_modules_fp16
5. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_clamp_reduce_dtype
6. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_dataclass_input

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162861
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836, #162839, #162851, #162853, #162855
2025-09-30 22:03:23 +00:00
ae4fd4ea75 [FSDP2] support AC(FSDP) for torchtitan's MOE (#164009)
for fsdp2 + EP, titan has fully_shard(AC(layer)) and fully_shard(layer.moe.experts): https://github.com/pytorch/torchtitan/issues/1624

for implicit prefetching, backward order is
* _pre_backward unshard (norm, output)
* _backward_prefetch unshard layers.6
* post_backward reshard (norm, output)
* _pre_backward unshard layers.6 (no-op, unsharded already)
* _backward_prefetch unshard layers.6.moe.experts
* recompute_fn pre_forward unshard layers.6.moe.experts (no-op, unsharded already)
* ~~recompute_fn post_forward reshard layers.6.moe.experts~~ <----- this PR make it a no-op
* _pre_backward unshard layers.6.moe.experts (no-op, unsharded already)
* _backward_prefetch unshard layers.5
* post_backward reshard layers.6.moe.experts
* post_backward reshard layers.6

unit test: `pytest -s test/distributed/_composable/fsdp/test_fully_shard_comm.py -k test_set_modules_to_backward_prefetch_inside_ac`

before fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 11:43:01,714 - root - INFO - step:  1  loss: 12.0162  grad_norm:  1.7315  memory: 45.64GiB(48.05%)  tps: 1,028  tflops: 10.87  mfu: 1.10%
[rank0]:[titan] 2025-09-30 11:43:01,714 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 11:43:35,233 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds
[rank0]:[titan] 2025-09-30 11:43:35,987 - root - INFO - step: 50  loss:  6.9302  grad_norm:  0.9985  memory: 59.66GiB(62.80%)  tps: 11,712  tflops: 123.89  mfu: 12.53%
```

after fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 11:38:57,377 - root - INFO - step:  1  loss: 12.0134  grad_norm:  1.6916  memory: 38.42GiB(40.45%)  tps: 805  tflops: 8.51  mfu: 0.86%
[rank0]:[titan] 2025-09-30 11:38:57,377 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 11:39:28,541 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds
[rank0]:[titan] 2025-09-30 11:39:29,279 - root - INFO - step: 50  loss:  6.9346  grad_norm:  1.1875  memory: 52.58GiB(55.36%)  tps: 12,583  tflops: 133.10  mfu: 13.46%
```

for explicit prefetching, layers.6 backward prefetch layers.5 and layers.5.moe.experts. layers.6.moe.experts does not have explicit prefetch. backward order is like this
* _pre_backward unshard (norm, output)
* _prefetch_unshard layers.6
* post_backward reshard (norm, output)
* _pre_backward unshard layers.6 (no-op, unsharded already)
* _prefetch_unshard layers.5
* _prefetch_unshard layers.5.moe.experts
* recompute_fn pre_forward unshard layers.6.moe.experts
* ~~recompute_fn post_forward reshard layers.6.moe.experts~~ <----- this PR makes it a no-op
* _pre_backward unshard layers.6.moe.expert (no-op, unsharded already)
* post_backward reshard layers.6.moe.expert
* post_backward reshard layers.6

before fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 11:53:24,574 - root - INFO - step:  1  loss: 12.0180  grad_norm:  1.6948  memory: 45.77GiB(48.18%)  tps: 849  tflops: 8.98  mfu: 0.91%
[rank0]:[titan] 2025-09-30 11:53:24,574 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 11:53:57,768 - root - INFO - [GC] Performing periodical GC collection 0.07 seconds
[rank0]:[titan] 2025-09-30 11:53:58,515 - root - INFO - step: 50  loss:  6.9358  grad_norm:  1.0528  memory: 59.80GiB(62.95%)  tps: 11,827  tflops: 125.10  mfu: 12.65%```
```

after fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 12:08:39,404 - root - INFO - step:  1  loss: 12.0143  grad_norm:  1.7030  memory: 38.55GiB(40.58%)  tps: 988  tflops: 10.45  mfu: 1.06%
[rank0]:[titan] 2025-09-30 12:08:39,404 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 12:09:10,482 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds
[rank0]:[titan] 2025-09-30 12:09:11,168 - root - INFO - step: 50  loss:  6.9356  grad_norm:  0.9911  memory: 52.81GiB(55.59%)  tps: 12,637  tflops: 133.68  mfu: 13.52%
```

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164009
Approved by: https://github.com/soulitzer
2025-09-30 22:02:24 +00:00
adc11a7634 [export] avoid checks during tracing of export verification (#164219)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164219
Approved by: https://github.com/Lucaskabela
2025-09-30 21:46:59 +00:00
99e28ffab3 [FSDP][Replicate] tests replicate core functionality with mixed precision (#162855)
**Summary:** Ensures that replicate functionality works the same as fully shard's when mixed precision is used

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k TestReplicateMixedPrecisionTraining

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162855
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836, #162839, #162851, #162853
2025-09-30 21:45:58 +00:00
01dd2c2b42 [FSDP][Replicate] tests replicate is composable with tp (#162853)
**Summary:** Proof that new replicate API is composable with TP

**Test Case**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_replicate_tp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162853
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836, #162839, #162851
2025-09-30 21:29:54 +00:00
d3bdf8c32e [FSDP][Replicate] tests replicate with custom forward method (#162851)
**Summary: tests replicate works when users use custom forward methods**

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_register_fsdp_forward_method

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162851
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836, #162839
2025-09-30 21:15:34 +00:00
1ce9563ff6 [FSDP][Replicate] tests replicate gradient accumulation and 1f1b microbatching (#162839)
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. The first test verifies Replicate works with gradient accumulation properly. The second verifies that replicate works correctly with a One-Forward-One-Backward (1F1B) pipeline parallelism schedule

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_gradient_accumulation
2. pytest test/distributed/_composable/test_replicate_training.py -k test_1f1b_microbatching

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162839
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836
2025-09-30 21:00:16 +00:00
9e631392dc Missing lambda in torch._check (#164225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164225
Approved by: https://github.com/Skylion007
2025-09-30 20:32:38 +00:00
1cce6efdb8 Fix silent incorrectness for bmm/baddmm out_dtype overload (#164095)
Add input checks like meta functions for standard ops in `ATen/native/LinearAlgebra.cpp` for the `out_dtype` variants. Fixes silent incorrectness in https://github.com/pytorch/pytorch/issues/163816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164095
Approved by: https://github.com/ngimel
2025-09-30 20:13:13 +00:00
5a93f00c79 [CI] Delete binary smoke workflows (#164260)
Those were very useful in the past, because:
- CI builder jobs did not generates wheels, but rather run `python setup.py develop` and shared docker layers, which is no longer the case, all CI jobs produce wheels
- CD jobs were targeting pre-CXX11 ABI, but this is no longer the case after manylinux2_28 migration

Existing, but acceptable gaps:
 - Windows libtorch debug builds sometimes might fail, but IMO it's ok not to be able to produce those for a few days, as number of libtorch users are somewhat small
 - All CD jobs are based on AlmaLinux, while CI are based on Ubuntu, but this could be adjusted if needed, besides AlmaLinux-9 and Ubuntu-22.04 are pretty close in terms of glibc and gcc versions
 - CD jobs build for all GPU architectures, while CI only for the one being tested, but there are now periodic H100 and B200 jobs, and not a lot of development happens for Voltas or Pascals

Besides there are better tools to alert about the nightly failures

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164260
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-09-30 20:00:07 +00:00
e30f01b5b5 [1/N] Simplify "in" operation for containers of a single item (#164224)
These issues are detected by ruff [FURB171](https://docs.astral.sh/ruff/rules/single-item-membership-test/#single-item-membership-test-furb171).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164224
Approved by: https://github.com/rec, https://github.com/Skylion007
2025-09-30 19:59:43 +00:00
ffc645c870 half support for fused_moving_avg_obs_fake_quant() op (#164175)
Follow up to https://github.com/pytorch/pytorch/pull/162620.  Add half support, as well.  This fixes some failures in inductor benchmarks such as from this log https://github.com/pytorch/pytorch/actions/runs/18051942373/job/51376749459.

`NotImplementedError: "aminmax_kernel" not implemented for 'Half'`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164175
Approved by: https://github.com/malfet, https://github.com/jerryzh168
2025-09-30 19:35:17 +00:00
60f0a356fd Update persons of interest for XLA. The previous one is out of date. (#158652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158652
Approved by: https://github.com/JackCaoG, https://github.com/albanD
2025-09-30 19:21:18 +00:00
d2c5f231f6 Fix the shape check inside gnll loss (#147522)
Fixes #147521
This modification allow user to put any size of var in GaussianNLLLoss if the var is broadcastable (to input/target's size)

Therefore, the demo code in #147521 will result in expected behaviour and correct output.

This allow all input size that match:
`input.size = (..., n, ...), var.size = (..., 1, ...)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147522
Approved by: https://github.com/mikaylagawarecki
2025-09-30 18:40:15 +00:00
cc5d74c366 Revert "[BE] Remove HermeticPyObjectTLS and Simplify PythonOpRegistrationTrampoline (#163464)"
This reverts commit 94195a37ae4eae9c486a81b0f67725c8970f74d6.

Reverted https://github.com/pytorch/pytorch/pull/163464 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/163464#issuecomment-3353307034))
2025-09-30 18:20:20 +00:00
a707042353 fix: inductor non_blocking test - warmup events to make test pass whether it is the first run or not (#164188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164188
Approved by: https://github.com/williamwen42
2025-09-30 18:20:17 +00:00
d615f6b935 [inductor] use hint_override in kernel benchmark args (#164207)
Summary: forward fix T239259207

Test Plan: test_multi_kernel

Differential Revision: D83539263

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164207
Approved by: https://github.com/bobrenjc93, https://github.com/mlazos
2025-09-30 18:09:29 +00:00
719b64ee8b Fix TMA transpose logic to handle 1D shapes + string differences (#163966)
Fixes #163702.

This fixes 2 issues:
1. The value may inconsistently be a shape or string. This normalizes to handle both of these.
2. 1D shapes should not transpose data. This fixes the order of operations to prevent this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163966
Approved by: https://github.com/eellison
2025-09-30 17:51:37 +00:00
1cf1b9138d [inductor][templates] Template hooks should be finalised inside a kernel context (#164229)
The prologue buffer added in https://github.com/pytorch/pytorch/pull/160480 is added to template code in the DEF_KERNEL [hook](29221b9828/torch/_inductor/select_algorithm.py (L742)). The lines in this buffer may be of type `DeferredLine`, and so require the correct kernel context to determine whether lines should be added or removed.

Test plan:

Tested with a custom template using tensor descriptors for prologue fused inputs, whose tensor descriptors need to be hoisted to the top of the kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164229
Approved by: https://github.com/njriasan
2025-09-30 17:50:59 +00:00
5ed4672477 [dynamo, 3.14] fix _detect_and_normalize_assert_statement for 3.14 (#164005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164005
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191, #163292, #163796, #163818, #163919, #163920, #164004
2025-09-30 17:43:03 +00:00
2600f8b3d1 [dynamo, 3.14] fix tracing typing.Union (#164004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164004
Approved by: https://github.com/anijain2305, https://github.com/mlazos
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191, #163292, #163796, #163818, #163919, #163920
2025-09-30 17:43:03 +00:00
9ce31e4278 [3.14] make unbacked_sym[int/float]_counter integers (#163920)
3.14 removed copy/deepcopy/pickle support for `itertools` iterators: https://docs.python.org/3.14/whatsnew/3.14.html#itertools

Change unbacked_sym[int/float]_counter from `itertools.count` to regular integers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163920
Approved by: https://github.com/ezyang
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191, #163292, #163796, #163818, #163919
2025-09-30 17:42:55 +00:00
0657de9c61 [dynamo, 3.14] support LOAD_COMMON_CONSTANT (#163919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163919
Approved by: https://github.com/anijain2305, https://github.com/mlazos
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191, #163292, #163796, #163818
2025-09-30 17:42:47 +00:00
4ead8ebf70 [dynamo, 3.14] fix BUILD_TUPLE with 0 args (#163818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163818
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191, #163292, #163796
2025-09-30 17:42:40 +00:00
d4b785a6a7 [dynamo, 3.14] fix stack ref copy error (#163796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163796
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191, #163292
2025-09-30 17:42:33 +00:00
9278b18ec0 [dynamo, 3.14] fix WITH_EXCEPT_START (#163292)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163292
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191
2025-09-30 17:42:26 +00:00
008b0a9425 [dynamo, 3.14] fix inactive ctx handling in resume functions (#163191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163191
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110
2025-09-30 17:42:19 +00:00
44677ad917 [dynamo, 3.14] support LOAD_CONST on slice, codegen LOAD_CONST slice instead of BINARY/STORE_SLICE (#163110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163110
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839, #163009, #163109
2025-09-30 17:42:11 +00:00
1c9987fdf4 [dynamo, 3.14] fix context managers (#163109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163109
Approved by: https://github.com/anijain2305, https://github.com/mlazos
ghstack dependencies: #161838, #161555, #161839, #163009
2025-09-30 17:42:03 +00:00
7cbc011700 [dynamo, 3.14] support some bytecodes, fix CALL_FUNCTION_EX (#163009)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163009
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839
2025-09-30 17:41:56 +00:00
09c774145e [dynamo, 3.14] Python dynamo changes to get basic programs working (#161839)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161839
Approved by: https://github.com/Lucaskabela, https://github.com/anijain2305
ghstack dependencies: #161838, #161555
2025-09-30 17:41:49 +00:00
763ab2a6ed [dynamo, 3.14] compile actual code in C dynamo (#161555)
No 3.14 CI tests enabled yet, but this was enough to get Dynamo compiling locally and Python Dynamo is at least being called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161555
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838
2025-09-30 17:41:42 +00:00
4b8fe795f8 [dynamo] format cpython_defs.c (#161838)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161838
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2025-09-30 17:41:35 +00:00
84e1cd7392 [inductor] fx comm overlap: align runtime estimations across dist ranks (#164226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164226
Approved by: https://github.com/eellison
2025-09-30 17:29:18 +00:00
937869657e Exporting aten.sdpa with cuda under fake mode on a cuda-less machine (#164162)
Summary:
As titled.

sdpa will select backend based on hardware check, and it fails when exporting with cuda under fake mode on a cuda-less machine.

We guard `at::cuda::is_available()` check before `at::cuda::getCurrentDeviceProperties()` and give warnings.

Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r nn_functional_scaled_dot_product_attention

Differential Revision: D83496154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164162
Approved by: https://github.com/SherlockNoMad
2025-09-30 17:21:31 +00:00
7d7ae4d7b2 [submodule] upgrade cutlass version to 4.2.1 and completely resolved python/cutlass name collision (#164156)
Differential Revision: D83489362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164156
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-09-30 17:04:57 +00:00
906fe7b120 [ROCm][CI] no longer build almalinux image for ROCm 6.3 (#164201)
Missed during ROCm 7 upgrades.  We only build N and N-1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164201
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-30 16:59:31 +00:00
7edd18f0fd [Inductor-FX] Generalize FloorDiv conversion to handle more complex launch grids. Remove python_slow grid mode. (#163828)
# Problem
Inductor's FX backend receives sympy expressions for Triton launch grids, and passes these to a tracer to generate equivalent FX IR. However, the tracer does not support all possible sympy expressions. In particular, it can't handle ops like `floor` and `Pow` which would be found in an expression like `floor(x / y)`. Instead, it expects `FloorDiv(x, y)`, which has the advantage that all intermediate values are integers, unlike `x / y`.

Inductor's Python backend uses a trick where `ceil(x / y)` is computed in Python as `-(x // -y)`, which is faster when evaluating Python launch grids at runtime. However, this trick generates more complex sympy expressions, so the FX backend introduced a `"python_slow"` mode using a more familiar form of ceil division. However, this mode is slower to evaluate, which increased production CPU usage. (Internal reviewers see T237853632.)

# Solution
To get the best of both worlds, this PR removes `"python_slow"` mode, and generalizes the `replace_floor_div` function  to handle the more complex expressions resulting from the `"python"` grid mode. The new algorithm is conceptually similar to the existing one, except instead of analyzing only the first argument to a `sympy.Mul` op, it checks all factors, so it can handle expressions containing both `Rational` and `Pow` ops, among other cases. It also uses `Mul.make_args` to handle the case when the argument to `floor` is not a `Mul`. Finally, it uses `expr.is_positive` to check the sign of symbolic exponents.

This new algorithm is guaranteed to convert all `floor` ops to an equivalent expression using `FloorDiv`. (To see this, consider that `floor(x) == FloorDiv(x, 1)`.) Note it may not remove all `Pow` ops, with a counterexample being `floor(x / (2 + z ** y))`, but it covers everything we've seen in practice for symbolic launch grids. In particular, it covers the typical case where `Pow` is a factor of the argument to `floor`, and the exponent is `-1`. Is this situation, we move the `Pow` to the denominator of `FloorDiv` and the exponent becomes `1`, eliminating the `Pow` op.

# Test plan
This PR adds an end-to-end test for static padding with dynamic outer dimensions, which creates a difficult sympy expression that the existing algorithm would not be able to handle.

This PR also adds some unit tests for the `replace_floor_div` function. It can be difficult to construct end-to-end tests that expose all the trickiest expressions, as those tests have to pass through a number of other systems handling dynamic shapes. Therefore, it's easier to expose the edge cases with these new unit tests. The tests check that we can replace all `floor` ops in the input expression with `FloorDiv`, then they expand `FloorDiv` back to `floor` and check equality with the original expression.

Note this PR also requires some MTIA changes to pass internal tests. Those will be stacked onto the imported diff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163828
Approved by: https://github.com/nandesuka, https://github.com/angelayi, https://github.com/jansel
2025-09-30 16:47:49 +00:00
3564cd294c Fix TestExportOpInfo (#164184)
Fixes https://github.com/pytorch/pytorch/issues/163699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164184
Approved by: https://github.com/yiming0416, https://github.com/tugsbayasgalan
2025-09-30 16:12:39 +00:00
1412a4a42f [precompile] Add option to disable guard check on aot-compiled function. (#163432)
Summary:
Under circumstances it seems reasonable to return a callable directly without guard check when user use aot_compile on a function with single compilation result.

When having multiple entries (aot_compile_module), we should start enabling guard check to differetiate different compiled functions apart.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163432
Approved by: https://github.com/dolpm, https://github.com/mlazos
2025-09-30 16:10:15 +00:00
96330f490d [testing] Add upload for test status during test stat uploads (#164189)
Add test status (flaky, success, skipped, failure) upload for easier comparison between test status on two commits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164189
Approved by: https://github.com/huydhn, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-30 15:53:53 +00:00
eqy
66abba8f49 [CUDA][Expandable Segments] Follow-up cleanups for even more expandable segments tests (#163297)
Gets original setting even earlier in case of crashes, fixes previous get call where set should be

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163297
Approved by: https://github.com/Skylion007
2025-09-30 15:39:14 +00:00
e88cca0691 Update Sphinx theme (#164147)
Fix links in the top nav bar: 71e55749be

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164147
Approved by: https://github.com/albanD
2025-09-30 15:35:58 +00:00
5c020beba4 Update LPPool docs to clarify ceil_mode padding semantics when ceil_mode=True (#163186)
# Summary

- Add a note to each `nn.LPPool*d` docstring explaining how `ceil_mode=True` interacts with right padding.
- Mirror the same clarification in the `torch.nn.functional.lp_pool*` docstrings so the rendered functional docs stay in sync.

# Motivation

The current PyTorch spec for **LPPool** does not fully match runtime behavior, which has led to downstream confusion in other specs (e.g., ONNX) and runtimes (e.g., [onnxruntime issue #25848](https://github.com/microsoft/onnxruntime/issues/25848)). A corresponding clarification was also made in the ONNX spec: [onnx/onnx#5741](https://github.com/onnx/onnx/pull/5741).

PyTorch’s **LPPool** implementation calls into **AvgPool**, which enforces the rule that windows starting entirely in the right padded region are ignored when `ceil_mode=True`. As a result, **LPPool** inherits the same behavior.

This is an edge case where the output size formula shown in the LPPool docs/spec is not sufficient on its own. Without the added caveat, the documentation is technically incorrect. This PR brings the LPPool docs in line with actual behavior.

Note that this is a trivial fix to the spec as all major implementers of the spec adhere to this caveat.

For comparison, both **MaxPool** and **AvgPool** already include this clarification in their spec. Their docstrings explicitly state:

> *When `ceil_mode=True`, sliding windows are allowed to go off-bounds if they start within the left padding or the input. Sliding windows that would start in the right padded region are ignored.*

Adding the same note to LPPool ensures consistency across all pooling operators.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163186
Approved by: https://github.com/mikaylagawarecki
2025-09-30 15:22:46 +00:00
edd9e07aff [BE] Remove not existing mnist mirror (#164238)
Looks like original source is empty now:
http://yann.lecun.com/exdb/mnist/

Pytorch hosted mirror exist. Hence leaving it as only option.
https://ossci-datasets.s3.amazonaws.com/mnist/

Fixes these errors in pytorch/ci:
```
C:\actions-runner\_work\pytorch\pytorch>python tools\download_mnist.py --quiet -d C:\actions-runner\_work\pytorch\pytorch\test\cpp\api\mnist
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz ...
Failed to download (trying next):
HTTP Error 404: Not Found
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz ...
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz ...
Failed to download (trying next):
HTTP Error 404: Not Found
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz ...
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz ...
Failed to download (trying next):
HTTP Error 404: Not Found
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz ...
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz ...
Failed to download (trying next):
HTTP Error 404: Not Found
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz ...
```

Link to workflow with example:
https://github.com/pytorch/pytorch/actions/runs/18109150240/job/51542177282#step:15:2335
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164238
Approved by: https://github.com/jeanschmidt
2025-09-30 15:15:13 +00:00
0fb89b84b9 Revert "Consistently use c10_ovrsource in arvr mode everywhere (#164128)"
This reverts commit efd7fd5ed5ac7ec03201a546a09fb19ec59de431.

Reverted https://github.com/pytorch/pytorch/pull/164128 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164128#issuecomment-3352544006))
2025-09-30 14:43:52 +00:00
79fcfd49d6 Revert "[CI] Push viable/strict/${time} tags (#164183)"
This reverts commit 9f27b0c24515d9cf319d9a728d5009bf9ed035cf.

Reverted https://github.com/pytorch/pytorch/pull/164183 on behalf of https://github.com/malfet due to Hmm, didn't work that way ([comment](https://github.com/pytorch/pytorch/pull/164183#issuecomment-3352494098))
2025-09-30 14:32:46 +00:00
71b4fada57 Revert "Add less warps config to inner reductions (#162447)"
This reverts commit 84d673ef577d42d6ec20c6c9f09863583c3111f5.

Reverted https://github.com/pytorch/pytorch/pull/162447 on behalf of https://github.com/PaulZhang12 due to internal failure ([comment](https://github.com/pytorch/pytorch/pull/162447#issuecomment-3352474768))
2025-09-30 14:28:19 +00:00
46ec0664e3 Remove unused PyIntXXX, THPUtils_newReal_BOOL, THPQXXX macros (#164056)
The removed macros are not used in other places of the `pytorch` GitHub org.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164056
Approved by: https://github.com/albanD
2025-09-30 13:48:25 +00:00
410ed3006b Revert "Add functions to setup PrivateUse1 as a python backend device. (#157859)"
This reverts commit 1310d6a1f9194ddcf6753f7e12fb78f278451f8a.

Reverted https://github.com/pytorch/pytorch/pull/157859 on behalf of https://github.com/jeanschmidt due to introduce linting errors ([comment](https://github.com/pytorch/pytorch/pull/157859#issuecomment-3352140098))
2025-09-30 13:24:37 +00:00
77354e22e1 [OpenReg] Add AMP Integration guide for accelerators (#162050)
Fix part of #158917

Add AMP integration document and OpenReg code as example to explain steps of integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162050
Approved by: https://github.com/albanD

Co-authored-by: FFFrog <ljw1101.vip@gmail.com>
2025-09-30 12:27:11 +00:00
7f29c47a4f Fix cdist export compute mode validation (#161724)
Fixes #161089. Added '0' as the acceptable value for compute mode in _meta_registrations.py. Also, added a test case in test_export.py file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161724
Approved by: https://github.com/albanD, https://github.com/angelayi
2025-09-30 12:23:20 +00:00
ace6c76103 [inductor] Small refactor of CachingAutotuner (#162406)
This is a simple refactor that just moves some logic in `_precompile_config` to two new functions for separation of concerns. This will allow subclasses e.g. out of tree to configure options and metadata for triton.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162406
Approved by: https://github.com/exclamaforte
2025-09-30 11:29:15 +00:00
1310d6a1f9 Add functions to setup PrivateUse1 as a python backend device. (#157859)
Fixes #156052 and #156444.

This PR setup the privateuseone key in Python to be used as a python backend for pytorch.
Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a subclass to with that device to hold arbitrary python data as "device data" and use `torch.library` to register ops that takes that Tensor.

Changes done in this PR:

1. Register an vanilla Device Guard: I extended NoOpDeviceGuard to have allow device index of 0 and to not raise errors when event related functions are accessed. If I don't do those, when calling backward I would get errors. (CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.
2. Tensor subclass allows not having `__torch_dispatch__` if the device is not CUDA or CPU. The comment of the check suggests it was to avoid segfault when calling into ops that expects a storage. Here we have a different device so will not call into those ops.
3. python function that invokes the other incantations to setup the privateusekey backend.

This took inspiration of https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157859
Approved by: https://github.com/albanD
2025-09-30 08:39:36 +00:00
7f4c3e7d2f distributed/serialization: support zero sized tensors (#164198)
Fixes
```
[4] ValueError: both buffer length (0) and count (-1) must not be 0
```

Test plan:

```
pytest test/distributed/test_serialization.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164198
Approved by: https://github.com/amirafzali
2025-09-30 08:11:29 +00:00
6e5b4249a5 [DTensor][Export] Supporting exporting a model with DTensor params/inputs (#163609)
I experimented with 3 paths to get joint graph for DTensorized module and input

1. strict_export + aot_export_joint_with_descriptors
2. graph_capture + aot_export_joint_with_descriptors
3. aot_export_joint_with_descriptors alone

Added test to guard them.

1 doesn't work, as bw graph region is missing from the joint graph.
I am leaning towards making 2 the recommended path.
If 2 doesn't work going forward, we can fallback to 3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163609
Approved by: https://github.com/tugsbayasgalan

Co-authored-by: suo <suo@fb.com>
2025-09-30 07:54:13 +00:00
5274753873 [dynamo][device_mesh] Support mesh_dim_names (#164200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164200
Approved by: https://github.com/SherlockNoMad, https://github.com/jansel
2025-09-30 07:16:28 +00:00
7afcb030d8 Back out "Revert D81959389" (#163905)
Summary:
Original commit changeset: 06888d7ebff0

Original Phabricator Diff: D82932788

Restricted the test to SM90 for scaled_grouped_mm

Test Plan: TBD (will share the linux CI results)

Differential Revision: D83283991

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163905
Approved by: https://github.com/angelayi
2025-09-30 07:05:13 +00:00
bbf6816f35 [dynamo] Special path for cloning of torch dispatch tensors (#164081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164081
Approved by: https://github.com/tugsbayasgalan, https://github.com/mlazos
2025-09-30 05:15:56 +00:00
ace89350fc better error handling for rrelu when lower or upper range is infinite (#160965)
… - issue#153281

Fixes #153281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160965
Approved by: https://github.com/janeyx99
2025-09-30 05:01:32 +00:00
7d59e37434 Add Comm-Compute Preserving Bucketer (#163960)
tl;dr performs bucketing while preserving comm-compute overlap.

In comm-compute overlap we will have a graph with:

```
def foo(...):
     ag = all_gather(...)
     hiding_compute = mm(...)
     wait(ag)
```

There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.

Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.

We perform bucketing while augmenting the graph with these relationships. This can be done separably from comm-compute overlap, so long as the hiding compute relationships are passed in.

TODO:
- need to instrument fx graph so inductor respects these relationships.
- the compile time of the bucketing search can be sped up significantly by limiting what portion of the graph we traverse through
- more memory aware handling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163960
Approved by: https://github.com/ruisizhang123, https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: #163215, #163754, #163959
2025-09-30 04:53:58 +00:00
92108f4abd Helper to augment graph with additional deps (#163959)
In comm-compute overlap we will have a graph with:

```
def foo(...):
     ag = all_gather(...)
     hiding_compute = mm(...)
     wait(ag)
```

There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.

Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.

This pr adds `AugmentedGraphHelper` that adds the apis, and allows querying for dependency with this augmented graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163959
Approved by: https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: #163215, #163754
2025-09-30 04:53:58 +00:00
0b2fdc30a2 refactor bucketing (#163754)
Preparatory refactory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163754
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #163215
2025-09-30 04:53:58 +00:00
0d7994ca97 [inductor] do comm compute overlap at aten fx level (#163215)
This is first part of the stack that does comm/compute reordering, and then uses the exposure analysis to do bucketing.

Subsequent prs will handle:
- use of exposure analysis to do bucketing
- make sure inductor respects comm/compute overlapping done at fx level
- non-profiling mm estimation/rank broadcasting of profile results

Other mis:
- Validate accuracy of nccl estimations  ( use ruisi's profiling instead ?)

For a llama 2d parallelism test, on forward, we overlap all but 2 of potentially hidden collectives. For backward, we overlap 217/269 of potentially hidden collectives. If you increase `compute_overlap_multipler` (for fudge factor of inaccurate comms estimation), that goes down to all but 16 of potentially hidden collectives.

fwd example: https://gist.github.com/eellison/76209c49d8829c5f1e323d34a3f040c3

bwd example: https://gist.github.com/eellison/6cfc2285df53a94cfa4012f5fdae5c51

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163215
Approved by: https://github.com/IvanKobzarev
2025-09-30 04:53:58 +00:00
c39357bab6 [torchfuzz] Make scalar and tensor distribution configurable (#164034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164034
Approved by: https://github.com/pianpwk
2025-09-30 04:50:54 +00:00
a293206bd5 Fix invalid f-strings (#164112)
Fixes invalid f-strings detected by `ruff`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164112
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-09-30 04:17:13 +00:00
9f27b0c245 [CI] Push viable/strict/${time} tags (#164183)
Every time viable strict is updated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164183
Approved by: https://github.com/seemethere
2025-09-30 04:00:22 +00:00
85012fe167 Remove unnecessary list comprehensions (#164103)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164103
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-09-30 03:56:54 +00:00
ca19815e3c Revert "Enable outer reductions in fbcode (#163884)"
This reverts commit 872edd89d62f0095d3fbd8ae9204d7c8bd980460.

Reverted https://github.com/pytorch/pytorch/pull/163884 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/163884#issuecomment-3349822031))
2025-09-30 03:42:24 +00:00
0b0ed6fd33 [doc] Add AOTInductor intermediate debug printer OSS user manual (#163794)
Summary: Add a OSS user manual for AOTI intermediate debug printer so we can link it in the Pytorch conference poster.

Test Plan: N/A

Differential Revision: D83171374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163794
Approved by: https://github.com/yushangdi
2025-09-30 03:01:03 +00:00
55840fb4bb [CMake] Fix USE_FBGEMM_GENAI option (#164165)
----

- `cmake_dependent_option` condition should be `USE_ROCM OR (USE_CUDA AND NOT MSVC)` (similar to the one for flash attention)
- Default settings should be user overridable, i.e. even if one builds for SM_10, they should be able to pass `USE_FBGEMM_GENAI=0` and skip the build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164165
Approved by: https://github.com/Skylion007
2025-09-30 02:38:03 +00:00
b7419b920d [ROCm][CI] Upgrade ROCm to 7.0 (#163140)
Upgrade all the ROCm docker image to ROCm 7.0 release version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163140
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-30 02:23:26 +00:00
3b4ad4a17d [AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1 (#163988)
See also #163972, which was intended to be this PR.

Triton (release/3.5.x) by default ships CUDA12.8 ptxas.
This PR tries to bundle a ptxas version for cuda13, so that it can help https://github.com/pytorch/pytorch/issues/163801 when users run on new devices like THOR and Spark.

Fixes https://github.com/pytorch/pytorch/issues/163801

Test Plan:

Check binary size increase against nightly or v2.9RC
Install the binary from into a working THOR and GB200/GH100 machine (reproduce the original issue first on THOR), then install the binary built from this PR and we expect the issue to be gone without any additional user setting. Testing on GB200 is to ensure no regression.
Reference: https://github.com/pytorch/pytorch/pull/119750 and 5c814e2527

Note: with this PR, the pytorch world's torch.compile is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the cuda13 ptxas binary.
However, as is, the triton world does not know the existence of this new cuda13 ptxas. So IF a users thinks there is already pytorch/bin/ptxas and delete the ptxas from triton, then  c6ad34f7eb/python/triton/knobs.py (L216) would still complain ptxas not found (if removed - it won't know this new one available)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163988
Approved by: https://github.com/atalman
2025-09-30 01:56:12 +00:00
4cf2900474 CUDACachingHostAllocatorImpl skip event query during capture (#164001)
The CUDACachingAllocator already does this, so there is precedent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164001
Approved by: https://github.com/eqy
2025-09-30 01:19:53 +00:00
474d07554a [dynamic shapes] unbacked-safe slicing (#161414)
Summary:
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.

Test Plan:
contbuild & OSS CI, see 56218d85e2

Rollback Plan:

Differential Revision: D80948073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161414
Approved by: https://github.com/laithsakka
2025-09-30 01:15:19 +00:00
089f9130ed Install fmtlib headers. (#164139)
`fmtlib` version was updated to 12.0.0 in #163441.

In this new version, due to https://github.com/fmtlib/fmt/pull/4536, PyTorch started not installing `fmtlib` headers anymore. Because of that, PyTorch/XLA build CI started to fail https://github.com/pytorch/xla/issues/9653. While we did fix it internally https://github.com/pytorch/xla/pull/9650, I believe that PyTorch should continue installing the `fmtlib` headers, since it is a dependency of its C API [`python_arg_parser.h`][1].

PyTorch/XLA CI was moved to `unstable.yml` in #159272, and later removed in #163564. This PyTorch/XLA build failure went under the radar, since the `fmtlib` update only landed on September 22.

[1]: 84d673ef57/torch/csrc/utils/python_arg_parser.h (L42)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164139
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-09-30 01:10:13 +00:00
da003d7b95 [3/N] Import Callable from collections.abc in torch/distributed (#164104)
This is the result of applying the ruff `UP035` check.
`Callable` is imported from `collections.abc` instead of `typing`.
This PR is the follow-up of #164054.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164104
Approved by: https://github.com/Skylion007
2025-09-30 00:28:53 +00:00
cee4e36f9a [BE] remove manylinuxcxx11-abi-builder:cpu-cxx11-abi docker image (#164187)
I believe this image is not used anywhere anymore.

Test:
```
git grep manylinuxcxx11-abi-builder
git grep manylinuxcxx11
```
Return no results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164187
Approved by: https://github.com/izaitsevfb, https://github.com/malfet, https://github.com/seemethere
2025-09-30 00:26:20 +00:00
704cd771f6 [PP] Customize pipeline's submod name (#164037)
Changing PP submodules' name from `submod_i` to `submod_pp_i` to distinguish from the submodule created by HOP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164037
Approved by: https://github.com/H-Huang
ghstack dependencies: #164045, #164035
2025-09-29 23:29:52 +00:00
d58f7c3ad1 [Easy] Add pointwise tag to fma (#164149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164149
Approved by: https://github.com/fmassa
2025-09-29 22:40:04 +00:00
170e0309ca Bump protobuf from 5.29.4 to 5.29.5 in /.ci/docker (#156157)
* Bump protobuf from 5.29.4 to 5.29.5 in /.ci/docker

Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 5.29.4 to 5.29.5.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Changelog](https://github.com/protocolbuffers/protobuf/blob/main/protobuf_release.bzl)
- [Commits](https://github.com/protocolbuffers/protobuf/compare/v5.29.4...v5.29.5)

---
updated-dependencies:
- dependency-name: protobuf
  dependency-version: 5.29.5
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update .ci/docker/requirements-ci.txt

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-29 15:20:44 -07:00
0f619c1f89 Revert "[inductor] do comm compute overlap at aten fx level (#163215)"
This reverts commit c9b5af9a384e7ef5f95613abe1622f5f55133c3a.

Reverted https://github.com/pytorch/pytorch/pull/163215 on behalf of https://github.com/yangw-dev due to seems fails inductor/test_aten_comm_compute_reordering for macos test, see c9b5af9a38 (51526707590-box) ([comment](https://github.com/pytorch/pytorch/pull/163215#issuecomment-3349177940))
2025-09-29 21:53:42 +00:00
b28e4f1f87 Revert "refactor bucketing (#163754)"
This reverts commit e1bd5b60cf243d3a026a6c89733488a6d9d4b33d.

Reverted https://github.com/pytorch/pytorch/pull/163754 on behalf of https://github.com/yangw-dev due to seems fails inductor/test_aten_comm_compute_reordering for macos test, see c9b5af9a38 (51526707590-box) ([comment](https://github.com/pytorch/pytorch/pull/163215#issuecomment-3349177940))
2025-09-29 21:53:42 +00:00
84dc54ae5e Revert "Helper to augment graph with additional deps (#163959)"
This reverts commit b5d4d350f573db12b8181ee13f9386d6ef8a1e57.

Reverted https://github.com/pytorch/pytorch/pull/163959 on behalf of https://github.com/yangw-dev due to seems fails inductor/test_aten_comm_compute_reordering for macos test, see c9b5af9a38 (51526707590-box) ([comment](https://github.com/pytorch/pytorch/pull/163215#issuecomment-3349177940))
2025-09-29 21:53:42 +00:00
50d418f69f Replace setup.py bdist_wheel with python -m build --wheel (#156712)
Previously we already replaced most use of `python setup.py develop/install`.

This PR also replaces the use of `setup.py bdist_wheel` with the modern `python -m build --wheel` alternative.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156712
Approved by: https://github.com/atalman
ghstack dependencies: #156711
2025-09-29 21:51:32 +00:00
c332d58184 [testing] upload test stats: Add info to the invoking file summary and some other changes (#164016)
* Changes some internal logic for grouping so hopefully it's slightly less annoying write code for
* Changes the invoking file summary to just use file, which I think is correct most of the time
* Adds some fields to the file summary, like skips, errors, etc so I can reuse it for file report regression things

Output should be the same, maybe with slightly more fields since I got rid of some of the pops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164016
Approved by: https://github.com/huydhn
2025-09-29 21:20:18 +00:00
efd7fd5ed5 Consistently use c10_ovrsource in arvr mode everywhere (#164128)
Summary:
Previously, many arvr targets transitively depended on c10, not c10_ovrsource,
because they either explicitly depended on c10 (because they didn't know
better) or they depended on legacy Caffe2, which never got the ovrsource
treatment.  So we found all these spots (driven by D82283623) and forced them
to query arvr mode to figure out which one they should use.  The goal is you
NEVER have both targets in the same build rule at the same time.

This diff could be reverted if D82224960 works out but I haven't gotten it to work yet.

Test Plan: sandcastle

Reviewed By: EscapeZero

Differential Revision: D82390436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164128
Approved by: https://github.com/albanD, https://github.com/malfet
2025-09-29 20:47:20 +00:00
b5d4d350f5 Helper to augment graph with additional deps (#163959)
In comm-compute overlap we will have a graph with:

```
def foo(...):
     ag = all_gather(...)
     hiding_compute = mm(...)
     wait(ag)
```

There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.

Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.

This pr adds `AugmentedGraphHelper` that adds the apis, and allows querying for dependency with this augmented graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163959
Approved by: https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: #163215, #163754
2025-09-29 20:43:12 +00:00
6db1b9dd21 [MPS] Chunk fillBuffer into 4Gb slices (#164108)
To avoid regression on MacOS 26, which one could observe by running the following script
```swift
import Metal

let bufferSize = 1<<32 + 4

guard let device = MTLCreateSystemDefaultDevice() else { fatalError("No Metal device found") }
guard let buffer = device.makeBuffer(length: bufferSize, options: .storageModeShared) else { fatalError("Failed to create buffer") }

guard let cmdQueue = device.makeCommandQueue() else { fatalError("Failed to create command queue") }
guard let cmdBuffer = cmdQueue.makeCommandBuffer() else { fatalError("Failed to create command buffer") }
guard let blitEncoder = cmdBuffer.makeBlitCommandEncoder() else { fatalError("Failed to create blit encoder") }

blitEncoder.fill(buffer: buffer, range: 0..<bufferSize, value: 0x42)
blitEncoder.endEncoding()

cmdBuffer.commit()
cmdBuffer.waitUntilCompleted()

let tailOffs = 8
let hostPtr = buffer.contents().bindMemory(to: UInt8.self, capacity: bufferSize)
let tail = Array(UnsafeBufferPointer(start: hostPtr + (bufferSize - tailOffs), count: tailOffs))

for (idx, val) in tail.enumerated() {
    print("Offs 0x\(String(bufferSize - tailOffs + idx, radix: 16)): 0x\(String(val, radix: 16))")
}
```

Test plan: run `test_indexing.py` on MacOS-26

Fixes https://github.com/pytorch/pytorch/issues/161265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164108
Approved by: https://github.com/Skylion007
2025-09-29 20:19:29 +00:00
9e792f583a Revert "[export] Skip the check instead of disable (#164084)"
This reverts commit c2768d0f5af840a94c342ed9eac3e26c819aa3f0.

Reverted https://github.com/pytorch/pytorch/pull/164084 on behalf of https://github.com/yangw-dev due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/164084#issuecomment-3348862668))
2025-09-29 20:09:13 +00:00
6650f5af74 Revert "[dynamo] Special path for cloning of torch dispatch tensors (#164081)"
This reverts commit 811c693c49f7cd3da2ea174955d12f2f8780bd46.

Reverted https://github.com/pytorch/pytorch/pull/164081 on behalf of https://github.com/yangw-dev due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/164084#issuecomment-3348862668))
2025-09-29 20:09:13 +00:00
349c960970 Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests (#163956)
Workaround for https://github.com/pytorch/pytorch/issues/163658

Looks like the workflow passes on 12.8 build that use inux.g4dn.4xlarge.nvidia.gpu but its failing on 12.6 builds that use linux.4xlarge.nvidia.gpu: https://github.com/pytorch/pytorch/actions/runs/17953843505/job/51080623612#step:13:470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163956
Approved by: https://github.com/malfet

Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
2025-09-29 19:38:17 +00:00
f090818a40 Rename remaining periodic and xpu workflows py3.9->py3.10 (#164127)
Fix naming py3.9 should be py 3.10
These jobs where already migrated to 3.10
Please see: https://github.com/pytorch/pytorch/actions/runs/18091356163/job/51472526131#step:16:224

```
Python version:
+ python --version
Python 3.10.18
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164127
Approved by: https://github.com/malfet
2025-09-29 19:26:21 +00:00
e1bd5b60cf refactor bucketing (#163754)
Preparatory refactory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163754
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #163215
2025-09-29 18:32:41 +00:00
c9b5af9a38 [inductor] do comm compute overlap at aten fx level (#163215)
This is first part of the stack that does comm/compute reordering, and then uses the exposure analysis to do bucketing.

Subsequent prs will handle:
- use of exposure analysis to do bucketing
- make sure inductor respects comm/compute overlapping done at fx level
- non-profiling mm estimation/rank broadcasting of profile results

Other mis:
- Validate accuracy of nccl estimations  ( use ruisi's profiling instead ?)

For a llama 2d parallelism test, on forward, we overlap all but 2 of potentially hidden collectives. For backward, we overlap 217/269 of potentially hidden collectives. If you increase `compute_overlap_multipler` (for fudge factor of inaccurate comms estimation), that goes down to all but 16 of potentially hidden collectives.

fwd example: https://gist.github.com/eellison/76209c49d8829c5f1e323d34a3f040c3

bwd example: https://gist.github.com/eellison/6cfc2285df53a94cfa4012f5fdae5c51

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163215
Approved by: https://github.com/IvanKobzarev
2025-09-29 18:18:03 +00:00
604da4bb9a [Inductor-FX] Support unbacked symbol definitions (#163729)
# Problem
Inductor sometimes generates unbacked symints to handle things like mismatched branches of `torch.cond`. This code is represented by `pytree.KeyPath`, with special codegen logic to convert it to Python and C++. This was not previously supported by the FX backend.

# Feature
This PR adds support for unbacked symbol declarations to the FX backend. The implementation is fairly straightforward.
1. Instead of raw Python/C++, update the wrapper codegen method to emit a new Wrapper IR line called `UnbackedSymbolDefsLine`. This contains all the information needed to  generate the Python and C++ code.
2. Move the existing Python/C++ codegen to a private method, which is invoked by `UnbackedSymbolDefsLine.codegen()`.
3. Implement a method to generate FX IR from unbacked symbol definitions. The implementation is based on recursive descent, consuming some keypath entries, emitting an FX IR node, and recursing to the rest of the keypath. It is conceptually identical to the existing algorithm for Python and C++, except it generates FX nodes.
4. The FX backend currently relies on size hints to generate autotuning arguments, and consequently autotuning does not support unbacked SymInts. At some point, we would like to generalize the autotuning logic to support these. But for now, simply emit a warning and skip autotuning when we see them.
5. The new test case exposed some tricky issues reconciling Triton call args with constants stored in `triton_meta`. This PR rewrites the relevant helper function to do this in a more principled way.

# Test plan
This PR imports an existing control flow test to the FX backend's test suite. The test uses unbacked symbol definitions to handle mismatched dynamic shapes coming from `torch.cond` branches.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163729
Approved by: https://github.com/jansel
2025-09-29 18:10:37 +00:00
8f32adc90a [MPSHooks] Release pending command encoder (#164093)
Before returning a comand buffer, as subsequent calle are very likely to allocate their own encoder, which results in the following runtime error
```
 tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]:1090: failed assertion `A command encoder is already encoding to this command buffer'
```

Added regression test to `test_mps_extension`

Please note, that `torch::mps::get_command_buffer()` should be called with dispatch_queue held, both before and after this change, but many implementations skip that

Fixes https://github.com/pytorch/pytorch/issues/163721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164093
Approved by: https://github.com/atalman, https://github.com/Skylion007
2025-09-29 17:50:12 +00:00
3fa3bfbfda [EZ][BE] Fix unused parameter warnings in EmbeddingBag (#164135)
Before this change following were emitted during compilation
```
[7/31] Compiling /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal to EmbeddingBag_31.air
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:28:12: warning: unused parameter 'is_first' [-Wunused-parameter]
      bool is_first) {
           ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:47:16: warning: unused parameter 'per_sample_weights_index' [-Wunused-parameter]
      uint32_t per_sample_weights_index,
               ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:48:19: warning: unused parameter 'per_sample_weights' [-Wunused-parameter]
      constant T* per_sample_weights,
                  ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:49:16: warning: unused parameter 'per_sample_weights_stride' [-Wunused-parameter]
      uint32_t per_sample_weights_stride) {
               ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:74:19: warning: unused parameter 'weight_val' [-Wunused-parameter]
      opmath_t<T> weight_val,
                  ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:75:19: warning: unused parameter 'out_val' [-Wunused-parameter]
      opmath_t<T> out_val,
                  ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:76:12: warning: unused parameter 'is_first' [-Wunused-parameter]
      bool is_first,
           ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:77:17: warning: unused parameter 'max_idx' [-Wunused-parameter]
      thread I& max_idx,
                ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:78:9: warning: unused parameter 'weight_idx' [-Wunused-parameter]
      I weight_idx,
        ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:79:12: warning: unused parameter 'pad' [-Wunused-parameter]
      bool pad) {}
           ^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164135
Approved by: https://github.com/Skylion007
2025-09-29 17:44:09 +00:00
8701f18bc0 Adjust ...mark_unbacked() -> ...decorators.mark_unbacked() in logs. (#164131)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164131
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-29 17:44:00 +00:00
a56e7a1920 [Max Autotune][B200] Add addmm config to avoid test OOM (#164020)
Summary: Add a new `addmm` config that is small enough to not cause an OOM (out of memory error), since the configs for `blackwell_persistent_mm_configs`, which `addmm` used, are too large.

Test Plan: `test_max_autotune.py`

Differential Revision: D83378477

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164020
Approved by: https://github.com/coconutruben, https://github.com/njriasan
2025-09-29 17:38:46 +00:00
e2c894c97d [Inductor][ATen][FP8] Relax stride check for block-wise scaling when scaling dimension is 1 (#163829)
Summary: Relax stride check for block-wise scaling (1x128, 128x128) when a dimension of the scaling factor is 1. When the scaling tensor has a dimension of size 1, the stride is effectively "meaningless" to PyTorch, i.e. PyTorch decides to replace its stride with a default of `[1, 1]`. However, the old stride check required the stride to match one of the scaling dimensions. Here, we relax the stride check when the effective stride is 1 in order to allow for cases in which `K <= 128` and `N <= 128`.

Test Plan:
```
pytest -s -v test/test_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_block_wise_float32_lhs_block_1_rhs_block_128_cuda   2>&1 | tee ~/personal/stride_check.log
```

Differential Revision: D83023706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163829
Approved by: https://github.com/lw, https://github.com/eqy
2025-09-29 17:28:26 +00:00
6b473c90cf Revert "[inductor] require shape in TritonCSEVariable (#162275)"
This reverts commit c257570e6cd25753f9f0a640b965148ead2cf918.

Reverted https://github.com/pytorch/pytorch/pull/162275 on behalf of https://github.com/jeffdaily due to sorry this broke rocm CI; inductor/test_select_algorithm.py::TestTemplateRender::test_finalized_subclass_hooks [GH job link](https://github.com/pytorch/pytorch/actions/runs/18048893250/job/51366715091) [HUD commit link](c257570e6c) ([comment](https://github.com/pytorch/pytorch/pull/162275#issuecomment-3348159095))
2025-09-29 17:26:54 +00:00
6bcc6bbc85 [Inductor][FP8] Add op_name for ScaledMM TMA template heuristic (#164019)
Summary: For H100s and below, add `op_name="scaled_mm"` to the template heuristic for `CUDAScaledTMATemplateConfigHeuristic` such that `scaled_mm` persistent + TMA tests do not default to the "mm" heuristics.

Test Plan: `test_max_autotune.py`

Differential Revision: D83390775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164019
Approved by: https://github.com/njriasan
2025-09-29 17:24:26 +00:00
95be302889 Skip test_conv3d_cudnn_broken on ROCM (#164138)
Followup after https://github.com/pytorch/pytorch/pull/163903  Fixes https://github.com/pytorch/pytorch/issues/164137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164138
Approved by: https://github.com/Camyll
2025-09-29 16:56:51 +00:00
f433e681b9 Remove export of slice_in_dim (#164117)
Cannot find `slice_in_dim` in OSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164117
Approved by: https://github.com/soulitzer
2025-09-29 16:56:14 +00:00
5ff2387dbe Fix comment on broadcasting example to clarify dimension mismatch (#162177)
Fixes #162116

Updated the comment in the broadcasting example to clarify that tensors with mismatched dimension sizes (0 vs 2) are not broadcastable. Removed incorrect reference to missing dimensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162177
Approved by: https://github.com/soulitzer
2025-09-29 16:47:48 +00:00
84b57c93db [MPSInductor] Unskip test_repeat_interleave_Tensor_decomp (#164136)
Not sure what was the problem, but it passes for me locally

Fixes https://github.com/pytorch/pytorch/issues/159408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164136
Approved by: https://github.com/v0i0
2025-09-29 16:20:34 +00:00
069ccf5f1e [inductor] pdl: enable launch and deduplicate waits (#162014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162014
Approved by: https://github.com/eellison
2025-09-29 16:10:26 +00:00
1c12d7416b [SDPA] [MPS] Fixes regression in 2.8.0 for scaled_dot_product_attention using mps (#163598)
Fixes #163597

- Updates fast SDPA implementations to take in query tensor stride info similar to key and value instead of assuming stride.
- Updated tests with additional transpose/permutation layouts. New tests catch the regression.

### Benchmarking with script found in [implementation PR](https://github.com/pytorch/pytorch/pull/152781#:~:text=19.8%25%20speed%20improvement-,Script%20to%20get%20perf%3A,-import%20torch%0Aimport)

Times are averaged over 100000 iterations. This change should not have any significant performance difference. Tested on an M3 Pro

### Vector Fast Path (q_len=1, k_len=256)

- Before: 0.160 ms
- After: 0.157 ms

### Vector 2-pass (q_len=1, k_len=4096)

- Before: 0.342 ms
- After: 0.339 ms

### Vector Fast Path (q_len=8, k_len=256)

- Before: 0.228 ms
- After: 0.231 ms

### Vector 2-pass (q_len=8, k_len=4096)

- Before: 0.432 ms
- After:  0.436 ms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163598
Approved by: https://github.com/malfet
2025-09-29 16:09:46 +00:00
3746039b47 [inductor] fix: 'get_raw_stream' undefined (#163707)
Summary:
ran into this when precompiling baidu/ERNIE-4.5-21B-A3B-PT

codegen after fix:
```py
import triton
import triton.language as tl
from torch._inductor.runtime.triton_heuristics import start_graph, end_graph
from torch._C import _cuda_getCurrentRawStream as get_raw_stream
with torch.cuda._DeviceGuard(0):
    stream0 = get_raw_stream(0)
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163707
Approved by: https://github.com/jamesjwu
2025-09-29 15:48:16 +00:00
872edd89d6 Enable outer reductions in fbcode (#163884)
Summary: Enabling the outer reduction optimization in fbcode

Test Plan: Evals in https://docs.google.com/document/d/1-tcItRsyEaibaXL56Zq2-CWh5wCmHXDDgDQT_9uOvXE/edit?tab=t.0#bookmark=id.tkgzaitxacg0

Differential Revision: D81948542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163884
Approved by: https://github.com/Skylion007
2025-09-29 15:25:17 +00:00
47ed41109f Fix PgNccl coalseced profiling (#160680)
Admittedly I'm a noob when looking at traces, but this looked pretty off to me:
<img width="1528" height="824" alt="Screenshot 2025-08-14 at 5 27 49 PM" src="https://github.com/user-attachments/assets/871e7b4c-0e47-4c84-97cc-8198b7b76d4b" />
1. Why are there so many "nccl:coalesced" on the CPU thread
2. Why is there "nccl:coalesced" on compute stream (stream 7)

Here is what is happening:

**CPU side**: In `endCoalescing`, we create a [work object ](3be70dc30e/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L3473)) with the profiling title "nccl:coalesced"
**GPU side**: The CUDA kernels will inherit this profiling title

What is missing:

We forgot to call the record function [callback](3be70dc30e/torch/csrc/distributed/c10d/Work.cpp (L35-L38)). With this change we finishs immediately on the CPU side, but the ncclDevKernel_SendRecv still have the coalesced title. New trace looks like this:

<img width="1123" height="637" alt="image" src="https://github.com/user-attachments/assets/f015fd64-85cd-452a-be24-3e7724f84e44" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160680
Approved by: https://github.com/fegin, https://github.com/kwen2501
2025-09-29 15:21:55 +00:00
fa54b08cd5 Replace setup.py install with pip install (#156711)
#156027 already replaced most use of `python setup.py install`.
This PR only adds a few more occurrences and adds `--no-build-isolation` in a few places.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156711
Approved by: https://github.com/atalman
2025-09-29 15:15:10 +00:00
92284fb2ff Add SVE128 ISA (#158932)
Summary: Partly Importing and adapting https://github.com/pytorch/pytorch/pull/138388, adding SVE128 as ISA.

Intention is to add SVE128 translation layers for Vectorized data types.
Idea is to have 1 PR per file, aside from the current one, plus a last one modifying cmake files to enable the new ISA selectively.

Tested current changes on a nightly run, to verify no regressions occur on systems leveraging SVE256.

No regressions spotted when running test_ops.py, a set of 34k unit tests. A machine leveraging SVE128 was used towards this testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158932
Approved by: https://github.com/malfet
2025-09-29 14:49:19 +00:00
84d673ef57 Add less warps config to inner reductions (#162447)
Add less warps to ensure proper vectorization + memory coalescing for inner reductions, prefer more work per thread

<img width="1717" height="731" alt="Screenshot 2025-09-17 at 10 03 25 AM" src="https://github.com/user-attachments/assets/7b1f4a30-62f2-4bee-bb9c-122501bde63e" />

Differential Revision: [D83343892](https://our.internmc.facebook.com/intern/diff/D83343892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162447
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
2025-09-29 13:48:36 +00:00
d633bac252 Update issue templates adding a DISABLE AUTOREVERT option (#163858)
This should be used to disable autorevert functionality if users feels the need to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163858
Approved by: https://github.com/izaitsevfb
2025-09-29 13:10:05 +00:00
d81476e211 [xla hash update] update the pinned xla hash (#163494)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163494
Approved by: https://github.com/pytorchbot
2025-09-29 12:31:16 +00:00
a0ae2f9aa0 Update slow tests (#163493)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163493
Approved by: https://github.com/pytorchbot
2025-09-29 11:58:17 +00:00
615da7b95e [fx] Allow customization of submod name in split graph (#164035)
Fixes #164030: HOP and pipelining both name things submod_i
by adding an optional argument `partition_affix` to `split_module` API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164035
Approved by: https://github.com/ezyang
ghstack dependencies: #164045
2025-09-29 09:16:36 +00:00
4fd70d4e7b [1/N]Enable some tests in test_ops.TestCommon on Intel GPU (#159944)
For https://github.com/pytorch/pytorch/issues/114850, we will port aten unit tests to Intel GPU. This PR will work on some test case of test/test_ops.py. We could enable Intel GPU with following methods and try the best to keep the original code styles:

1. Extended XPUTestBase.get_all_devices to support multiple devices
2. Added skipXPU decorator
3. Extended onlyOn to support device list
4. Enabled 'xpu' for some test pathes
5. Added allow_xpu=True for supported test class.
6. Replaced onlyCUDA with onlyOn(['cuda', 'xpu']) for supported tests
7. Use skipIfXpu and skipXPU to disable unsupported test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159944
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
2025-09-29 09:08:04 +00:00
e1e5e040cd [dynamo][export] Add some missing trace rules (#164080)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164080
Approved by: https://github.com/tugsbayasgalan
2025-09-29 08:47:24 +00:00
5ddad22196 [PP] Use default export mode (non-strict) (#164045)
export's default mode has switched from strict to non-strict. We just follow suit in PP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164045
Approved by: https://github.com/H-Huang
2025-09-29 06:31:06 +00:00
90512fa5bd [Quant] extend the op list for quant lift up (#163621)
Add `aten.reshape.default` into the op list of quant lift up, in order to fuse more potential quantized kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163621
Approved by: https://github.com/mingfeima, https://github.com/Xia-Weiwen, https://github.com/jansel
2025-09-29 06:14:45 +00:00
48a5470cf8 [CUDA] fix indexing on large tensor causing nvalid configuration argument (#164049)
Fixes #164048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164049
Approved by: https://github.com/eqy
2025-09-29 06:07:35 +00:00
b9854c9d89 [Inductor][CPP] Fix the test case of test_linear_reuse_kernels (#163723)
Fixes #163491.
Add tolerances to make `test_linear_reuse_kernels` more stable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163723
Approved by: https://github.com/leslie-fang-intel
2025-09-29 05:29:01 +00:00
eb4361a801 [Fix] Adding missing f prefixes to formatted strings [1/N] (#164065)
As stated in the title.

* #164068
* #164067
* #164066
* __->__ #164065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164065
Approved by: https://github.com/Skylion007
2025-09-29 04:53:00 +00:00
d131f213ac [vllm hash update] update the pinned vllm hash (#164092)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164092
Approved by: https://github.com/pytorchbot
2025-09-29 04:41:06 +00:00
7c7ae86991 [Fix] Adding missing f prefixes to formatted strings [2/N] (#164066)
As stated in the title.

* #164068
* #164067
* __->__ #164066
* #164065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164066
Approved by: https://github.com/Skylion007
2025-09-29 04:40:44 +00:00
ad32ed83b3 [Fix] Adding missing f prefixes to formatted strings [3/N] (#164067)
As stated in the title.

* #164068
* __->__ #164067
* #164066
* #164065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164067
Approved by: https://github.com/Skylion007
2025-09-29 04:35:23 +00:00
d8becd1cf4 [dynamo][export] Make the source_stack and fqn info same between dynamo and export (#164085)
preparing for landing the install_free_tensors flag

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164085
Approved by: https://github.com/tugsbayasgalan
2025-09-29 04:35:13 +00:00
e64dd8c694 [Fix] Adding missing f prefixes to formatted strings [4/N] (#164068)
As stated in the title.

* __->__ #164068
* #164067
* #164066
* #164065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164068
Approved by: https://github.com/Skylion007
2025-09-29 04:07:07 +00:00
047ae24e34 Eliminate setup.py install/develop in the codebose (#162329)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162329
Approved by: https://github.com/ezyang
2025-09-29 03:54:28 +00:00
3cda34ebde [2/N] Apply ruff UP035 check in torch files (#164054)
This is the result of applying the ruff `UP035` check.
`Callable` is imported from `collections.abc` instead of `typing`.
`TypeAlias` is also imported from `typing`.
This PR is the follow-up of #163947.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164054
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-09-29 03:35:32 +00:00
352197c508 Remove old ROCm skip conditions in tests (#164058)
This PR removes skip conditions for ROCM <= 3.5.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164058
Approved by: https://github.com/kwen2501
2025-09-29 03:00:58 +00:00
811c693c49 [dynamo] Special path for cloning of torch dispatch tensors (#164081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164081
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #164084
2025-09-29 01:44:44 +00:00
c2768d0f5a [export] Skip the check instead of disable (#164084)
Its unclear why we had disable in the first place. With
install_free_tensors, we are tracing into this hook. A better way would
be to place the tracer without any hook. For now, disable the checking
while dynamo is tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164084
Approved by: https://github.com/tugsbayasgalan
2025-09-29 01:44:44 +00:00
a8c528c105 [1/N] Apply UP035 rule in tests (#163947)
Apply UP035 `ruff` rule in tests, but some tests for `fx` and `dynamo` are excluded in case the old typing is the test target.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163947
Approved by: https://github.com/ezyang
2025-09-29 01:42:01 +00:00
dc54ce7554 [hops] Support unspecialized nn module for export hops (#164082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164082
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #164079
2025-09-29 01:34:10 +00:00
1981ed4f60 [dynamo][logging] Add to param_count only if metrics_count is active (#164079)
This is rare but happens with executorch tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164079
Approved by: https://github.com/tugsbayasgalan
2025-09-29 00:59:18 +00:00
54b38f3b46 Add operator benchmarking run to CI nightly (#162530)
This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus more on different cuda hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.

**Benchmark Configuration and Coverage:**

* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward run. The configs with tag "long" for the above mentioned files are being run in CI.
* The CI benchmarking is running on various hardwares: H100, A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162530
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-29 00:46:38 +00:00
bc5a072ebf fixes import error 'functionalize' from functorch (#163746)
Fixes #163637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163746
Approved by: https://github.com/malfet
2025-09-28 23:16:45 +00:00
d1b3481131 registraion replaced with registration in jit_type.h file comment (#164072)
Fixes #164071

typo correction done
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164072
Approved by: https://github.com/Skylion007
2025-09-28 22:55:24 +00:00
3766513d25 Remove C++ workarounds for Python < 3.10 (#164055)
Remove two unnecessary `PY_VERSION_HEX` branches.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164055
Approved by: https://github.com/ezyang
2025-09-28 20:00:02 +00:00
ea6846b231 [CI] Remove the unnecessary workflow related functorch (#162581)
The [docs](https://docs.pytorch.org/functorch/stable/) about `functorch` has been migrated into [PyTorch Doc](https://docs.pytorch.org/docs/stable/func.html) since PyTorch 2.0, so I think we can remove it right now to reduce the compute resources usages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162581
Approved by: https://github.com/ezyang
2025-09-28 19:56:20 +00:00
f6537d9616 Move control flow export tests to new tracer (#163259)
Differential Revision: [D82732614](https://our.internmc.facebook.com/intern/diff/D82732614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163259
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #163136, #163137, #163258
2025-09-28 19:56:09 +00:00
cc0332563e Use new_tracer_experimental for torchao strict export (#163258)
Export team is fixing up the old strict export implementation, as a result it fails a check where we proxy the whole module under given directories. _WrapperModule is a way for torchao to workaround the issue where export requiring nn.module to trace so it should never get proxied in the graph.

Differential Revision: [D82732613](https://our.internmc.facebook.com/intern/diff/D82732613)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163258
Approved by: https://github.com/anijain2305
ghstack dependencies: #163136, #163137
2025-09-28 19:55:54 +00:00
8239ba4087 Fix various bugs in subclass input in export (#163770)
This adds basic support for subclass inputs in export (specifically for non-strict). I had to make fakify little more complicated which risks further divergence from dynamo fakification. But dynamo one is so complex, so i feel it is better to do this way. Also improved fake mode detection logic to recursively look into subclass inner tensors.

Differential Revision: [D83156489](https://our.internmc.facebook.com/intern/diff/D83156489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163770
Approved by: https://github.com/avikchaudhuri
2025-09-28 18:03:32 +00:00
1fdd99de71 Building guards should be under metrics_context (#163967)
Differential Revision: [D83354042](https://our.internmc.facebook.com/intern/diff/D83354042)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163967
Approved by: https://github.com/avikchaudhuri
2025-09-28 16:28:34 +00:00
38ed608956 Better error handling in torch/nativert/* (#163308)
Replace the **runtime_error** of the vallina C++ exceptions with **TORCH_CEHCK** in **torch/nativert/***

The vallina C++ exception should not exist in the core part of pytorch for its corss-languanges trait. Comparing with the vallina C++ exceptions, TORCH_CHECK have the richer error context and It has the unified error handling mechanism. This commit replace the runtime_error with TORCH_CHECK of the files in
torch/nativert/*   .

Fixes part of #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163308
Approved by: https://github.com/dolpm
2025-09-28 14:23:44 +00:00
238dc65368 [ROCm] use hipSolver instead of MAGMA for Cholesky (#163977)
Currently, the Cholesky factorization and least squares operation defaults to magma when Pytorch is compiled for ROCm. This shows suboptimal performance.
This change allows PyTorch to rely on hipSolver instead of Magma.
@jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163977
Approved by: https://github.com/Skylion007
2025-09-28 06:53:06 +00:00
7bbde0c094 Remove unused argument from DEFINE_BINARY macro. (#163868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163868
Approved by: https://github.com/Skylion007
ghstack dependencies: #163822
2025-09-28 06:32:41 +00:00
dfcab0e7e1 Handle DDE in infer_size_impl (#163822)
hit this while running VLLM with unbacked for model Qwen/Qwen2-1.5B-Instruct

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163822
Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007
2025-09-28 06:32:41 +00:00
1cc9263f52 [vllm hash update] update the pinned vllm hash (#164053)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164053
Approved by: https://github.com/pytorchbot
2025-09-28 04:35:17 +00:00
c2862c8e66 [distributed] Remove python code older than 3.10 (#163613)
Because now that the minimum Python version is 3.10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163613
Approved by: https://github.com/XuehaiPan, https://github.com/kwen2501
2025-09-28 04:15:24 +00:00
b377c9e365 graph break on tolist if capture_scalar_outputs is false (#163807)
address https://github.com/pytorch/pytorch/issues/163798

its problematic to not graph break because:
1. break current contract.
2. well dynamo trace then we have .item call then if we ever re-trace later in autograd for example we hit a
 failure (We do not know where to graph break at that point)! see the added unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163807
Approved by: https://github.com/bobrenjc93
2025-09-28 04:02:52 +00:00
3059b08012 [inductor] add subsystem to pattern matcher (#163922)
Summary:
Running a toy example through `torch.compile(fullgraph=True, backend="inductor")` with default inductor config, I tried to see what passes are run in each of pre-grad, joint-graph, and post-grad phases by printing out the subsystem in `GraphTransformObserver`. However the subsystem showed up as None in a bunch of transforms that were run in each of those phases, so this PR adds some additional annotations.

Note that these annotations are probably not a complete set, since other transforms may run based on changes to the config that are not covered here.

Hopefully this doesn't change behavior. However, I did notice that bisecting relies on disabling various phases, which means that while before some passes would *not* be disabled (because their subsystem was `None`), now they would.

Test Plan: existing tests + manual test described in summary

Differential Revision: D83306676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163922
Approved by: https://github.com/jansel
2025-09-28 03:15:23 +00:00
5504a06e01 [BE]: Update NCCL to 2.28.3 (#162351)
@eqy New NCCL has some a bunch of bugfixes for features including reducing the number SMs needed by NVLINK collectives as well as some very useful new APIs for SymmetricMemory.  Also allows FP8 support for non-reductive operations on pre-sm90 devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162351
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/atalman
2025-09-28 01:38:59 +00:00
1ad491dd88 Better error handling in torch/csrc/jit/ir/* (#163757)
Refactor error handling to use TORCH_CHECK for improved clarity in constants and scope management

Fixes some parts of ISSUE #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163757
Approved by: https://github.com/albanD
2025-09-28 01:18:24 +00:00
fd20889d0b Add type annotations to MPS profiler utilities (#163486)
## Summary
- drop the local mypy allow-untyped-defs escape hatch in the MPS profiler helpers
- annotate the context managers and bool helpers so they type-check cleanly

## Testing
- python -m mypy torch/mps/profiler.py --config-file mypy-strict.ini

------
https://chatgpt.com/codex/tasks/task_e_68d0ce4df2e483268d06673b65ef7745
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163486
Approved by: https://github.com/Skylion007
2025-09-27 23:00:53 +00:00
2ce2e48a05 [WIP][symm_mem] Add a wait for signal and put signal for one side API (#159837)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159837
Approved by: https://github.com/kwen2501
2025-09-27 21:20:13 +00:00
1d98be6abf [NFC] fixed typo in sparse semi structured filename (#163904)
Make sure all semi structured files use "SparseSemiStructured"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163904
Approved by: https://github.com/Skylion007
2025-09-27 21:19:48 +00:00
dfda239cce [DTensor] Raise an RuntimeError when checkpointing APIs are used with Partial placement (#163941)
A DTensor that contains partial placement shouldn't be checkpointed (DCP.save) -- the result is not correct and DCP doesn't know how to handle it.

There are several APIs that are only used by checkpointing, e.g.,`__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has Partial placement.

Ideally, we want to add the following test:

```
        with self.assertRaisesRegex(
            RuntimeError, "Any checkpointing related operations are not supported for"
        ):

            dcp.save({"dtensor": dtensor}, checkpoint_id=tempfile.gettempdir())
```

While we do see the RuntimeError is raised, the error was raised in another thread due to DTensor checkpoint APIs are called by DCP in a separate thread, which assertRaisesRegex cannot capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163941
Approved by: https://github.com/tianyu-l
2025-09-27 19:50:16 +00:00
991e3d0d16 [dynamo][guards] Revert introduction of different types of lambda_guards (#163385)
With
https://fb.workplace.com/groups/260102303573409/permalink/787294574187510/
issue, it might be a better idea to just speedup _realize_dict and keep
the changes very local. So reverting this PR as well, to return to clean
slate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163385
Approved by: https://github.com/jansel
2025-09-27 18:20:48 +00:00
8f6dbc0ba8 [scan] create fw and bw graphs via partitioning (#162754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162754
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664, #161808, #162025, #161732
2025-09-27 18:13:15 +00:00
3413490f53 [scan] materialize combine_fn in forward add more autograd tests (#161732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161732
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664, #161808, #162025
2025-09-27 18:13:15 +00:00
b85bee3bbb [hop] refactor check input alias and mutation to be a graph pass (#162025)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162025
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664, #161808
2025-09-27 18:13:15 +00:00
66dbf2c9f5 [scan][autograd] clone outputs that's aliasing with inputs or outputs in bw (#161808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161808
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664
2025-09-27 18:13:15 +00:00
f5d85874dd [scan][be] remove unnecessary tensor checks (#161664)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161664
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #161557
2025-09-27 18:13:14 +00:00
8f15d6a0c9 [test][scan] refactor inductor test and prepare for adding bw tests (#161557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161557
Approved by: https://github.com/zou3519
2025-09-27 18:13:14 +00:00
e78792a70d Update ctc loss docs float32 input required for CuDNN (#162042)
Discovered while working on https://github.com/pytorch/pytorch/pull/159106 the non-obvious requirement that inputs must be float32 to use CuDNN (https://github.com/pytorch/pytorch/pull/159106#issuecomment-3189981705), otherwise the native CUDA implementation is called.

Updates the docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162042
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2025-09-27 18:10:17 +00:00
d9db838f58 [CI] Re-enable test_all_to_all_vdev_2d_offset (#163985)
Fixes https://github.com/pytorch/pytorch/issues/163847
Moving allocations upfront and collectives later. The hang goes away.

My investigation indicates that the hang is inside the last call `torch.testing.assert_close(out_expected, out[:out_numel])`. Rank 3 calls into it, but never gets out. Don't know why yet. I will investigate more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163985
Approved by: https://github.com/fegin
2025-09-27 16:56:25 +00:00
6ba83e06a5 [AMP] Add deprecated decorator for torch.xxx.amp.autocast class (#163654)
As the title stated.

**Changes:**
- torch.cuda.amp.autocast
- torch.cpu.amp.autocast
- add explicit `__new__` and `__init_subclass__` for those class above for inspect.signature to retrieve correct signature

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163654
Approved by: https://github.com/Skylion007
2025-09-27 14:37:12 +00:00
960290d629 [Docs] Add standard-imghdr for PyTorch Doc (#163944)
As the title stated.

Python [Pep-0594](https://peps.python.org/pep-0594) have removed imghdr from python standard libaries, the older version of sphinx don`t add it as installation dependencies, so we need to add it to requirement as an temporary dependencies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163944
Approved by: https://github.com/albanD, https://github.com/svekars
2025-09-27 08:14:51 +00:00
b1a4efc302 [amd] Add cudaHostFn_t to cuda_to_hip_mappings (#164007)
Summary: See title

Test Plan:
```
buck build --flagfile fbcode//mode/opt-amd-gpu fbcode//comms/ctran/algos/common/tests:ctran_algo_gpe_kernel_sync_test
```
After fix: https://www.internalfb.com/buck2/362ff91e-53f2-4b82-9536-cb84c91384a2

Before fix: failed in D83294731 (version 1):
https://www.internalfb.com/sandcastle/workflow/1792432651703947243

Differential Revision: D83375414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164007
Approved by: https://github.com/llxxee
2025-09-27 06:09:50 +00:00
96182faf96 [CI][Distributed][CUDA][Symm-Mem] Enable B200 Symm Mem Test (#162988)
Inspired by https://github.com/pytorch/pytorch/pull/162981 and motivated by https://github.com/pytorch/pytorch/pull/159323 taking a total of 20 hours to finish (and unlikely to make it in short time due to https://github.com/pytorch/pytorch/issues/162178 )

Creating this subtest to get *something distributed* on B200.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162988
Approved by: https://github.com/malfet
2025-09-27 05:12:05 +00:00
dcb8af7501 [torchfuzz] fix bool propagation (#164003)
bools can't propogate through the current pointwise ops such as add/mul. once we add more that can, we'll probably want to add an additional subclass that supports pointwise bools, but for now just don't allow it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164003
Approved by: https://github.com/pianpwk
ghstack dependencies: #163743, #163812, #163890, #164002
2025-09-27 04:51:29 +00:00
280e712c13 [vllm hash update] update the pinned vllm hash (#164029)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164029
Approved by: https://github.com/pytorchbot
2025-09-27 04:34:57 +00:00
254d2864d6 Add runtime_overhead PR Time Benchmark (#163866)
This adds a PR time benchmark that checks for runtime overhead on a very small graph. This will help track regressions in runtime overhead.

Example Results:
```
runtime_overhead_inductor,instruction_count,222645
runtime_overhead_inductor_inference_mode,instruction_count,234998
runtime_overhead_inductor_requires_grad,instruction_count,293556
runtime_overhead_inductor_requires_grad_backward,instruction_count,78181
runtime_overhead_inductor_dynamic,instruction_count,234870
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,248711
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,309979
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77599
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163866
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
2025-09-27 03:26:59 +00:00
9dac6437da lint: Filter out /usr/include from results (#164012)
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164012
Approved by: https://github.com/ZainRizvi
ghstack dependencies: #164008
2025-09-27 00:54:07 +00:00
8a0e8cad5f lint: Only include files in pytorch (#164008)
We were seeing instances of stdlib files in clang-tidy output so this
just essentially removes them from the things that lintrunner will
report up. Longer term fix here would be to just modify the clang-tidy
configuration in order to do the correct thing here but that requires a
bit more investigation as to why this is only happening in CI and is not
reproduceable locally.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164008
Approved by: https://github.com/ZainRizvi
2025-09-27 00:54:07 +00:00
3a115da3e6 [torchfuzz] ones over zero (#164002)
reduces likelihood of divide by zero errors. long term we'll probably want to just fuzz these values entirely

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164002
Approved by: https://github.com/pianpwk
ghstack dependencies: #163743, #163812, #163890
2025-09-27 00:53:02 +00:00
b48a3d0a38 [CuTe] Add layout overlap checking util function in _MeshLayout (#163367)
While refactoring the bookkeeping for DeviceMesh while leveraging CuTe layout, we found that we need to have two more util functions. One is to check whether one layout has overlap inside it or not. For example, (2,2):(2:1) has no overlap while (2,2):(2:2) has overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163367
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288, #163928, #163930
2025-09-27 00:22:14 +00:00
8d474bdc14 Change python grid calc for MTIA back to python mode (#163601)
Differential Revision: D83000165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163601
Approved by: https://github.com/blaine-rister
2025-09-27 00:12:53 +00:00
008051b13c [Dynamic Shape][BE] trim _DimHint serialization (#163891)
Summary:
current serialization is a bit hard to read
```
Exporting with the dynamic shape spec: {getitem_123: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=1, max=64, _factory=False)), getitem_118: (_DimHint(type=<_DimHintType.DYNAMIC: 3>,
min=489, max=31232, _factory=False)), getitem_117: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=489, max=31232, _factory=False)), getitem_116: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=489, max=31232, _factory=False)), getitem_115: (
_DimHint(type=<_DimHintType.STATIC: 2>, min=None, max=None, _factory=True), _DimHint(type=<_DimHintType.DYNAMIC: 3>, min=1, max=64, _factory=False)), getitem_46: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=29, max=1792, _factory=False),
 _DimHint(type=<_DimHintType.STATIC: 2>, min=None, max=None, _factory=True)), _predict_module__base_model_model_ro_sparse_arch_ebc__output_dists_0__dist: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=1, max=64, _factory=False), _DimHint(t
ype=<_DimHintType.STATIC: 2>, min=None, max=None, _factory=True)), _predict_module__base_model_model_nro_sparse_arch_ebc__output_dists_0__dist: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=29, max=1792, _factory=False)...
```

Test Plan: UT

Differential Revision: D83175131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163891
Approved by: https://github.com/pianpwk
2025-09-27 00:08:01 +00:00
e4ffd718ec Fix setting of memory fraction in test_garbage_collect_expandable (#164000)
Fixes #160598
Fixes #160551
Fixes #160507

This PR fixes a bug in the `test_garbage_collect_expandable` unit test where the finally block incorrectly re-reads the current per process memory fraction instead of setting the original value. With out the fix the other tests in the `test/test_cuda.py` test suite were impacted and failed with OOM error on ROCm.

This ensures proper cleanup and isolation of test state, maintaining test correctness and avoiding side effects like the below OOM error that it caused.

For example, `test_autocast_checkpointing`  failed with the below error https://github.com/pytorch/pytorch/actions/runs/17982223758/job/51153974194 on ROCm

`torch.OutOfMemoryError: HIP out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 255.69 GiB of which 252.97 GiB is free. 1.20 GiB allowed; Of the allocated memory 1.14 GiB is allocated by PyTorch, with 17.00 MiB allocated in private pools (e.g., HIP Graphs), and 18.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164000
Approved by: https://github.com/jeffdaily
2025-09-26 23:57:32 +00:00
ed3085814a [cuDNN][SDPA] Disable dropout for cuDNN SDPA on 9.11 - 9.13 (#163903)
cuDNN introduced some broken heuristics for these cases so we need to disable dropout to avoid unexpected crashes due to heuristics refusing to proceed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163903
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/atalman
2025-09-26 23:50:09 +00:00
e2817ac204 [cuDNN][Convolution] Disable cuDNN for 3D convolutions with kernel size != 1 for cuDNN 9.8+ (#163581)
To workaround #163539

Still confirming whether 9.10 is affected. The original test states that the convolution is "large," but note that the input size does not apepar to require 64-bit indexing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163581
Approved by: https://github.com/ngimel, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-26 23:47:29 +00:00
1d138e658d [AOTI] log error triton kernel name during autotune (#163889)
Summary: can't tell from current error msg which kernel got exception

Test Plan: lint & pyre

Reviewed By: muchulee8

Differential Revision: D83246522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163889
Approved by: https://github.com/jansel
2025-09-26 23:29:49 +00:00
f9095fb285 [Windows] Update libuv version from 1.39 to 1.51 (#160318)
Fixes: [#148315](https://github.com/pytorch/pytorch/issues/148315)

The PR updates `libuv` version as `conda-forge` channel doesn't contain `libuv=1.39` for Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160318
Approved by: https://github.com/iremyux, https://github.com/malfet
2025-09-26 23:29:21 +00:00
a0136f149c [MPS] Fix nan behavior in grid_sampler_3d (#163881)
Fixes #163851
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163881
Approved by: https://github.com/malfet
2025-09-26 23:08:00 +00:00
62b0ebd8f9 [MPS] [Sparse] unique_dim and sparse broadcast (#163694)
Implements unique_dim, sparse broadcast ops and adds dtypes for mps for tests where we expect to fail, otherwise they would always fail due to being run in double precision

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163694
Approved by: https://github.com/malfet
2025-09-26 23:03:13 +00:00
19f16a65b4 [torchfuzz] Add support for fuzz templates (#163890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163890
Approved by: https://github.com/pianpwk
ghstack dependencies: #163743, #163812
2025-09-26 22:51:45 +00:00
0ebfa3d7d2 Avoid fast path mask left-align check in compiled TransformerEncoder (#163773)
Fixes #163640

This PR avoids a mask left align check in the case that we're operating under torch.compile / torch.export. Originally, I planned to make a more invasive change to auto-disable the fast path entirely underneath torch.compile / torch.export, but I realized during testing that the fast path wasn't actually causing compile issues outside of the narrow issue identified here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163773
Approved by: https://github.com/mikaylagawarecki
2025-09-26 22:29:37 +00:00
eqy
0ea10f9912 [cuDNN][conv][64-bit] Disable cuDNN for 64-bit depthwise convs again (#163171)
test is breaking, will check if there's an older version that we can enable on to avoid completely dropping support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163171
Approved by: https://github.com/ngimel, https://github.com/malfet
2025-09-26 22:12:17 +00:00
48a852b7ae [AOTI] Update AOTInductor tutorial (#163808)
Summary: Remove the BC breaking warning. Add inductor_config to the example code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163808
Approved by: https://github.com/yushangdi
2025-09-26 22:01:31 +00:00
f1260c9b9a [ROCm][CI/CD] upgrade nightly wheels to ROCm 7.0 (#163937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163937
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-26 21:42:09 +00:00
28c7d11428 [AOTI] Pass in shape_env for get_stride_order (#163925)
Summary:
As titled.
Without the diff, we got P1963055009

With the diff passing in the enviroment, we can do correct sym_int deduction:
https://fburl.com/mlhub/p5zy7o28

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints -- test_sdfpa_unbacked_strides --print-passing-details --env TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 --env TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(u0, 0)"
```
Without the fix: P1964887260
With the fix: P1964888579

Differential Revision: D83211018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163925
Approved by: https://github.com/ColinPeppler
2025-09-26 21:10:03 +00:00
a60c6ed99f [DeviceMesh][ez] Extract the pg creation as a util function (#163930)
This is just to extract common logic into a util function because we will use it many times for the following stack of Device Mesh refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163930
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288, #163928
2025-09-26 20:42:58 +00:00
c257570e6c [inductor] require shape in TritonCSEVariable (#162275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162275
Approved by: https://github.com/mlazos
2025-09-26 20:41:12 +00:00
2f85de0b42 Fix preserve annotation with decomp (#163896)
If we use `fx_traceback.preserve_node_meta()`, we will have a few extra node.meta fields on nodes, such as "seq_nr", added from `fx/proxy.py`. As a result, there might be non-empty node.meta on graph nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163896
Approved by: https://github.com/SherlockNoMad, https://github.com/ydwu4
2025-09-26 20:28:47 +00:00
e21b037756 Add tests for aot_export_joint_with_descriptors annotation (#163893)
As title, test

1) Annotation works with aot_export_joint_with_descriptor API
2) Annotation works with the 2 step "strict export.export + aot_export_joint_with_descriptor"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163893
Approved by: https://github.com/SherlockNoMad
2025-09-26 19:25:44 +00:00
f8c7505855 [inductor] Fix unbounded number of substitutions when equality checks contain Max expr (#163685)
## Issue

From an internal use case, we found that if we have an equality rule like:

```
Max(15, u0) == s0 * Max(15, u0)
```

This would lead to wrong substitution rule being generated in the substitution table, the result would be the process got stuck in the substitution loop as if it hangs indefinitely, as it's doing the following substitutions:

```
Max(15, u0)
--> s0 * Max(15, u0)
--> s0 ** 2 * Max(15, u0)
--> s0 ** 3 * Max(15, u0)
--> s0 ** 4 * Max(15, u0)
...
```

The root cause is with SymPy expression comparison: as `Max` is [not inside the op class table](https://github.com/sympy/sympy/blob/1.14/sympy/core/basic.py#L50-L86), it'll take the [UNKNOWN](https://github.com/sympy/sympy/blob/1.14/sympy/core/basic.py#L120) order, and considered bigger than any other types of expressions.

## Fix
1. Added a breaking-out from the substitution while-loop to warn about any exccessive substitutions, what threshold should be used here and how to pass it are open to suggestion, using a hard-coded static value to be simple for now
2. Enhanced the sympy expression comparison logic, so that we first check if one expr "has" the other one or not, to help work around the issue with `Max` here

## Testing

- with the unittiest alone --> unittest stuck
- with the unittest and while-loop breakout, we could see tests finished with warning "**Substitution limit reached**":
```
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_unbounded_expr_substitutions_cpu W0923 13:00:37.864000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:70] +============================+
W0923 13:00:37.864000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:71] |     !!!   WARNING   !!!    |
W0923 13:00:37.865000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:72] +============================+
W0923 13:00:37.865000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:73] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [5.6947s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_unbounded_expr_substitutions_cuda W0923 13:00:39.633000 46140 /data/users/q1l1/pytorch/torch/_inductor/sizevars.py:765] [0/0] Substitution limit (30) reached w/ u1**30*Max(15, u0)
W0923 13:00:39.679000 46140 /data/users/q1l1/pytorch/torch/_inductor/sizevars.py:765] [0/0] Substitution limit (30) reached w/ 64*u1**30*Max(15, u0)
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('async_compile_cache_miss', 1)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [5.6278s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleMps::test_unbounded_expr_substitutions_mps SKIPPED [0.0002s]

============================ 2 passed, 1 skipped, 870 deselected in 19.66s ============================
```

- with the unittest + comparison logic enhanced, we don't see the warning any more:
```
Running 3 items in this shard

test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_unbounded_expr_substitutions_cpu W0923 13:15:39.560000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:70] +============================+
W0923 13:15:39.561000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:71] |     !!!   WARNING   !!!    |
W0923 13:15:39.561000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:72] +============================+
W0923 13:15:39.562000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:73] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [6.6093s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_unbounded_expr_substitutions_cuda stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('async_compile_cache_miss', 1)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [6.0502s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleMps::test_unbounded_expr_substitutions_mps SKIPPED [0.0002s]

============================ 2 passed, 1 skipped, 870 deselected in 21.99s ============================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163685
Approved by: https://github.com/jansel
2025-09-26 18:46:36 +00:00
425ea90f95 [testing] Add test owner labels for some cuda? tests (#163296)
I am trying to give some test files better owner labels than `module: unknown`.  I am not sure them, but they seem pretty reasonable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163296
Approved by: https://github.com/eqy, https://github.com/msaroufim
2025-09-26 18:26:56 +00:00
5b764267f4 [testing] Add test owner labels for some distributed tests (#163174)
I am trying to give some test files better owner labels than `module: unknown`.  I am not sure them, but they seem pretty reasonable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163174
Approved by: https://github.com/ezyang
2025-09-26 18:19:04 +00:00
50c0550f5a Add magic TORCH_MAKE_PYBIND_ENUM_FASTER macro (#163527)
See comment on the macro definition. In short, pybind11 3.x
added `py::native_enum`, and also had to add overhead for that new way
to bind enums on the critical path for calling functions that take
regular old `py::enum_`s as arguments (for example, `__eq__`).

Differential Revision: [D82873169](https://our.internmc.facebook.com/intern/diff/D82873169/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163527
Approved by: https://github.com/ezyang
2025-09-26 17:59:22 +00:00
d7491fb1c1 Fix tensor creation with empty names crash (#163957)
Partially fixes #148324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163957
Approved by: https://github.com/malfet, https://github.com/janeyx99
2025-09-26 17:41:00 +00:00
9534c59311 [Inductor] address comments from https://github.com/pytorch/pytorch/pull/163803 (#163901)
Summary: address comments from https://github.com/pytorch/pytorch/pull/163803

Differential Revision: D83291637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163901
Approved by: https://github.com/desertfire
2025-09-26 17:18:44 +00:00
5880996b4c Expose torch.nn.utils.parametrize (#163835)
`torch.nn.utils.parametrize` is not imported from `torch/nn/utils/__init__.py`, thus is not exposed and make it hard for code editors to statically analyze the code and provide auto-completion based on the function signature.

<img width="615" height="292" alt="Screenshot 2025-09-25 at 12 01 52 PM" src="https://github.com/user-attachments/assets/a276f6f0-87f3-4732-943d-2a92ea871974" />

after the fix:

<img width="964" height="393" alt="Screenshot 2025-09-25 at 12 02 16 PM" src="https://github.com/user-attachments/assets/ca47f09e-dc4e-4420-a2d2-11669e07471a" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163835
Approved by: https://github.com/albanD
2025-09-26 16:38:18 +00:00
1d26eb0fcc Move inductor.aot_compile to use new tracer (#163137)
Differential Revision: [D82603768](https://our.internmc.facebook.com/intern/diff/D82603768)

I feel no one probably uses this API now but still useful path for more test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163137
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #163136
2025-09-26 15:54:24 +00:00
a05f6ecfec Fix bug with renaming submodules in dynamo for new tracer (#163136)
Differential Revision: [D82603767](https://our.internmc.facebook.com/intern/diff/D82603767)

Previously, i forgot to add handle call_module case which now will have export_root prepended to their names. Basically i want to clean up sth like:
```
graph():
      %l_self_export_root_sub_mod = call_module[target=l_self_export_root_sub_mod](%x, %y)
      %l_self_export_root_sub_mod_1 = call_module[target=l_self_export_root_sub_mod](%x, %y)
  ```

Dynamo graph can have call_module nodes that have messed up name due to our wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163136
Approved by: https://github.com/avikchaudhuri
2025-09-26 15:54:24 +00:00
c106ee8515 [FakeTensor] Supplement the relevant logic for converting conv1d to conv2d in meta_conv (#160408)
## Fixes https://github.com/pytorch/pytorch/issues/159462 also fixes #163569 , #163604

## summary
the issue is caused by the wrong stride of conv1d's result generated by meta_conv:
4d5b3f2d5a/torch/_meta_registrations.py (L2453-L2471)

and the wrong stride will be used to codegen size assert in inductor:
4d5b3f2d5a/torch/_inductor/ir.py (L6152-L6163)

## reason
So why the computed stride is wrong in the meta_conv function? because the corresponding backend will convert conv1d to conv2d and change the input tensor' size and memory_format(channel last). but the meta_conv do not do this transformation, so a mismatch happend.
4d5b3f2d5a/aten/src/ATen/native/Convolution.cpp (L1502-L1510)
 just add corresponding logic in meta_conv.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160408
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/mlazos
2025-09-26 15:45:02 +00:00
8aba513506 [MPS] test sparse add MPS dtypes so we get proper expected failure (#163951)
Adds dtypeIfMPS so if op is supported we get proper error like unexpected success. Before we would never get unexpected success because tests were run in torch.double dtype which will always fail on MPS due to it not supporting the dtype
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163951
Approved by: https://github.com/malfet
2025-09-26 14:48:58 +00:00
8c194a367e [DeviceMesh][ez] Add a type alias for backend config (#163928)
Create a type alias for `tuple[Optional[str], Optional[C10dBackend.Options]]` since it is too long.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163928
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288
2025-09-26 14:46:53 +00:00
33f3413bd3 [WIP][precompile] Set fake_mode of base tensor in fx graph pickler (#163738)
Summary:
When unpickling a fake tensor in fx graph pickler. It only sets the fake mode of the current tensor's metadata to the one that is consistent with pickler's `unpickle_state`. However, it doesn't set the fake mode of a tensor's base tensor when that tensor is a view.

This will cause an issue when dumping and loading the following graph
```
class GraphModule(torch.nn.Module):
    def forward(self, s77: "Sym(s77)", L_x_: "f32[s77, 8]"):
        l_x_ = L_x_
        chunk = l_x_.chunk(2, dim = -1);  l_x_ = None
        y: "f32[s77, 4]" = chunk[0];  chunk = None
        y_repeat: "f32[s77, 8]" = y.repeat_interleave(2, dim = -1);  y = None
        return (y_repeat,)
```
because `repeat_interleave` will create an intermediate fake tensor of size `[s77, 2, 4]` and it will become the base of the node `y_repeat`'s `meta['val']`.

This causes issues during the deserialization phase when applying AOT precompile to DeepSeek in vLLM.

Test Plan:
This has been tested in vLLM with DeepSeek.

As for unittest, ideally it should be `test_aot_compile_repeat_interleave` with mark_dynamic turned on. However, that's leading to some other pickle issues.

```
python test/dynamo/test_aot_compile.py -k test_aot_compile_repeat_interleave
```

I have yet to figure out a more appropriate unittest. But a proof-of-concept demo would be the following:
```
import inspect
import sympy
import torch
from torch.fx._graph_pickler import GraphPickler
from torch.fx.experimental.symbolic_shapes import ShapeEnv
from torch._subclasses import FakeTensorMode
from torch.fx._graph_pickler import GraphPickler, Options
from unittest.mock import patch

class M(torch.nn.Module):
    def forward(self, x):
        chunk = x.chunk(2, dim=-1)
        y = chunk[0]
        y_repeat = y.repeat_interleave(2, dim=-1)
        return y_repeat

def my_custom_backend(gm, example_inputs):
    global gm_global
    gm_global = gm
    return gm.forward

m = M()
m_opt = torch.compile(m, backend=my_custom_backend, fullgraph=True)

sample_inputs = (torch.randn(2, 8),)
torch._dynamo.mark_dynamic(sample_inputs[0], [0])
opt_out = m_opt(*sample_inputs)

graph_reducer_override = GraphPickler.reducer_override

def _graph_reducer_override(self, obj):
    if (inspect.isclass(obj) and issubclass(obj, sympy.Function)
            and hasattr(obj, "_torch_unpickler")):
        return obj._torch_unpickler, (obj._torch_handler_name, )
    if isinstance(obj, FakeTensorMode):
        return type(None), ()
    return graph_reducer_override(self, obj)

with patch.object(GraphPickler, "reducer_override", _graph_reducer_override):
    pickled_gm = GraphPickler.dumps(gm_global, Options(ops_filter=None))

fake_mode = FakeTensorMode(shape_env=ShapeEnv())
loaded_gm = GraphPickler.loads(pickled_gm, fake_mode)
```

Differential Revision: D83112599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163738
Approved by: https://github.com/zhxchen17
2025-09-26 14:36:37 +00:00
d4e4f70768 Fix overflow in slow_conv3d when kernel size is too large. (#162718)
Also, adding check for padding to avoid segmentation fault caused by overflow.

Fixes #141846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162718
Approved by: https://github.com/jgong5, https://github.com/Skylion007
2025-09-26 13:39:29 +00:00
bfd21cd3e6 Revert "Add less warps config to inner reductions (#162447)"
This reverts commit 768361e67f0eb36491d7b763ef38d7c928ebefe6.

Reverted https://github.com/pytorch/pytorch/pull/162447 on behalf of https://github.com/PaulZhang12 due to failed to land internally ([comment](https://github.com/pytorch/pytorch/pull/162447#issuecomment-3338680532))
2025-09-26 13:16:04 +00:00
7441a1b9b1 Update ruff to 0.13.1 (#163744)
Update ruff to 0.13.1 so that we can remove `UP038` from `pyproject.toml` because it has been removed from supported rules of ruff.
There are some fixes, the most notable one is [(PYI059)](https://docs.astral.sh/ruff/rules/generic-not-last-base-class/#generic-not-last-base-class-pyi059)
```
Checks for classes inheriting from typing.Generic[] where Generic[] is not the last base class in the bases tuple.

```

A BC-breaking change is introduced to change the typing of `OrderedSet .storage`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163744
Approved by: https://github.com/Skylion007, https://github.com/jingsh
2025-09-26 10:12:21 +00:00
6a2bd1f4ee [inductor] skip bmm when converting channel last (#159459)
Workaround of #159458 by remove some nodes output channel last set

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159459
Approved by: https://github.com/etaf, https://github.com/eellison, https://github.com/shunting314
2025-09-26 09:11:40 +00:00
4783e3ff49 Update torch-xpu-ops commit pin (#163758)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@229e8b](229e8ba104), includes:

- Revert tracking of Work status for FlightRecorder in ProcessGroupXCCL to fix memory leak
- Enable SYCL warnings on Linux
- Fix accuracy issues with CTC loss
- Enable aten::nonzero_static on XPU backend
- Stop recursive calculations in polynomial kernels if tensor has NaNs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163758
Approved by: https://github.com/EikanWang
2025-09-26 09:05:08 +00:00
c8e5b7dabb Add SDPA patterns for T5 variants when batch size is 1 (#163252)
As mentioned in
https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/fuse_attention.py#L838, this PR generates patterns  for the cases batch size == 1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163252
Approved by: https://github.com/Valentine233, https://github.com/jansel
2025-09-26 08:50:06 +00:00
04b51499f7 [CPU] Support transpose and packing fusion for bit8 (#163233)
To be used by CPU INT8 SDPA in TorchAO https://github.com/pytorch/ao/pull/3025. This change has a kernel improvement of about 9%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163233
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-09-26 07:15:04 +00:00
54461a53bd [Inductor] Check if profiling before using record_function in CompiledFxGraph (#163747)
The call to `record_function` adds overhead even if profiling is disabled, which can as much as double the total runtime overhead of a compiled function. #163566 aims to make `record_function` more efficient, but doesn't fully eliminate overhead. This change adds a check if profiling is active before using `record_function`, which avoids this issue all together.

`TestExecutionTrace.test_execution_trace_with_pt2` in https://github.com/pytorch/pytorch/blob/main/test/profiler/test_execution_trace.py#L372 already checks that the `record_function` region is tracked during profiling.

Comparison of the `benchmarks/dynamo/microbenchmarks/overheads.py ` results:

Before Change:
```
requires_grad=False
compiled 56.9us (warmup=10.7s)

requires_grad=True
compiled 99.4us (warmup=0.2s)

inference_mode()
compiled 55.7us (warmup=0.1s)
```

After Change:
```
requires_grad=False
eager    6.9us (warmup=0.0s)
compiled 23.9us (warmup=22.3s)

requires_grad=True
eager    8.7us (warmup=0.0s)
compiled 56.8us (warmup=0.1s)

inference_mode()
eager    6.3us (warmup=0.0s)
compiled 22.2us (warmup=0.1s)
```

Additionally, #163866 introduces an instruction count benchmark. Because that is not merged and activated yet, here is a comparison:

Before Change:
```
runtime_overhead_inductor,instruction_count,222645
runtime_overhead_inductor_inference_mode,instruction_count,234998
runtime_overhead_inductor_requires_grad,instruction_count,293556
runtime_overhead_inductor_requires_grad_backward,instruction_count,78181
runtime_overhead_inductor_dynamic,instruction_count,234870
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,248711
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,309979
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77599
```

After Change:
```
runtime_overhead_inductor,instruction_count,149997
runtime_overhead_inductor_inference_mode,instruction_count,163397
runtime_overhead_inductor_requires_grad,instruction_count,220722
runtime_overhead_inductor_requires_grad_backward,instruction_count,78276
runtime_overhead_inductor_dynamic,instruction_count,161177
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,175495
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,235674
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77475
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163747
Approved by: https://github.com/mlazos, https://github.com/anijain2305
2025-09-26 06:49:40 +00:00
d1403250c9 Fix specialize_impl from triton.runtime.jit (#163844)
Summary:
In https://github.com/triton-lang/triton/pull/7771/ , create_specialize_impl is removed. We extend the support using native_specialize_impl.

Otherwise, PyTorch won't work with trunk triton.

Test Plan:
scripts/lufang/llm/launch_qwen3_vl_235b_a22b_thinking_2507_h100.sh

No more error message like
```
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Traceback (most recent call last):
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 924, in identify_mutated_tensors
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     ttir_module, ordered_tensor_names = generate_ttir(
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 419, in generate_ttir
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     specialization = _get_specialization(ordered_args.values())
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 390, in _get_specialization
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     from triton.runtime.jit import specialize_impl as specialize_impl_orig
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] ImportError: cannot import name 'specialize_impl' from 'triton.runtime.jit' (/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inf
erence_platform_sp/llm_predictor_gpu/__service__/service#link-tree/triton/runtime/jit.py)
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Traceback (most recent call last):
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 924, in identify_mutated_tensors
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     ttir_module, ordered_tensor_names = generate_ttir(
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 419, in generate_ttir
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     specialization = _get_specialization(ordered_args.values())
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 390, in _get_specialization
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     from triton.runtime.jit import specialize_impl as specialize_impl_orig
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] ImportError: cannot import name 'specialize_impl' from 'triton.runtime.jit' (/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inf
erence_platform_sp/llm_predictor_gpu/__service__/service#link-tree/triton/runtime/jit.py)
(Worker_TP5_EP5 pid=190359) [rank5]:W0924 23:24:48.216000 190359 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
(Worker_TP5_EP5 pid=190359) [rank5]:W0924 23:24:48.216000 190359 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Traceback (most recent call last):
(Worker_TP5_EP5 pid=190359) [rank5]:W0924 23:24:48.216000 190359 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 924, in identify_mutated_tensors
```

Differential Revision: D83229128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163844
Approved by: https://github.com/henryoier, https://github.com/davidberard98, https://github.com/BoyuanFeng
2025-09-26 06:37:26 +00:00
b42e81def5 Allow unbacked to unbacked replacements if rhs unbacked symbols are all inputs (#163652)
This partially solve the issue https://github.com/pytorch/pytorch/issues/163641. We do not need to ban unbacked to unbacked replacement if all rhs symbols are inputs since we know those symbols are seen by the whole program.

This issue was found as i was tracing some vllm models with unbacked, namely  Qwen/Qwen2-1.5B-Instruct it makes reasoning logic easier to do those replacements.

as for data dependent similar pattern, I am thinking to create a set of replacements that we apply only during static eval
instead of none. to make reasoning better.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163652
Approved by: https://github.com/bobrenjc93
2025-09-26 06:23:22 +00:00
2a45f30ae7 Exporting aten.conv with cuda under fake mode on a cuda-less machine (#163912)
Summary:
Improve op coverage of exporting a CUDA model on a CPU-only machine under fake tensor mode.

For `torch.nn.functional.conv2d`, it will `_select_conv_backend` based on input and weight shapes.

When calling into `supportsDepthwiseConvolutionWithCuDNN()`, it calls `at::cuda::getCurrentDeviceProperties()` and fails on a CPU-only machine.

So we check if CUDA is actually enabled first.

Test Plan: TORCH_SHOW_CPP_STACKTRACES=1 buck2 run fbcode//caffe2/test:test_export -- --r nn_functional_conv2d

Reviewed By: angelayi, henryoier

Differential Revision: D80562984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163912
Approved by: https://github.com/SherlockNoMad
2025-09-26 06:04:20 +00:00
11b4c0eb9e [aoti] Save compute information (#163792)
Metadata looks like:
```
{
  'AOTI_DEVICE_KEY': 'cpu',
  'AOTI_PLATFORM': 'linux',
  'AOTI_MACHINE': 'x86_64',
  'AOTI_CPU_ISA': 'AVX512',
  'AOTI_COMPUTE_CAPABILITY': '90'
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163792
Approved by: https://github.com/yushangdi, https://github.com/desertfire
ghstack dependencies: #163779
2025-09-26 05:40:44 +00:00
fb93491ddc [aoti] Load metadata w/o loading package (#163779)
Add a function to load the metadata stored in aoti without needing to load the .so. This can be used to store what platform we are compiling the .so on which we can check before loading the .so.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163779
Approved by: https://github.com/yushangdi, https://github.com/desertfire
2025-09-26 05:40:44 +00:00
39df24fe04 [Code Clean] Replace std::runtime_error with TORCH_CHECK (#163610)
Including:
- `torch/csrc/instruction_counter`
- `torch/csrc/lazy`
- `torch/csrc/monitor`
- `torch/csrc/profiler`
- `torch/csrc/dynamo`

Fixes part of #148114

Personal mistake about (PR #163317), this PR does the same thing **and PR #163317 has already been approved by @albanD.**

This is a personal mistake on my part, and I'm so sorry about that. Hope you won't mind @albanD. 🥹

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163610
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-26 04:52:48 +00:00
bbde16fe98 [vllm hash update] update the pinned vllm hash (#163823)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163823
Approved by: https://github.com/pytorchbot
2025-09-26 04:29:52 +00:00
1b78ca2ef5 [Triton] [Inductor] Prune template selection based on decompose_k (#163781)
Summary:

Triton templates tend to perform very poorly on large K, hence the introduction of decompose_k. As a result, when decompose_k is selected will disable exploring the Triton templates. We may want to consider an override in the future.

Note: Based on the timing results it may be desirable to better refine/prune the decompose k decisions.

Testing:

Tested by looking at the autotune/compilation time using a single shape in TritonBench.
`TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 python run --op gemm --rep 1000 --sleep 1.0 --m 512 --n 512 --k 300000 --only pt2_matmul_maxautotune`
Before this change:
`SingleProcess AUTOTUNE benchmarking takes 13.5368 seconds and 0.1595 seconds precompiling for 38 choices`
With this change:
`SingleProcess AUTOTUNE benchmarking takes 9.9626 seconds and 0.0020 seconds precompiling for 11 choices`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163781
Approved by: https://github.com/eellison, https://github.com/PaulZhang12
2025-09-26 04:09:35 +00:00
082eaf4aae [DeviceMesh] Add extra check in flatten result cache lookup (#163288)
while refactoring DeviceMesh bookkeeping, we found that there is one corner case which we just don't check whether the dims to be flattened into is same as the dims which an existing flattened name maps to. So we need to add extra cases in the unit test and extra check logic in the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163288
Approved by: https://github.com/wz337, https://github.com/ezyang, https://github.com/fegin
ghstack dependencies: #163212
2025-09-26 03:41:58 +00:00
f1f2e3e4da [DeviceMesh] Introduce CuTe layout into devicemesh code base for internal bookkeeping (#163212)
DeviceMesh essentially is a way to specify how devices interact with each other or device layout. They are all integers but because they can have various shapes and meshes, it make internal bookkeeping internally way more challenging. Currently our internal bookkeeing inside DeviceMesh is not scalable, so in order to support new functions like `_unflatten`, we need to introduce very complicated logics inside DeviceMesh as pointed out per comment (https://github.com/pytorch/pytorch/pull/159482/files#r2256025452). So thanks to @lw 's suggestion and PoC PR (https://github.com/pytorch/pytorch/pull/160429), we realize that by leveraging CuTe layout algebra([ref](https://docs.nvidia.com/cutlass/media/docs/cpp/cute/02_layout_algebra.html)) from Cutlass will greatly simply our internal mechanical bookkeeping for and make the abstraction ops way easier on top of it. So to make things go incrementally, we propose couple steps here https://github.com/pytorch/pytorch/issues/160337#issuecomment-3195106243.

On top of what we have been doing about PyCute we want to continue add methods into the wrapper class so that we can get rank indexes needed for ProcessGroup Creation with a layout object. We also added detailed explanations and comments (thanks to llm) and unit test to show case the code indeed is working as expected.

More PRs are on the way.

This is a continue of https://github.com/pytorch/pytorch/pull/161016 (originally messed with EasyCLA)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163212
Approved by: https://github.com/ezyang, https://github.com/fegin, https://github.com/lw
2025-09-26 03:32:19 +00:00
67cc0e0ac9 Add Static Dispatch Kernels (#163676) (#163870)
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1951

X-link: https://github.com/pytorch/FBGEMM/pull/4927

Add a few missing static dispatch kernels for remote_ro.

Test Plan: Tested with scripts in D83028841.

Differential Revision: D83258808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163870
Approved by: https://github.com/henryoier
2025-09-26 03:00:07 +00:00
bbf8aa43ef [a2av] Separate in/out splits into two tensors (#163837)
Old signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor(a!) in_out_splits, str group_name)`
New signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name)`

i.e. split `in_out_splits` into IN tensor and OUT tensor so that we can define the TORCH_LIBRARY signature better.
Also to be in line with the 2D version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163837
Approved by: https://github.com/fduwjj
ghstack dependencies: #163886
2025-09-26 01:03:54 +00:00
5daa79fd6e Remove dataclass_slots (#163623)
`dataclass` now has `slots` kwarg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163623
Approved by: https://github.com/Skylion007
2025-09-26 00:54:42 +00:00
b776e0c71e [ROCm][CI/CD] create ROCm 7.0 magma tarball (#163883)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163883
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-26 00:51:17 +00:00
5c2f09d1f9 [export] _detect_attribute_assignment gives warning instead of raising ValueError (#163809)
Summary:
LSTM was not exportable with non-strict export as it failed at `_detect_attribute_assignment`

This is because the `_flat_weights` attribute in LSTM is a list of registered parameters and will be updated by the `_update_flat_weights` method in `forward`.

However, in `_detect_attribute_assignment`, we manually restore the state of the module by `mod.__dict__.update(snapshot)`. Therefore, it should be fine to turn the `ValueError` into a warning so that RNN models are exportable with non-strict export.

Added test to verify that there is no lifted tensor constant and no fake tensor leakage.

Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_export_rnn_variants_with_warning

Differential Revision: D83196971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163809
Approved by: https://github.com/tugsbayasgalan
2025-09-26 00:43:29 +00:00
b4be380480 [ROCm] Implement float32 copy kernel (#163869)
* Add `float32_copy_kernel` for vectorizing float16/bfloat16 to float32 conversion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163869
Approved by: https://github.com/jeffdaily
2025-09-26 00:39:30 +00:00
5b8fef3f17 Extend triton_mm auto-tune options for HIM shapes (#163273)
Summary:
Add an option to auto-tune for shape:
```
M=1024 N=171712 K=1024
```

Test Plan:
```
TRITON_PRINT_AUTOTUNING=1 buck2 run mode/opt-amd-gpu -c fbcode.enable_gpu_sections=true //pytorch/tritonbench:run -- --op fp8_gemm_rowwise --no_use_tma --no_use_persistent --m 1024 --n 171712 --k 1024 --bias
```
Before:
 {F1982074581}
After, saw 10%~ boost:
{F1982074585}

Differential Revision: D82687336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163273
Approved by: https://github.com/jananisriram, https://github.com/Camyll
2025-09-26 00:05:57 +00:00
ff2f319e6e [MPS] Fix conv layout handling (#162776)
What started as simple fix for `mps_convolution_backward_input` resulted in a pretty significant refactor/fixes:
- Updated `mps_conv_use_channels_last` to return channels last output if either input or weights are channels last
- Use the same primitive throughout `Convolution.mm` to determine wether output should be allocated in channels last format or not

But doing only those two, resulted in crash in `test_memory_format_nn_Conv2d_mps_float32`, when weights were backward, and bias is present:
```
% python -c "import torch;print(torch.nn.functional.conv2d(torch.rand(2, 4, 3, 4,device='mps'), torch.rand(5, 4, 3, 3,device='mps').to(memory_format=torch.channels_last), torch.rand(5,device='mps')))"
/AppleInternal/Library/BuildRoots/4~B5E4ugDCh2RsPWAjMEoPu8LC5w1yXEwd7XweDhg/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:3619: failed assertion `Error: MLIR pass manager failed'
zsh: abort      python -c
```

Which requires a more thorough redesign/cleanup, namely:
- Do not alter the layout based on MacOS version, but rather do additional copies on MacOS-14 if inputs/output or weight are in channels last format ( done by defining `std::optional<Tensor> output_c;` that contains a contiguous copy of the output tensor
- Introduced `input_suggested_layout` which is set to ChannelsLast if and only if input is channels last and is running on MacOS-15+
- Delete unused `memory_layout` and `group` arguments from `fill_depthwise_conv_desc`
- Fix bias broadcasting logic for channels last

As result, in addition to adding one more regression test this change removes `expectedFailures` from:
- `TestModule.test_memory_format` for `Conv2d`, `ConvTranspose2d`, `LazyConv1d`, `LazyConvTranspose1d`
- `test_require_stride_expanded_dynamic_shapes`
-  `test_mutable_custom_op_fixed_layout2` for MacOS-14

Fixes https://github.com/pytorch/pytorch/issues/161905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162776
Approved by: https://github.com/Skylion007
2025-09-25 23:41:34 +00:00
94195a37ae [BE] Remove HermeticPyObjectTLS and Simplify PythonOpRegistrationTrampoline (#163464)
Removes HermeticPyObjectTLS as we no longer need since torch deploy is no longer supported. PythonOpRegistrationTrampoline is also drastically simplified as and being prepped for removal in a future PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163464
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-25 23:30:50 +00:00
suo
c58e096cd0 [DTensor] implement logsumexp (#163879)
as title, mostly copypasta from internal. I am a dtensor noob, so please scrutinize my added test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163879
Approved by: https://github.com/XilunWu
2025-09-25 23:08:30 +00:00
2a6e6a9e3b [FSDP][Replicate] tests replicate parity for shared parameters (#162836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162836
Approved by: https://github.com/mori360
ghstack dependencies: #162830
2025-09-25 23:08:22 +00:00
6e6c899347 [Reland][163423] Promote @requires_nvshmem instead of enable_triton (#163549)
#163423 was approved but reverted due to a revert of base.
Relanding without base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163549
Approved by: https://github.com/wdvr

Co-authored-by: Wouter Devriendt <wouterdevriendt@meta.com>
2025-09-25 23:02:00 +00:00
366961df78 [FSDP][Replicate] tests replicate parity with activation checkpointing (#162830)
**Summary:**  In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. This tests that replicate function works correctly when combined with activation checkpointing

**Test Case**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_with_activation_checkpointing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162830
Approved by: https://github.com/mori360
2025-09-25 22:57:00 +00:00
520fca82c8 Refactor Provenance Tracking (#163378)
Summary:
- Move the `provenance_level` flag check to inside the `set_kernel_post_grad_provenance_tracing` call to simply the code

- Move the `set_kernel_post_grad_provenance_tracing` call and `write_provenance_debug_handle` call to `codegen_comment`.

- If some `call_kernel` call sites don't have a proceeding `codegen_comment` call, add one. Now all `call_kernel` call sites are accompanied with a  `codegen_comment` call.

- Add a `codegen_comment` method to BaseScheduling and remove the noop `codegen_comment` method in Scheduling

- Remove `debug_handle` from `call_kernel`.

Test Plan:
CI

```
buck run @//mode/opt-split-dwarf fbcode//caffe2/test/inductor:provenance_tracing
```

Differential Revision: D82839271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163378
Approved by: https://github.com/angelayi
2025-09-25 22:55:59 +00:00
908bcfd403 [AOTInductor] Add input information for Triton Kernels in AOTI (#160380)
Summary:
We use record_function to pass in input information to let Kineto show
input information.

Test Plan:
Before:
<img width="459" height="582" alt="Screenshot 2025-09-19 at 10 45 10 AM" src="https://github.com/user-attachments/assets/baa0c251-86e9-49ca-8c6c-fcd2619f7f48" />

After:
<img width="473" height="1130" alt="Screenshot 2025-09-19 at 10 44 53 AM" src="https://github.com/user-attachments/assets/b7942d84-0362-4b9e-9232-14de92bbdd00" />

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160380
Approved by: https://github.com/desertfire
ghstack dependencies: #163593
2025-09-25 22:41:04 +00:00
96275dbf88 [CI] Fix test_triton_wait_until hang (#163886)
I don't know why `nvshmem_barrier_all_kernel`  leads the test to hang. Will investigate.
But since it is an unnecessary call here, I am removing it to unblock other PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163886
Approved by: https://github.com/fegin
2025-09-25 22:22:16 +00:00
b14a14a662 [torchfuzz] make generated code much more concise and cleaner (#163812)
```
import torch

torch._dynamo.config.capture_scalar_outputs = True
torch.manual_seed(42)

def fuzzed_program(arg_0, arg_1, arg_2):
    var_node_3 = arg_0 # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_4 = torch.full((1,), (-0.29262632146522655-0.7687848816195035j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_2 = torch.ops.aten.add(var_node_3, var_node_4) # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_6 = arg_1 # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_7 = arg_2 # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_5 = torch.ops.aten.add(var_node_6, var_node_7) # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_1 = torch.ops.aten.add(var_node_2, var_node_5) # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_0 = var_node_1.item() # dtype=complex128
    return var_node_0

arg_0 = torch.as_strided(torch.randn(1).to(torch.complex128), (1,), (1,))
arg_1 = torch.as_strided(torch.randn(1).to(torch.complex128), (1,), (1,))
arg_2 = torch.as_strided(torch.randn(1).to(torch.complex128), (1,), (1,))

args = (arg_0, arg_1, arg_2)
result_original = fuzzed_program(*args)
print(' eager success')
compiled_program = torch.compile(fuzzed_program, fullgraph=False, dynamic=True)
result_compiled = compiled_program(*args)
print(' compile success')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163812
Approved by: https://github.com/pianpwk
ghstack dependencies: #163743
2025-09-25 22:12:33 +00:00
92f7361e27 [DTensor] fix uneven _StridedShard (#163843)
Previous uneven `_StridedShard` in https://github.com/pytorch/pytorch/pull/150490 seems failing cases like sharding `tensor = torch.arange(6)` with FSDP 2, TP 2.

This PR attempts to reinvent `_StridedShard`.

I didn't test nested `_StridedShard`, because there shouldn't be any use cases. I think it will become quite messy when it comes to **nested uneven** `_StridedShard`. We are probably going to deprecate it anyway after @zpcore 's work https://github.com/pytorch/pytorch/pull/160266 on ordered sharding, so IMO not worth it to make it too general.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163843
Approved by: https://github.com/ezyang
2025-09-25 22:12:29 +00:00
6a6d838832 Add H100 runner to be recognized in actionlint (#163795)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163795
Approved by: https://github.com/huydhn, https://github.com/seemethere
2025-09-25 22:09:11 +00:00
183dca423f [Inductor] add a new config fallback_embedding_bag_byte_unpack (#163803)
Differential Revision: D82988783

introduce an inductor config fallback_embedding_bag_byte_unpack so we can have options to not let inductor decompose the op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163803
Approved by: https://github.com/henryoier
2025-09-25 22:07:04 +00:00
b8efa336d2 [torchfuzz] simplify codegen and runner (#163743)
much less code. a followup PR will make these repro files even smaller.
small is important since it reduces the time for users to understand
what the repro is doing. here's a sample:

```
(/home/bobren/local/a/pytorch-env) [21:34] devgpu009:/home/bobren/local/a/pytorch/tools/experimental/dynamic_shapes/torchfuzz [130] python fuzzer.py --seed 42
Running single fuzz_and_execute...
Using seed: 42, max_depth: 10
Running generated program...
Selected CUDA_VISIBLE_DEVICES=2
=== Program Output ===
 eager success
 compile success

===============================
=== Program Source ===
import torch
import sys
import os
fuzzer_dir = r'/home/bobren/local/a/pytorch/tools/experimental/dynamic_shapes/torchfuzz'
if fuzzer_dir not in sys.path:
    sys.path.insert(0, fuzzer_dir)
from tensor_fuzzer import fuzz_scalar, fuzz_tensor_simple, ScalarSpec, TensorSpec

def fuzzed_program(arg_0, arg_1, arg_2, arg_3, arg_4, arg_5, arg_6, arg_7, arg_8, arg_9, arg_10, arg_11, arg_12, arg_13, arg_14, arg_15, arg_16, arg_17, arg_18, arg_19, arg_20, arg_21, arg_22, arg_23, arg_24, arg_25, arg_26):
    # Node node_4: arg (depth 6)
    var_node_4 = arg_0 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_7: constant (depth 4)
    var_node_7 = torch.full((1,), (-0.8353595860703585-0.8384634248041143j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_8: arg (depth 4)
    var_node_8 = arg_1 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_6: tensor_pointwise (depth 5)
    var_node_6 = torch.ops.aten.mul(var_node_7, var_node_8) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_9: constant (depth 5)
    var_node_9 = torch.full((1,), (-0.32478860712861235+0.033909682598544454j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_5: tensor_pointwise (depth 6)
    var_node_5 = torch.ops.aten.mul(var_node_6, var_node_9) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_3: tensor_pointwise (depth 7)
    var_node_3 = torch.ops.aten.sub(var_node_4, var_node_5) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_11: arg (depth 6)
    var_node_11 = arg_2 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_18: constant (depth 0)
    var_node_18 = torch.full((1,), (0.12855308616305575+1.5268033634325642j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_19: arg (depth 0)
    var_node_19 = arg_3 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_17: tensor_pointwise (depth 1)
    var_node_17 = torch.ops.aten.mul(var_node_18, var_node_19) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_21: arg (depth 0)
    var_node_21 = arg_4 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_22: arg (depth 0)
    var_node_22 = arg_5 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_20: tensor_pointwise (depth 1)
    var_node_20 = torch.ops.aten.sub(var_node_21, var_node_22) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_16: tensor_pointwise (depth 2)
    var_node_16 = torch.ops.aten.add(var_node_17, var_node_20) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_25: arg (depth 0)
    var_node_25 = arg_6 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_26: arg (depth 0)
    var_node_26 = arg_7 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_24: tensor_pointwise (depth 1)
    var_node_24 = torch.ops.aten.add(var_node_25, var_node_26) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_27: constant (depth 1)
    var_node_27 = torch.full((1,), (-0.6315711191260084+1.342004076501214j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_23: tensor_pointwise (depth 2)
    var_node_23 = torch.ops.aten.mul(var_node_24, var_node_27) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_15: tensor_pointwise (depth 3)
    var_node_15 = torch.ops.aten.mul(var_node_16, var_node_23) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_28: constant (depth 3)
    var_node_28 = torch.full((1,), (1.064498531874825-0.37289464356501284j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_14: tensor_pointwise (depth 4)
    var_node_14 = torch.ops.aten.mul(var_node_15, var_node_28) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_30: arg (depth 3)
    var_node_30 = arg_8 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_32: arg (depth 2)
    var_node_32 = arg_9 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_33: constant (depth 2)
    var_node_33 = torch.full((1,), (1.5815627438573372+0.5124667911691704j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_31: tensor_pointwise (depth 3)
    var_node_31 = torch.ops.aten.div(var_node_32, var_node_33) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_29: tensor_pointwise (depth 4)
    var_node_29 = torch.ops.aten.div(var_node_30, var_node_31) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_13: tensor_pointwise (depth 5)
    var_node_13 = torch.ops.aten.div(var_node_14, var_node_29) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_39: arg (depth 0)
    var_node_39 = arg_10 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_40: constant (depth 0)
    var_node_40 = torch.full((1,), (-0.5987350493494642-0.5711360569376475j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_38: tensor_pointwise (depth 1)
    var_node_38 = torch.ops.aten.mul(var_node_39, var_node_40) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_41: arg (depth 1)
    var_node_41 = arg_11 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_37: tensor_pointwise (depth 2)
    var_node_37 = torch.ops.aten.add(var_node_38, var_node_41) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_42: constant (depth 2)
    var_node_42 = torch.full((1,), (0.7246044564672116-0.5930730980273312j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_36: tensor_pointwise (depth 3)
    var_node_36 = torch.ops.aten.mul(var_node_37, var_node_42) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_43: constant (depth 3)
    var_node_43 = torch.full((1,), (-0.7582976293117148+1.1880929376258396j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_35: tensor_pointwise (depth 4)
    var_node_35 = torch.ops.aten.mul(var_node_36, var_node_43) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_45: constant (depth 3)
    var_node_45 = torch.full((1,), (1.0896212896322774+0.3124038130417098j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_46: arg (depth 3)
    var_node_46 = arg_12 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_44: tensor_pointwise (depth 4)
    var_node_44 = torch.ops.aten.add(var_node_45, var_node_46) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_34: tensor_pointwise (depth 5)
    var_node_34 = torch.ops.aten.div(var_node_35, var_node_44) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_12: tensor_pointwise (depth 6)
    var_node_12 = torch.ops.aten.div(var_node_13, var_node_34) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_10: tensor_pointwise (depth 7)
    var_node_10 = torch.ops.aten.mul(var_node_11, var_node_12) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_2: tensor_pointwise (depth 8)
    var_node_2 = torch.ops.aten.div(var_node_3, var_node_10) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_48: constant (depth 7)
    var_node_48 = torch.full((1,), (-1.047745491289218+0.279447315087422j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_54: arg (depth 2)
    var_node_54 = arg_13 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_55: arg (depth 2)
    var_node_55 = arg_14 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_53: tensor_pointwise (depth 3)
    var_node_53 = torch.ops.aten.div(var_node_54, var_node_55) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_56: arg (depth 3)
    var_node_56 = arg_15 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_52: tensor_pointwise (depth 4)
    var_node_52 = torch.ops.aten.div(var_node_53, var_node_56) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_59: arg (depth 2)
    var_node_59 = arg_16 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_60: arg (depth 2)
    var_node_60 = arg_17 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_58: tensor_pointwise (depth 3)
    var_node_58 = torch.ops.aten.div(var_node_59, var_node_60) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_61: constant (depth 3)
    var_node_61 = torch.full((1,), (-0.7386327586576402-0.027025998767172658j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_57: tensor_pointwise (depth 4)
    var_node_57 = torch.ops.aten.add(var_node_58, var_node_61) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_51: tensor_pointwise (depth 5)
    var_node_51 = torch.ops.aten.sub(var_node_52, var_node_57) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_64: arg (depth 3)
    var_node_64 = arg_18 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_67: arg (depth 1)
    var_node_67 = arg_19 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_68: constant (depth 1)
    var_node_68 = torch.full((1,), (-0.6840241429755998+1.327637020136433j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_66: tensor_pointwise (depth 2)
    var_node_66 = torch.ops.aten.mul(var_node_67, var_node_68) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_69: arg (depth 2)
    var_node_69 = arg_20 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_65: tensor_pointwise (depth 3)
    var_node_65 = torch.ops.aten.sub(var_node_66, var_node_69) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_63: tensor_pointwise (depth 4)
    var_node_63 = torch.ops.aten.sub(var_node_64, var_node_65) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_70: arg (depth 4)
    var_node_70 = arg_21 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_62: tensor_pointwise (depth 5)
    var_node_62 = torch.ops.aten.sub(var_node_63, var_node_70) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_50: tensor_pointwise (depth 6)
    var_node_50 = torch.ops.aten.mul(var_node_51, var_node_62) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_76: constant (depth 1)
    var_node_76 = torch.full((1,), (1.864651314238342+0.27066487315113186j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_77: arg (depth 1)
    var_node_77 = arg_22 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_75: tensor_pointwise (depth 2)
    var_node_75 = torch.ops.aten.mul(var_node_76, var_node_77) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_78: arg (depth 2)
    var_node_78 = arg_23 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_74: tensor_pointwise (depth 3)
    var_node_74 = torch.ops.aten.add(var_node_75, var_node_78) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_79: arg (depth 3)
    var_node_79 = arg_24 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_73: tensor_pointwise (depth 4)
    var_node_73 = torch.ops.aten.mul(var_node_74, var_node_79) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_80: arg (depth 4)
    var_node_80 = arg_25 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_72: tensor_pointwise (depth 5)
    var_node_72 = torch.ops.aten.mul(var_node_73, var_node_80) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_82: constant (depth 4)
    var_node_82 = torch.full((1,), (1.6341547018841247+0.3096989611326181j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_84: constant (depth 3)
    var_node_84 = torch.full((1,), (0.9609065596935821+0.2920229825681946j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_85: arg (depth 3)
    var_node_85 = arg_26 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_83: tensor_pointwise (depth 4)
    var_node_83 = torch.ops.aten.add(var_node_84, var_node_85) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_81: tensor_pointwise (depth 5)
    var_node_81 = torch.ops.aten.sub(var_node_82, var_node_83) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_71: tensor_pointwise (depth 6)
    var_node_71 = torch.ops.aten.sub(var_node_72, var_node_81) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_49: tensor_pointwise (depth 7)
    var_node_49 = torch.ops.aten.mul(var_node_50, var_node_71) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_47: tensor_pointwise (depth 8)
    var_node_47 = torch.ops.aten.add(var_node_48, var_node_49) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_1: tensor_pointwise (depth 9)
    var_node_1 = torch.ops.aten.add(var_node_2, var_node_47) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_0: torch.ops.aten.item (depth 10)
    var_node_0 = var_node_1.item() # dtype=complex128

    # Final result from root node
    return var_node_0

arg_0 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10042)
arg_1 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10043)
arg_2 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10044)
arg_3 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10045)
arg_4 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10046)
arg_5 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10047)
arg_6 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10048)
arg_7 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10049)
arg_8 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10050)
arg_9 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10051)
arg_10 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10052)
arg_11 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10053)
arg_12 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10054)
arg_13 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10055)
arg_14 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10056)
arg_15 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10057)
arg_16 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10058)
arg_17 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10059)
arg_18 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10060)
arg_19 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10061)
arg_20 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10062)
arg_21 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10063)
arg_22 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10064)
arg_23 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10065)
arg_24 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10066)
arg_25 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10067)
arg_26 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10068)
import torch
import sys
torch._dynamo.config.capture_scalar_outputs = True

args = (arg_0, arg_1, arg_2, arg_3, arg_4, arg_5, arg_6, arg_7, arg_8, arg_9, arg_10, arg_11, arg_12, arg_13, arg_14, arg_15, arg_16, arg_17, arg_18, arg_19, arg_20, arg_21, arg_22, arg_23, arg_24, arg_25, arg_26)
result_original = fuzzed_program(*args)
print(' eager success')
sys.exit(1)
compiled_program = torch.compile(fuzzed_program, fullgraph=False, dynamic=True)
result_compiled = compiled_program(*args)
print(' compile success')

======================
Program exited with code: 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163743
Approved by: https://github.com/pianpwk
2025-09-25 21:42:22 +00:00
1cffa42d4d PyTorch histc fix for values with large magnitudes (#163506)
Summary:
The current implementation of the `histc` function on CPU doesn't take into account the nature of the floating point precision represenation when two numbers have very different magnitudes.

In the code of `histc` there is a following logic, which tries to fix an issue when automatically calculated `min` and `max` are identical:
```
if (leftmost_edge == rightmost_edge) {
        leftmost_edge -= 1;
        rightmost_edge += 1;
    }

...

TORCH_CHECK(leftmost_edge < rightmost_edge, "torch.histc: max must be larger than min");
```

But, not for all floating point values expanding the range exactly by 1 will give the representable result that is different from the original value.

The test code:

```
info = th.finfo(th.float32)
f_min = info.min

test_tensor = th.ones((224, 224), dtype=th.float64) * f_min
res = th.histc(test_tensor, bins=10)
```

Actual result:
```
RuntimeError: torch.histc: max must be larger than min
```

Expected result:
Everything should work fine.

NOTICE: If we set `f_min` just to small enough number, code works, which demonstrates the correct purpose of the possible range correction.

In short, `f_min + 1 == f_min` executes to true, since we reach the precision of the floating point prepresentation.
Please notice, this is not limitation of the float32 data type, since all computations happen in float64 (C++ data type `double`). The magnitudes are just different enough, that we reach the precision representation with simple approach of `+/-1`.

Interesting is that `histogram` function doesn't throw an exception, because edges range selection is implemented differently.

The fix we propose is to use `std::nextafter` which returns next representable floating point value starting from the current one in the direction of the lowest or max numbers. In theory, mathecmatically correct is to use this function without constrains, but to maintain backward compatibility in case if there is a code which relies on the current logic of `+/-1` offset we call `std::min` and `std::max` to pick the right representable value (i.e. for small floating point values the next representable value has step smaller than 1 for large values it's larger than 1).
We could stick to `histogram` implementation, but again, to avoid possible backward compatibility breaks, we decided to use the fix presented in this change.

*The real use case scenario:*
In our project we use the well-known transformer version from HuggingFace which fills up the buffer with float32 min (please note this is not a minimal value closer to 0, it's minimal absolute value which is often like `-max`).
The code where it sits is here:
https://github.com/huggingface/transformers/blob/v4.51.1/src/transformers/models/mimi/modeling_mimi.py#L1159

Switching to other version of the transformer will lead to other issues in our project and the bug which we fix here may appear in other projects and scenarios.

The real world problem appears when for such tensor the CPU version of the `histc` is called. In our usecase, it happens because this tensor is an input to the softmax activaiton function and as part of the quantisation the input parameter should go trough the observer as well. In our case the default Histogram observer is selected, which calls the `histc`.

Test Plan:
The simple test code snippet doesn't produce failure:
```
f_min = th.finfo(th.float32).min
test_tensor = th.ones((224, 224), dtype=th.float32) * f_min
th.histc(test_tensor, bins=10)
```

**Testing update:**
The `test_histc` has been updated accordingly.
Now when we have +INF as all values of the tensor, the previous representation of the floating number should be <max_float>, hence the assert message is changed from `[inf, inf]` to `[<max_float>|inf, inf]`.
The test also extended to check the assert message when tensor is filled with values -INF and with combination of (-INF, +INF).
The new regexp assert includes possible output as `inf` and any floating point number in scientific representation for one of the bin edges. We left `inf` as possible value due to possible difference in implementation between CPU and CUDA.

Differential Revision: D82955597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163506
Approved by: https://github.com/jermenkoo, https://github.com/malfet
2025-09-25 20:55:25 +00:00
ebfc87e303 Always produce kernel_info.json (#163715)
Summary: Always produce kernel_info.json so zoomer can use this json to populate GPU traces

Test Plan: CI

Differential Revision: D82762435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163715
Approved by: https://github.com/angelayi
2025-09-25 19:38:49 +00:00
21a41edd4f Add fake_impl for _native_multi_head_attention (#163700)
Test Plan: See added test in test_export.py

Differential Revision: D83099187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163700
Approved by: https://github.com/angelayi
2025-09-25 19:01:27 +00:00
7bad9c5a64 Revert "Update ruff to 0.13.1 (#163744)"
This reverts commit 3dd89a079f2b0c1d39351f98ff5d5ca882523152.

Reverted https://github.com/pytorch/pytorch/pull/163744 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/18016220484/job/51261729375 looks like a landrace with PR that updated min-version to 3.10 ([comment](https://github.com/pytorch/pytorch/pull/163744#issuecomment-3335534084))
2025-09-25 18:54:03 +00:00
151e66e50d Update documentation for torch.index_select (#163616)
Description said "entries in index which is a LongTensor" but index_select can accept an IntTensor as the parameter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163616
Approved by: https://github.com/jbschlosser

Co-authored-by: Joel Schlosser <75754324+jbschlosser@users.noreply.github.com>
2025-09-25 18:29:17 +00:00
b61bdc7cc4 Fix cpp build (#162774)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162774
Approved by: https://github.com/malfet, https://github.com/atalman
2025-09-25 18:21:45 +00:00
3dd89a079f Update ruff to 0.13.1 (#163744)
Update ruff to 0.13.1 so that we can remove `UP038` from `pyproject.toml` because it has been removed from supported rules of ruff.
There are some fixes, the most notable one is [(PYI059)](https://docs.astral.sh/ruff/rules/generic-not-last-base-class/#generic-not-last-base-class-pyi059)
```
Checks for classes inheriting from typing.Generic[] where Generic[] is not the last base class in the bases tuple.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163744
Approved by: https://github.com/Skylion007, https://github.com/jingsh
2025-09-25 17:52:35 +00:00
6539537a59 [ROCm][CD] create ROCm 7.0 images for binary builds (#163860)
Adds gfx950.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163860
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-25 17:26:40 +00:00
3cbfbbd691 [ROCm] Transformer/SDPA unit test parity (#163745)
## Major Changes

* Efficient Attention on ROCM requires last dimensions of input tensors align with 16 bytes.
  - Unlike FA, ME does not pad input tensors in `scaled_dot_product_attention` and hence this is required.
* Fix `atomic_counter` handling in varlen FA API
* Unskips a few unit tests.

Fixes #157120
Fixes #157121
Fixes #157122
Fixes #157167
Fixes #155217
Fixes #157043
Fixes #157060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163745
Approved by: https://github.com/jeffdaily
2025-09-25 17:14:19 +00:00
112e204797 Revert "[CUDA] Compare major version of the runtime device arch against the built version of the pytorch binary (#161299)"
This reverts commit 7163dce1e091cb5564c723110314bb372b5e81a8.

Reverted https://github.com/pytorch/pytorch/pull/161299 on behalf of https://github.com/nWEIdia due to Incorrectly suppressing useful warnings when running sm89 binary on sm86 ([comment](https://github.com/pytorch/pytorch/pull/161299#issuecomment-3335127621))
2025-09-25 17:13:32 +00:00
f9821b1be7 DebugMode supports_higher_order_operators=True (#163824)
Make DebugMode supports HOP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163824
Approved by: https://github.com/ydwu4
2025-09-25 17:11:43 +00:00
1317 changed files with 32797 additions and 12581 deletions

View File

@ -15,6 +15,8 @@ fi
# Compress the fatbin with -compress-mode=size for CUDA 13
if [[ "$DESIRED_CUDA" == *"13"* ]]; then
export TORCH_NVCC_FLAGS="-compress-mode=size"
# Bundle ptxas into the cu13 wheel, see https://github.com/pytorch/pytorch/issues/163801
export BUILD_BUNDLE_PTXAS=1
fi
SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"

View File

@ -13,49 +13,6 @@ def list_dir(path: str) -> list[str]:
return check_output(["ls", "-1", path]).decode().split("\n")
def build_ArmComputeLibrary() -> None:
"""
Using ArmComputeLibrary for aarch64 PyTorch
"""
print("Building Arm Compute Library")
acl_build_flags = [
"debug=0",
"neon=1",
"opencl=0",
"os=linux",
"openmp=1",
"cppthreads=0",
"arch=armv8a",
"multi_isa=1",
"fixed_format_kernels=1",
"build=native",
]
acl_install_dir = "/acl"
acl_checkout_dir = os.getenv("ACL_SOURCE_DIR", "ComputeLibrary")
if os.path.isdir(acl_install_dir):
shutil.rmtree(acl_install_dir)
if not os.path.isdir(acl_checkout_dir) or not len(os.listdir(acl_checkout_dir)):
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
]
)
check_call(
["scons", "Werror=1", f"-j{os.cpu_count()}"] + acl_build_flags,
cwd=acl_checkout_dir,
)
for d in ["arm_compute", "include", "utils", "support", "src", "build"]:
shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")
def replace_tag(filename) -> None:
with open(filename) as f:
lines = f.readlines()
@ -356,23 +313,17 @@ if __name__ == "__main__":
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "
if enable_mkldnn:
build_ArmComputeLibrary()
print("build pytorch with mkldnn+acl backend")
build_vars += (
"USE_MKLDNN=ON USE_MKLDNN_ACL=ON "
"ACL_ROOT_DIR=/acl "
"LD_LIBRARY_PATH=/pytorch/build/lib:/acl/build:$LD_LIBRARY_PATH "
"ACL_INCLUDE_DIR=/acl/build "
"ACL_LIBRARY=/acl/build "
)
build_vars += "USE_MKLDNN=ON USE_MKLDNN_ACL=ON "
build_vars += "ACL_ROOT_DIR=/acl "
if enable_cuda:
build_vars += "BLAS=NVPL "
else:
build_vars += "BLAS=OpenBLAS OpenBLAS_HOME=/OpenBLAS "
build_vars += "BLAS=OpenBLAS OpenBLAS_HOME=/opt/OpenBLAS "
else:
print("build pytorch without mkldnn backend")
os.system(f"cd /pytorch; {build_vars} python3 setup.py bdist_wheel")
os.system(f"cd /pytorch; {build_vars} python3 -m build --wheel --no-isolation")
if enable_cuda:
print("Updating Cuda Dependency")
filename = os.listdir("/pytorch/dist/")

View File

@ -299,40 +299,6 @@ def install_condaforge_python(host: RemoteHost, python_version="3.8") -> None:
)
def build_OpenBLAS(host: RemoteHost, git_clone_flags: str = "") -> None:
print("Building OpenBLAS")
host.run_cmd(
f"git clone https://github.com/xianyi/OpenBLAS -b v0.3.28 {git_clone_flags}"
)
make_flags = "NUM_THREADS=64 USE_OPENMP=1 NO_SHARED=1 DYNAMIC_ARCH=1 TARGET=ARMV8"
host.run_cmd(
f"pushd OpenBLAS && make {make_flags} -j8 && sudo make {make_flags} install && popd && rm -rf OpenBLAS"
)
def build_ArmComputeLibrary(host: RemoteHost, git_clone_flags: str = "") -> None:
print("Building Arm Compute Library")
acl_build_flags = " ".join(
[
"debug=0",
"neon=1",
"opencl=0",
"os=linux",
"openmp=1",
"cppthreads=0",
"arch=armv8a",
"multi_isa=1",
"fixed_format_kernels=1",
"build=native",
]
)
host.run_cmd(
f"git clone https://github.com/ARM-software/ComputeLibrary.git -b v25.02 {git_clone_flags}"
)
host.run_cmd(f"cd ComputeLibrary && scons Werror=1 -j8 {acl_build_flags}")
def embed_libgomp(host: RemoteHost, use_conda, wheel_name) -> None:
host.run_cmd("pip3 install auditwheel")
host.run_cmd(
@ -442,7 +408,7 @@ def build_torchvision(
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
host.run_cmd(f"cd vision && {build_vars} python3 setup.py bdist_wheel")
host.run_cmd(f"cd vision && {build_vars} python3 -m build --wheel --no-isolation")
vision_wheel_name = host.list_dir("vision/dist")[0]
embed_libgomp(host, use_conda, os.path.join("vision", "dist", vision_wheel_name))
@ -497,7 +463,7 @@ def build_torchdata(
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
host.run_cmd(f"cd data && {build_vars} python3 setup.py bdist_wheel")
host.run_cmd(f"cd data && {build_vars} python3 -m build --wheel --no-isolation")
wheel_name = host.list_dir("data/dist")[0]
embed_libgomp(host, use_conda, os.path.join("data", "dist", wheel_name))
@ -553,7 +519,7 @@ def build_torchtext(
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
host.run_cmd(f"cd text && {build_vars} python3 setup.py bdist_wheel")
host.run_cmd(f"cd text && {build_vars} python3 -m build --wheel --no-isolation")
wheel_name = host.list_dir("text/dist")[0]
embed_libgomp(host, use_conda, os.path.join("text", "dist", wheel_name))
@ -614,7 +580,7 @@ def build_torchaudio(
host.run_cmd(
f"cd audio && export FFMPEG_ROOT=$(pwd)/third_party/ffmpeg && export USE_FFMPEG=1 \
&& ./packaging/ffmpeg/build.sh \
&& {build_vars} python3 setup.py bdist_wheel"
&& {build_vars} python3 -m build --wheel --no-isolation"
)
wheel_name = host.list_dir("audio/dist")[0]
@ -700,7 +666,6 @@ def start_build(
configure_system(
host, compiler=compiler, use_conda=use_conda, python_version=python_version
)
build_OpenBLAS(host, git_clone_flags)
if host.using_docker():
print("Move libgfortant.a into a standard location")
@ -723,10 +688,12 @@ def start_build(
f"git clone --recurse-submodules -b {branch} https://github.com/pytorch/pytorch {git_clone_flags}"
)
host.run_cmd("pytorch/.ci/docker/common/install_openblas.sh")
print("Building PyTorch wheel")
build_opts = ""
if pytorch_build_number is not None:
build_opts += f" --build-number {pytorch_build_number}"
build_opts += f" -C--build-option=--build-number={pytorch_build_number}"
# Breakpad build fails on aarch64
build_vars = "USE_BREAKPAD=0 "
if branch == "nightly":
@ -743,15 +710,18 @@ def start_build(
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
if enable_mkldnn:
build_ArmComputeLibrary(host, git_clone_flags)
host.run_cmd("pytorch/.ci/docker/common/install_acl.sh")
print("build pytorch with mkldnn+acl backend")
build_vars += " USE_MKLDNN=ON USE_MKLDNN_ACL=ON"
build_vars += " BLAS=OpenBLAS"
build_vars += " OpenBLAS_HOME=/opt/OpenBLAS"
build_vars += " ACL_ROOT_DIR=/acl"
host.run_cmd(
f"cd $HOME/pytorch && export ACL_ROOT_DIR=$HOME/ComputeLibrary && {build_vars} python3 setup.py bdist_wheel{build_opts}"
f"cd $HOME/pytorch && {build_vars} python3 -m build --wheel --no-isolation{build_opts}"
)
print("Repair the wheel")
pytorch_wheel_name = host.list_dir("pytorch/dist")[0]
ld_library_path = "$HOME/acl/build:$HOME/pytorch/build/lib"
ld_library_path = "/acl/build:$HOME/pytorch/build/lib"
host.run_cmd(
f"export LD_LIBRARY_PATH={ld_library_path} && auditwheel repair $HOME/pytorch/dist/{pytorch_wheel_name}"
)
@ -763,7 +733,7 @@ def start_build(
else:
print("build pytorch without mkldnn backend")
host.run_cmd(
f"cd pytorch && {build_vars} python3 setup.py bdist_wheel{build_opts}"
f"cd pytorch && {build_vars} python3 -m build --wheel --no-isolation{build_opts}"
)
print("Deleting build folder")
@ -907,7 +877,7 @@ def terminate_instances(instance_type: str) -> None:
def parse_arguments():
from argparse import ArgumentParser
parser = ArgumentParser("Builid and test AARCH64 wheels using EC2")
parser = ArgumentParser("Build and test AARCH64 wheels using EC2")
parser.add_argument("--key-name", type=str)
parser.add_argument("--debug", action="store_true")
parser.add_argument("--build-only", action="store_true")

View File

@ -69,7 +69,8 @@ RUN bash ./install_cuda.sh 13.0
ENV DESIRED_CUDA=13.0
FROM ${ROCM_IMAGE} as rocm
ENV PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
ENV MKLROOT /opt/intel

View File

@ -36,6 +36,12 @@ case ${DOCKER_TAG_PREFIX} in
;;
rocm*)
BASE_TARGET=rocm
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
if [[ "$ROCM_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
fi
EXTRA_BUILD_ARGS="${EXTRA_BUILD_ARGS} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;
*)
echo "ERROR: Unknown docker tag ${DOCKER_TAG_PREFIX}"

View File

@ -84,8 +84,8 @@ fi
_UCX_COMMIT=7836b165abdbe468a2f607e7254011c07d788152
_UCC_COMMIT=430e241bf5d38cbc73fc7a6b89155397232e3f96
if [[ "$image" == *rocm* ]]; then
_UCX_COMMIT=cc312eaa4655c0cc5c2bcd796db938f90563bcf6
_UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d
_UCX_COMMIT=29831d319e6be55cb8c768ca61de335c934ca39e
_UCC_COMMIT=9f4b242cbbd8b1462cbc732eb29316cdfa124b77
fi
tag=$(echo $image | awk -F':' '{print $2}')
@ -175,20 +175,6 @@ case "$tag" in
fi
GCC_VERSION=11
VISION=yes
ROCM_VERSION=6.4
NINJA_VERSION=1.9.0
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
if [[ $tag =~ "benchmarks" ]]; then
INDUCTOR_BENCHMARKS=yes
fi
;;
pytorch-linux-noble-rocm-alpha-py3)
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
VISION=yes
ROCM_VERSION=7.0
NINJA_VERSION=1.9.0
TRITON=yes
@ -196,6 +182,9 @@ case "$tag" in
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
PYTORCH_ROCM_ARCH="gfx90a;gfx942;gfx950"
if [[ $tag =~ "benchmarks" ]]; then
INDUCTOR_BENCHMARKS=yes
fi
;;
pytorch-linux-jammy-xpu-n-1-py3)
ANACONDA_PYTHON_VERSION=3.10
@ -452,12 +441,3 @@ elif [ "$HAS_TRITON" = "yes" ]; then
echo "expecting triton to not be installed, but it is"
exit 1
fi
# Sanity check cmake version. Executorch reinstalls cmake and I'm not sure if
# they support 4.0.0 yet, so exclude them from this check.
CMAKE_VERSION=$(drun cmake --version)
if [[ "$EXECUTORCH" != *yes* && "$CMAKE_VERSION" != *4.* ]]; then
echo "CMake version is not 4.0.0:"
drun cmake --version
exit 1
fi

View File

@ -1 +1 @@
e0dda9059d082537cee36be6c5e4fe3b18c880c0
deb42f2a8e48f5032b4a98ee781a15fa87a157cf

View File

@ -1 +1 @@
v2.27.5-1
v2.27.5-1

View File

@ -1 +1 @@
bbb06c0334a6772b92d24bde54956e675c8c6604
27664085f804afc83df26f740bb46c365854f2c4

27
.ci/docker/common/install_acl.sh Normal file → Executable file
View File

@ -1,16 +1,27 @@
set -euo pipefail
#!/bin/bash
# Script used only in CD pipeline
readonly version=v25.02
readonly src_host=https://github.com/ARM-software
readonly src_repo=ComputeLibrary
set -eux
ACL_VERSION=${ACL_VERSION:-"v25.02"}
ACL_INSTALL_DIR="/acl"
# Clone ACL
[[ ! -d ${src_repo} ]] && git clone ${src_host}/${src_repo}.git
cd ${src_repo}
git checkout $version
git clone https://github.com/ARM-software/ComputeLibrary.git -b "${ACL_VERSION}" --depth 1 --shallow-submodules
ACL_CHECKOUT_DIR="ComputeLibrary"
# Build with scons
pushd $ACL_CHECKOUT_DIR
scons -j8 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 \
os=linux arch=armv8a build=native multi_isa=1 \
fixed_format_kernels=1 openmp=1 cppthreads=0
popd
# Install ACL
sudo mkdir -p ${ACL_INSTALL_DIR}
for d in arm_compute include utils support src build
do
sudo cp -r ${ACL_CHECKOUT_DIR}/${d} ${ACL_INSTALL_DIR}/${d}
done
rm -rf $ACL_CHECKOUT_DIR

View File

@ -19,8 +19,8 @@ pip_install \
transformers==4.36.2
pip_install coloredlogs packaging
pip_install onnxruntime==1.22.1
pip_install onnxscript==0.4.0
pip_install onnxruntime==1.23.0
pip_install onnxscript==0.5.3
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

12
.ci/docker/common/install_openblas.sh Normal file → Executable file
View File

@ -3,8 +3,10 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b "${OPENBLAS_VERSION:-v0.3.30}" --depth 1 --shallow-submodules
OPENBLAS_VERSION=${OPENBLAS_VERSION:-"v0.3.30"}
# Clone OpenBLAS
git clone https://github.com/OpenMathLib/OpenBLAS.git -b "${OPENBLAS_VERSION}" --depth 1 --shallow-submodules
OPENBLAS_CHECKOUT_DIR="OpenBLAS"
OPENBLAS_BUILD_FLAGS="
@ -17,5 +19,7 @@ CFLAGS=-O3
BUILD_BFLOAT16=1
"
make -j8 ${OPENBLAS_BUILD_FLAGS} -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} install -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} -C $OPENBLAS_CHECKOUT_DIR
sudo make install -C $OPENBLAS_CHECKOUT_DIR
rm -rf $OPENBLAS_CHECKOUT_DIR

View File

@ -42,12 +42,6 @@ EOF
rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu"
# Special case for ROCM_VERSION == 7.0
if [[ $(ver "$ROCM_VERSION") -eq $(ver 7.0) ]]; then
rocm_baseurl="https://repo.radeon.com/rocm/apt/7.0_alpha2"
amdgpu_baseurl="https://repo.radeon.com/amdgpu/30.10_alpha2/ubuntu"
fi
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
echo "deb [arch=amd64] ${amdgpu_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list

View File

@ -12,8 +12,8 @@ function do_install() {
rocm_version_nodot=${rocm_version//./}
# Version 2.7.2 + ROCm related updates
MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6
# https://github.com/icl-utk-edu/magma/pull/65
MAGMA_VERSION=d6e4117bc88e73f06d26c6c2e14f064e8fc3d1ec
magma_archive="magma-rocm${rocm_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
rocm_dir="/opt/rocm"

View File

@ -66,15 +66,15 @@ if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}"
# Triton needs at least gcc-9 to build
apt-get install -y g++-9
CXX=g++-9 conda_run python setup.py bdist_wheel
CXX=g++-9 conda_run python -m build --wheel --no-isolation
elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then
# Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get install -y g++-9
CXX=g++-9 conda_run python setup.py bdist_wheel
CXX=g++-9 conda_run python -m build --wheel --no-isolation
else
conda_run python setup.py bdist_wheel
conda_run python -m build --wheel --no-isolation
fi
# Copy the wheel to /opt for multi stage docker builds

View File

@ -0,0 +1,9 @@
#!/bin/bash
set -xe
# Script used in Linux x86 and aarch64 CD pipeline
# Workaround for exposing statically linked libstdc++ CXX11 ABI symbols.
# see: https://github.com/pytorch/pytorch/issues/133437
LIBNONSHARED=$(gcc -print-file-name=libstdc++_nonshared.a)
nm -g $LIBNONSHARED | grep " T " | grep recursive_directory_iterator | cut -c 20- > weaken-symbols.txt
objcopy --weaken-symbols weaken-symbols.txt $LIBNONSHARED $LIBNONSHARED

View File

@ -40,12 +40,16 @@ case ${DOCKER_TAG_PREFIX} in
;;
rocm*)
# we want the patch version of 6.4 instead
if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
if [[ "$GPU_ARCH_VERSION" == *"6.4"* ]]; then
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"
fi
BASE_TARGET=rocm
GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
if [[ "$GPU_ARCH_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
fi
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg ROCM_VERSION=${GPU_ARCH_VERSION}"
;;
*)

View File

@ -130,7 +130,8 @@ ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/op
RUN for cpython_version in "cp312-cp312" "cp313-cp313" "cp313-cp313t"; do \
/opt/python/${cpython_version}/bin/python -m pip install setuptools wheel; \
done;
ADD ./common/patch_libstdc.sh patch_libstdc.sh
RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh
# cmake-3.18.4 from pip; force in case cmake3 already exists
RUN yum install -y python3-pip && \

View File

@ -62,6 +62,13 @@ ARG OPENBLAS_VERSION
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
# Install Arm Compute Library
FROM base as arm_compute
# use python3.9 to install scons
RUN python3.9 -m pip install scons==4.7.0
RUN ln -sf /opt/python/cp39-cp39/bin/scons /usr/local/bin
COPY ./common/install_acl.sh install_acl.sh
RUN bash ./install_acl.sh && rm install_acl.sh
FROM base as final
# remove unnecessary python versions
@ -70,4 +77,7 @@ RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH
COPY --from=arm_compute /acl /acl
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:/acl/build/:$LD_LIBRARY_PATH
ADD ./common/patch_libstdc.sh patch_libstdc.sh
RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh

View File

@ -86,6 +86,15 @@ FROM base as nvpl
ADD ./common/install_nvpl.sh install_nvpl.sh
RUN bash ./install_nvpl.sh && rm install_nvpl.sh
# Install Arm Compute Library
FROM base as arm_compute
# use python3.9 to install scons
RUN python3.9 -m pip install scons==4.7.0
RUN ln -sf /opt/python/cp39-cp39/bin/scons /usr/local/bin
COPY ./common/install_acl.sh install_acl.sh
RUN bash ./install_acl.sh && rm install_acl.sh
FROM base as final
FROM final as cuda_final
ARG BASE_CUDA_VERSION
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
@ -93,5 +102,9 @@ COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BAS
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=nvpl /opt/nvpl/lib/ /usr/local/lib/
COPY --from=nvpl /opt/nvpl/include/ /usr/local/include/
COPY --from=arm_compute /acl /acl
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH=/acl/build/:$LD_LIBRARY_PATH
ADD ./common/patch_libstdc.sh patch_libstdc.sh
RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh

View File

@ -1,71 +0,0 @@
FROM centos:8 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ENV PATH /opt/rh/gcc-toolset-11/root/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# change to a valid repo
RUN sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-Linux-*.repo
# enable to install ninja-build
RUN sed -i 's|enabled=0|enabled=1|g' /etc/yum.repos.d/CentOS-Linux-PowerTools.repo
RUN yum -y update
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which zlib-devel sudo
RUN yum install -y autoconf automake make cmake gdb gcc-toolset-11-gcc-c++
FROM base as openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# Install python
FROM base as python
RUN yum install -y openssl-devel zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel libpcap-devel xz-devel libffi-devel
ADD common/install_cpython.sh install_cpython.sh
RUN bash ./install_cpython.sh && rm install_cpython.sh
FROM base as conda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
RUN /opt/conda/bin/conda install -y cmake
FROM base as intel
# Install MKL
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=conda /opt/conda /opt/conda
ENV PATH=/opt/conda/bin:$PATH
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM base as jni
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM base as final
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=intel /opt/intel /opt/intel
COPY --from=conda /opt/conda /opt/conda
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
RUN yum install -y ninja-build

View File

@ -28,6 +28,7 @@ fi
MANY_LINUX_VERSION=${MANY_LINUX_VERSION:-}
DOCKERFILE_SUFFIX=${DOCKERFILE_SUFFIX:-}
OPENBLAS_VERSION=${OPENBLAS_VERSION:-}
ACL_VERSION=${ACL_VERSION:-}
case ${image} in
manylinux2_28-builder:cpu)
@ -41,13 +42,6 @@ case ${image} in
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13 --build-arg NINJA_VERSION=1.12.1"
MANY_LINUX_VERSION="2_28_aarch64"
OPENBLAS_VERSION="v0.3.30"
;;
manylinuxcxx11-abi-builder:cpu-cxx11-abi)
TARGET=final
GPU_IMAGE=""
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
MANY_LINUX_VERSION="cxx11-abi"
;;
manylinuxs390x-builder:cpu-s390x)
TARGET=final
@ -82,7 +76,7 @@ case ${image} in
;;
manylinux2_28-builder:rocm*)
# we want the patch version of 6.4 instead
if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
if [[ "$GPU_ARCH_VERSION" == *"6.4"* ]]; then
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"
fi
TARGET=rocm_final
@ -90,6 +84,10 @@ case ${image} in
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
if [[ "$GPU_ARCH_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
fi
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;
manylinux2_28-builder:xpu)
@ -121,7 +119,8 @@ tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "OPENBLAS_VERSION=${OPENBLAS_VERSION}" \
--build-arg "OPENBLAS_VERSION=${OPENBLAS_VERSION:-}" \
--build-arg "ACL_VERSION=${ACL_VERSION:-}" \
--target "${TARGET}" \
-t "${tmp_tag}" \
$@ \

View File

@ -10,6 +10,11 @@ boto3==1.35.42
#Pinned versions: 1.19.12, 1.16.34
#test that import:
build==1.3.0
#Description: A simple, correct Python build frontend.
#Pinned versions: 1.3.0
#test that import:
click
#Description: Command Line Interface Creation Kit
#Pinned versions:
@ -47,10 +52,10 @@ flatbuffers==24.12.23
#Pinned versions: 24.12.23
#test that import:
hypothesis==5.35.1
hypothesis==6.56.4
# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
#Description: advanced library for generating parametrized tests
#Pinned versions: 5.35.1
#Pinned versions: 6.56.4
#test that import: test_xnnpack_integration.py, test_pruning_op.py, test_nn.py
junitparser==2.1.1
@ -93,7 +98,7 @@ librosa==0.10.2 ; python_version == "3.12" and platform_machine != "s390x"
#Pinned versions:
#test that import:
mypy==1.16.0 ; platform_system != "Windows"
mypy==1.16.0 ; platform_system == "Linux"
# Pin MyPy version because new errors are likely to appear with each release
# Skip on Windows as lots of type annotations are POSIX specific
#Description: linter
@ -106,10 +111,10 @@ networkx==2.8.8
#Pinned versions: 2.8.8
#test that import: functorch
ninja==1.11.1.3
ninja==1.11.1.4
#Description: build system. Used in some tests. Used in build to generate build
#time tracing information
#Pinned versions: 1.11.1.3
#Pinned versions: 1.11.1.4
#test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
numba==0.55.2 ; python_version == "3.10" and platform_machine != "s390x"
@ -164,12 +169,12 @@ optree==0.13.0
pillow==11.0.0
#Description: Python Imaging Library fork
#Pinned versions: 10.3.0
#Pinned versions: 11.0.0
#test that import:
protobuf==5.29.4
protobuf==5.29.5
#Description: Google's data interchange format
#Pinned versions: 5.29.4
#Pinned versions: 5.29.5
#test that import: test_tensorboard.py, test/onnx/*
psutil
@ -212,7 +217,7 @@ pytest-subtests==0.13.1
#Pinned versions:
#test that import:
xdoctest==1.1.0
xdoctest==1.3.0
#Description: runs doctests in pytest
#Pinned versions: 1.1.0
#test that import:
@ -263,7 +268,7 @@ scipy==1.14.1 ; python_version >= "3.12"
#test that import:
# needed by torchgen utils
typing-extensions>=4.10.0
typing-extensions==4.12.2
#Description: type hints for python
#Pinned versions:
#test that import:
@ -336,7 +341,7 @@ onnx==1.18.0
#Pinned versions:
#test that import:
onnxscript==0.4.0
onnxscript==0.5.3
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -356,9 +361,10 @@ pwlf==2.2.1
#test that import: test_sac_estimator.py
# To build PyTorch itself
pyyaml
pyyaml==6.0.2
pyzstd
setuptools>=70.1.0
setuptools==78.1.1
packaging==23.1
six
scons==4.5.2 ; platform_machine == "aarch64"
@ -373,13 +379,16 @@ dataclasses_json==0.6.7
#Pinned versions: 0.6.7
#test that import:
cmake==4.0.0
cmake==3.31.6
#Description: required for building
tlparse==0.4.0
#Description: required for log parsing
cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x"
filelock==3.18.0
#Description: required for inductor testing
cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x" and platform_system != "Darwin"
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
#test that import: test_cuda.py

View File

@ -1,8 +1,15 @@
sphinx==5.3.0
#Description: This is used to generate PyTorch docs
#Pinned versions: 5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@d53b0ffb9b1cda68260693ea98f3483823c88d8e#egg=pytorch_sphinx_theme2
standard-imghdr==3.13.0; python_version >= "3.13"
#Description: This is needed by Sphinx, so it needs to be added here.
# The reasons are as follows:
# 1) This module has been removed from the Python standard library since Python 3.13(https://peps.python.org/pep-0594/#imghdr);
# 2) The current version of Sphinx (5.3.0) is not compatible with Python 3.13.
# Once Sphinx is upgraded to a version compatible with Python 3.13 or later, we can remove this dependency.
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@71e55749be14ceb56e7f8211a9fb649866b87ad4#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought that it is probably
# something related to Docker setup. We can investigate this later.

View File

@ -1,11 +1,11 @@
SHELL=/usr/bin/env bash
DOCKER_CMD ?= docker
DESIRED_ROCM ?= 6.4
DESIRED_ROCM ?= 7.0
DESIRED_ROCM_SHORT = $(subst .,,$(DESIRED_ROCM))
PACKAGE_NAME = magma-rocm
# inherit this from underlying docker image, do not pass this env var to docker
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201
DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-v $(shell git rev-parse --show-toplevel)/.ci:/builder \
@ -16,6 +16,7 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
magma-rocm/build_magma.sh
.PHONY: all
all: magma-rocm70
all: magma-rocm64
all: magma-rocm63
@ -24,6 +25,11 @@ clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-rocm70
magma-rocm70: DESIRED_ROCM := 7.0
magma-rocm70:
$(DOCKER_RUN)
.PHONY: magma-rocm64
magma-rocm64: DESIRED_ROCM := 6.4
magma-rocm64:

View File

@ -6,8 +6,8 @@ set -eou pipefail
# The script expects DESIRED_CUDA and PACKAGE_NAME to be set
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
# Version 2.7.2 + ROCm related updates
MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6
# https://github.com/icl-utk-edu/magma/pull/65
MAGMA_VERSION=d6e4117bc88e73f06d26c6c2e14f064e8fc3d1ec
# Folders for the build
PACKAGE_FILES=${ROOT_DIR}/magma-rocm/package_files # metadata
@ -20,7 +20,7 @@ mkdir -p ${PACKAGE_DIR} ${PACKAGE_OUTPUT}/linux-64 ${PACKAGE_BUILD} ${PACKAGE_RE
# Fetch magma sources and verify checksum
pushd ${PACKAGE_DIR}
git clone https://bitbucket.org/icl/magma.git
git clone https://github.com/jeffdaily/magma
pushd magma
git checkout ${MAGMA_VERSION}
popd

View File

@ -142,7 +142,7 @@ time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
python -m build --wheel --no-isolation --outdir /tmp/$WHEELHOUSE_DIR
echo "Finished setup.py bdist at $(date)"
# Build libtorch packages

View File

@ -104,7 +104,7 @@ if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
export ROCclr_DIR=/opt/rocm/rocclr/lib/cmake/rocclr
fi
echo "Calling 'python -m pip install .' at $(date)"
echo "Calling -m pip install . -v --no-build-isolation at $(date)"
if [[ $LIBTORCH_VARIANT = *"static"* ]]; then
STATIC_CMAKE_FLAG="-DTORCH_STATIC=1"

View File

@ -107,6 +107,10 @@ if [[ $ROCM_INT -ge 60200 ]]; then
ROCM_SO_FILES+=("librocm-core.so")
fi
if [[ $ROCM_INT -ge 70000 ]]; then
ROCM_SO_FILES+=("librocroller.so")
fi
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* || "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"

View File

@ -89,7 +89,7 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
export USE_MKLDNN=1
export USE_MKLDNN_ACL=1
export ACL_ROOT_DIR=/ComputeLibrary
export ACL_ROOT_DIR=/acl
fi
if [[ "$BUILD_ENVIRONMENT" == *riscv64* ]]; then
@ -290,13 +290,13 @@ else
WERROR=1 python setup.py clean
WERROR=1 python setup.py bdist_wheel
WERROR=1 python -m build --wheel --no-isolation
else
python setup.py clean
if [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then
source .ci/pytorch/install_cache_xla.sh
fi
python setup.py bdist_wheel
python -m build --wheel --no-isolation
fi
pip_install_whl "$(echo dist/*.whl)"

View File

@ -67,7 +67,7 @@ fi
# wheels with cxx11-abi
echo "Checking that the gcc ABI is what we expect"
if [[ "$(uname)" != 'Darwin' ]]; then
if [[ "$(uname)" != 'Darwin' && "$(uname -m)" != "s390x" ]]; then
# We also check that there are cxx11 symbols in libtorch
#
echo "Checking that symbols in libtorch.so have the right gcc abi"

View File

@ -58,7 +58,7 @@ time python tools/setup_helpers/generate_code.py \
# Build the docs
pushd docs/cpp
time make VERBOSE=1 html -j
time make VERBOSE=1 html
popd
popd

View File

@ -36,11 +36,11 @@ fi
print_cmake_info
if [[ ${BUILD_ENVIRONMENT} == *"distributed"* ]]; then
# Needed for inductor benchmarks, as lots of HF networks make `torch.distribtued` calls
USE_DISTRIBUTED=1 USE_OPENMP=1 WERROR=1 python setup.py bdist_wheel
USE_DISTRIBUTED=1 USE_OPENMP=1 WERROR=1 python -m build --wheel --no-isolation
else
# Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel --plat-name macosx_11_0_arm64
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python -m build --wheel --no-isolation -C--build-option=--plat-name=macosx_11_0_arm64
fi
if which sccache > /dev/null; then
print_sccache_stats

View File

@ -26,6 +26,7 @@ if [[ "${SHARD_NUMBER:-2}" == "2" ]]; then
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl
time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering
time python test/run_test.py --verbose -i distributed/test_aten_comm_compute_reordering
time python test/run_test.py --verbose -i distributed/test_store
time python test/run_test.py --verbose -i distributed/test_symmetric_memory
time python test/run_test.py --verbose -i distributed/test_pg_wrapper

View File

@ -32,6 +32,9 @@ LIBTORCH_NAMESPACE_LIST = (
"torch::",
)
# Patterns for detecting statically linked libstdc++ symbols
STATICALLY_LINKED_CXX11_ABI = [re.compile(r".*recursive_directory_iterator.*")]
def _apply_libtorch_symbols(symbols):
return [
@ -53,12 +56,17 @@ def get_symbols(lib: str) -> list[tuple[str, str, str]]:
return [x.split(" ", 2) for x in lines.decode("latin1").split("\n")[:-1]]
def grep_symbols(lib: str, patterns: list[Any]) -> list[str]:
def grep_symbols(
lib: str, patterns: list[Any], symbol_type: str | None = None
) -> list[str]:
def _grep_symbols(
symbols: list[tuple[str, str, str]], patterns: list[Any]
) -> list[str]:
rc = []
for _s_addr, _s_type, s_name in symbols:
# Filter by symbol type if specified
if symbol_type and _s_type != symbol_type:
continue
for pattern in patterns:
if pattern.match(s_name):
rc.append(s_name)
@ -80,6 +88,18 @@ def grep_symbols(lib: str, patterns: list[Any]) -> list[str]:
return functools.reduce(list.__add__, (x.result() for x in tasks), [])
def check_lib_statically_linked_libstdc_cxx_abi_symbols(lib: str) -> None:
cxx11_statically_linked_symbols = grep_symbols(
lib, STATICALLY_LINKED_CXX11_ABI, symbol_type="T"
)
num_statically_linked_symbols = len(cxx11_statically_linked_symbols)
print(f"num_statically_linked_symbols (T): {num_statically_linked_symbols}")
if num_statically_linked_symbols > 0:
raise RuntimeError(
f"Found statically linked libstdc++ symbols (recursive_directory_iterator): {cxx11_statically_linked_symbols[:100]}"
)
def check_lib_symbols_for_abi_correctness(lib: str) -> None:
print(f"lib: {lib}")
cxx11_symbols = grep_symbols(lib, LIBTORCH_CXX11_PATTERNS)
@ -107,6 +127,7 @@ def main() -> None:
libtorch_cpu_path = str(install_root / "lib" / "libtorch_cpu.so")
check_lib_symbols_for_abi_correctness(libtorch_cpu_path)
check_lib_statically_linked_libstdc_cxx_abi_symbols(libtorch_cpu_path)
if __name__ == "__main__":

View File

@ -34,12 +34,14 @@ fi
# Patch numba to avoid CUDA-13 crash, see https://github.com/pytorch/pytorch/issues/162878
NUMBA_CUDA_DIR=$(python -c "import os;import numba.cuda; print(os.path.dirname(numba.cuda.__file__))" 2>/dev/null || true)
if [ -n "$NUMBA_CUDA_DIR" ]; then
NUMBA_PATCH="$(dirname "$(realpath "${BASH_SOURCE[0]}")")/numba-cuda-13.patch"
pushd "$NUMBA_CUDA_DIR"
patch -p4 <"$NUMBA_PATCH"
popd
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
NUMBA_CUDA_DIR=$(python -c "import os;import numba.cuda; print(os.path.dirname(numba.cuda.__file__))" 2>/dev/null || true)
if [ -n "$NUMBA_CUDA_DIR" ]; then
NUMBA_PATCH="$(dirname "$(realpath "${BASH_SOURCE[0]}")")/numba-cuda-13.patch"
pushd "$NUMBA_CUDA_DIR"
patch -p4 <"$NUMBA_PATCH"
popd
fi
fi
echo "Environment variables:"
@ -435,7 +437,7 @@ test_inductor_distributed() {
# this runs on both single-gpu and multi-gpu instance. It should be smart about skipping tests that aren't supported
# with if required # gpus aren't available
python test/run_test.py --include distributed/test_dynamo_distributed distributed/test_inductor_collectives distributed/test_compute_comm_reordering --verbose
python test/run_test.py --include distributed/test_dynamo_distributed distributed/test_inductor_collectives distributed/test_aten_comm_compute_reordering distributed/test_compute_comm_reordering --verbose
assert_git_not_dirty
}
@ -1415,7 +1417,7 @@ EOF
pip3 install -r requirements.txt
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
python setup.py bdist_wheel --bdist-dir="base_bdist_tmp" --dist-dir="base_dist"
python -m build --wheel --no-isolation -C--build-option=--bdist-dir="base_bdist_tmp" --outdir "base_dist"
python -mpip install base_dist/*.whl
echo "::endgroup::"
@ -1617,7 +1619,7 @@ test_operator_benchmark() {
test_inductor_set_cpu_affinity
cd benchmarks/operator_benchmark/pt_extension
python -m pip install .
python -m pip install . -v --no-build-isolation
cd "${TEST_DIR}"/benchmarks/operator_benchmark
$TASKSET python -m benchmark_all_test --device "$1" --tag-filter "$2" \
@ -1630,6 +1632,25 @@ test_operator_benchmark() {
--expected "expected_ci_operator_benchmark_eager_float32_cpu.csv"
}
test_operator_microbenchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
TEST_DIR=$(pwd)
cd benchmarks/operator_benchmark/pt_extension
python -m pip install .
cd "${TEST_DIR}"/benchmarks/operator_benchmark
for OP_BENCHMARK_TESTS in matmul mm addmm bmm; do
$TASKSET python -m pt.${OP_BENCHMARK_TESTS}_test --tag-filter long \
--output-json-for-dashboard "${TEST_REPORTS_DIR}/operator_microbenchmark_${OP_BENCHMARK_TESTS}_compile.json" \
--benchmark-name "PyTorch operator microbenchmark" --use-compile
$TASKSET python -m pt.${OP_BENCHMARK_TESTS}_test --tag-filter long \
--output-json-for-dashboard "${TEST_REPORTS_DIR}/operator_microbenchmark_${OP_BENCHMARK_TESTS}.json" \
--benchmark-name "PyTorch operator microbenchmark"
done
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
(cd test && python -c "import torch; print(torch.__config__.show())")
@ -1686,6 +1707,8 @@ elif [[ "${TEST_CONFIG}" == *operator_benchmark* ]]; then
test_operator_benchmark cpu ${TEST_MODE}
fi
elif [[ "${TEST_CONFIG}" == *operator_microbenchmark* ]]; then
test_operator_microbenchmark
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then
@ -1794,6 +1817,8 @@ elif [[ "${TEST_CONFIG}" == h100_distributed ]]; then
test_h100_distributed
elif [[ "${TEST_CONFIG}" == "h100-symm-mem" ]]; then
test_h100_symm_mem
elif [[ "${TEST_CONFIG}" == "b200-symm-mem" ]]; then
test_h100_symm_mem
elif [[ "${TEST_CONFIG}" == h100_cutlass_backend ]]; then
test_h100_cutlass_backend
else

View File

@ -0,0 +1,32 @@
#!/bin/bash
set -ex -o pipefail
# Suppress ANSI color escape sequences
export TERM=vt100
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
echo "Environment variables"
env
echo "Testing FA3 stable wheel still works with currently built torch"
echo "Installing ABI Stable FA3 wheel"
# The wheel was built on https://github.com/Dao-AILab/flash-attention/commit/b3846b059bf6b143d1cd56879933be30a9f78c81
# on torch nightly torch==2.9.0.dev20250830+cu129
$MAYBE_SUDO pip -q install https://s3.amazonaws.com/ossci-linux/wheels/flash_attn_3-3.0.0b1-cp39-abi3-linux_x86_64.whl
pushd flash-attention/hopper
export PYTHONPATH=$PWD
pytest -v -s \
"test_flash_attn.py::test_flash_attn_output[1-1-192-False-False-False-0.0-False-False-mha-dtype0]" \
"test_flash_attn.py::test_flash_attn_varlen_output[511-1-64-True-False-False-0.0-False-False-gqa-dtype2]" \
"test_flash_attn.py::test_flash_attn_kvcache[1-128-128-False-False-True-None-0.0-False-False-True-False-True-False-gqa-dtype0]" \
"test_flash_attn.py::test_flash_attn_race_condition[97-97-192-True-dtype0]" \
"test_flash_attn.py::test_flash_attn_combine[2-3-64-dtype1]" \
"test_flash_attn.py::test_flash3_bw_compatibility"
popd

View File

@ -70,7 +70,7 @@ sccache --zero-stats
sccache --show-stats
# Build the wheel
python setup.py bdist_wheel
python -m build --wheel --no-build-isolation
if ($LASTEXITCODE -ne 0) { exit 1 }
# Install the wheel locally

View File

@ -38,10 +38,12 @@ if errorlevel 1 goto fail
if not errorlevel 0 goto fail
:: Update CMake
:: TODO: Investigate why this helps MKL detection, even when CMake from choco is not used
call choco upgrade -y cmake --no-progress --installargs 'ADD_CMAKE_TO_PATH=System' --apply-install-arguments-to-dependencies --version=3.27.9
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
:: TODO: Move to .ci/docker/requirements-ci.txt
call pip install mkl==2024.2.0 mkl-static==2024.2.0 mkl-include==2024.2.0
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
@ -130,7 +132,7 @@ if "%USE_CUDA%"=="1" (
:: Print all existing environment variable for debugging
set
python setup.py bdist_wheel
python -m build --wheel --no-isolation
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
sccache --show-stats

View File

@ -37,27 +37,8 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda"
fi
# TODO: Move both of them to Windows AMI
python -m pip install tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1
# Copied from https://github.com/pytorch/test-infra/blob/be01a40157c36cd5a48391fdf44a7bc3ebd4c7e3/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1#L16 with some adjustments
# pytest-rerunfailures==10.3 as 10.2 fails with INTERNALERROR> pluggy._manager.PluginValidationError: unknown hook 'pytest_configure_node'
# scipy from 1.6.3 to 1.10
# expecttest from 0.1.3 to 0.3.0
# xdoctest from 1.0.2 to 1.3.0
python -m pip install "future==0.18.2" "hypothesis==5.35.1" "expecttest==0.3.0" "librosa>=0.6.2" "scipy==1.10.1" "psutil==5.9.1" "pynvml==11.4.1" "pillow==9.2.0" "unittest-xml-reporting<=3.2.0,>=2.0.0" "pytest==7.1.3" "pytest-xdist==2.5.0" "pytest-flakefinder==1.1.0" "pytest-rerunfailures==10.3" "pytest-shard==0.1.2" "sympy==1.11.1" "xdoctest==1.3.0" "pygments==2.12.0" "opt-einsum>=3.3" "networkx==2.8.8" "mpmath==1.2.1" "pytest-cpp==2.3.0" "boto3==1.35.42"
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver==4.15.1.0
# Install tlparse for test\dynamo\test_structured_trace.py UTs.
python -m pip install tlparse==0.4.0
# Install parameterized
python -m pip install parameterized==0.8.1
# Install pulp for testing ilps under torch\distributed\_tools
python -m pip install pulp==2.9.0
# TODO: Move this to .ci/docker/requirements-ci.txt
python -m pip install "psutil==5.9.1" "pynvml==11.4.1" "pytest-shard==0.1.2"
run_tests() {
# Run nvidia-smi if available

View File

@ -48,7 +48,7 @@ sccache --zero-stats
sccache --show-stats
:: Call PyTorch build script
python setup.py bdist_wheel -d "%PYTORCH_FINAL_PACKAGE_DIR%"
python -m build --wheel --no-isolation --outdir "%PYTORCH_FINAL_PACKAGE_DIR%"
:: show sccache stats
sccache --show-stats

View File

@ -37,10 +37,10 @@ IF "%CUDA_PATH_V128%"=="" (
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=6.1;7.0;7.5;8.0;8.6;9.0;10.0;12.0
set TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120
)
set "CUDA_PATH=%CUDA_PATH_V128%"

View File

@ -28,5 +28,5 @@ start /wait "" python-amd64.exe /quiet InstallAllUsers=1 PrependPath=0 Include_t
if errorlevel 1 exit /b 1
set "PATH=%CD%\Python\Scripts;%CD%\Python;%PATH%"
%PYTHON_EXEC% -m pip install --upgrade pip setuptools packaging wheel
%PYTHON_EXEC% -m pip install --upgrade pip setuptools packaging wheel build
if errorlevel 1 exit /b 1

View File

@ -86,7 +86,7 @@ copy /Y "%LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip" "%PYTORCH_FINAL_PACKAGE_
goto build_end
:pytorch
%PYTHON_EXEC% setup.py bdist_wheel -d "%PYTORCH_FINAL_PACKAGE_DIR%"
%PYTHON_EXEC% -m build --wheel --no-isolation --outdir "%PYTORCH_FINAL_PACKAGE_DIR%"
:build_end
IF ERRORLEVEL 1 exit /b 1

View File

@ -63,7 +63,7 @@ if errorlevel 1 exit /b 1
call %CONDA_HOME%\condabin\activate.bat testenv
if errorlevel 1 exit /b 1
call conda install -y -q -c conda-forge libuv=1.39
call conda install -y -q -c conda-forge libuv=1.51
call conda install -y -q intel-openmp
echo "install and test libtorch"

View File

@ -18,7 +18,7 @@ if "%DESIRED_PYTHON%" == "3.9" %PYTHON_EXEC% -m pip install numpy==2.0.2 cmake
%PYTHON_EXEC% -m pip install pyyaml
%PYTHON_EXEC% -m pip install mkl-include mkl-static
%PYTHON_EXEC% -m pip install boto3 ninja typing_extensions setuptools==72.1.0
%PYTHON_EXEC% -m pip install boto3 requests ninja typing_extensions setuptools==72.1.0
where cmake.exe

View File

@ -143,7 +143,8 @@ case $desired_python in
RENAME_WHEEL=false
;;
3.13t)
echo "Using 3.13 deps"
echo "Using 3.13t deps"
mac_version='macosx-11.0-arm64'
NUMPY_PINNED_VERSION="==2.1.0"
RENAME_WHEEL=false
;;
@ -185,11 +186,11 @@ export USE_QNNPACK=OFF
export BUILD_TEST=OFF
pushd "$pytorch_rootdir"
echo "Calling setup.py bdist_wheel at $(date)"
echo "Calling -m build --wheel --no-isolation at $(date)"
_PYTHON_HOST_PLATFORM=${mac_version} ARCHFLAGS="-arch arm64" python setup.py bdist_wheel -d "$whl_tmp_dir" --plat-name "${mac_version//[-.]/_}"
_PYTHON_HOST_PLATFORM=${mac_version} ARCHFLAGS="-arch arm64" python -m build --wheel --no-isolation --outdir "$whl_tmp_dir" -C--plat-name="${mac_version//[-.]/_}"
echo "Finished setup.py bdist_wheel at $(date)"
echo "Finished -m build --wheel --no-isolation at $(date)"
if [[ $package_type != 'libtorch' ]]; then
echo "delocating wheel dependencies"

View File

@ -1,47 +0,0 @@
#!/bin/bash
# =================== The following code **should** be executed inside Docker container ===================
# Install dependencies
sudo apt-get -y update
sudo apt-get -y install expect-dev
# This is where the local pytorch install in the docker image is located
pt_checkout="/var/lib/jenkins/workspace"
source "$pt_checkout/.ci/pytorch/common_utils.sh"
echo "functorch_doc_push_script.sh: Invoked with $*"
set -ex
version=${DOCS_VERSION:-nightly}
echo "version: $version"
# Build functorch docs
pushd $pt_checkout/functorch/docs
pip -q install -r requirements.txt
make html
popd
git clone https://github.com/pytorch/functorch -b gh-pages --depth 1 functorch_ghpages
pushd functorch_ghpages
if [ $version == "main" ]; then
version=nightly
fi
git rm -rf "$version" || true
mv "$pt_checkout/functorch/docs/build/html" "$version"
git add "$version" || true
git status
git config user.email "soumith+bot@pytorch.org"
git config user.name "pytorchbot"
# If there aren't changes, don't make a commit; push is no-op
git commit -m "Generate Python docs from pytorch/pytorch@${GITHUB_SHA}" || true
git status
if [[ "${WITH_PUSH:-}" == true ]]; then
git push -u origin gh-pages
fi
popd
# =================== The above code **should** be executed inside Docker container ===================

View File

@ -59,16 +59,19 @@ performance-*,
-performance-enum-size,
readability-container-size-empty,
readability-delete-null-pointer,
readability-duplicate-include
readability-duplicate-include,
readability-misplaced-array-index,
readability-redundant*
readability-redundant*,
readability-simplify-subscript-expr,
readability-string-compare,
-readability-redundant-access-specifiers,
-readability-redundant-control-flow,
-readability-redundant-inline-specifier,
'
HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
WarningsAsErrors: '*'
LineFilter:
- name: '/usr/include/.*'
CheckOptions:
cppcoreguidelines-special-member-functions.AllowSoleDefaultDtor: true
cppcoreguidelines-special-member-functions.AllowImplicitlyDeletedCopyOrMove: true

View File

@ -1,6 +1,10 @@
---
name: "⚠️ CI SEV"
about: Tracking incidents for PyTorch's CI infra.
title: ''
labels: ''
assignees: ''
---
> NOTE: Remember to label this issue with "`ci: sev`"

View File

@ -0,0 +1,18 @@
---
name: DISABLE AUTOREVERT
about: Disables autorevert when open
title: "❌​\U0001F519 [DISABLE AUTOREVERT]"
labels: 'ci: disable-autorevert'
assignees: ''
---
This issue, while open, disables the autorevert functionality.
More details can be found [here](https://github.com/pytorch/test-infra/blob/main/aws/lambda/pytorch-auto-revert/README.md)
## Why are you disabling autorevert?
## Links to any issues/commits/errors that shows the source of problem

View File

@ -1,8 +1,10 @@
---
name: Disable CI jobs (PyTorch Dev Infra only)
about: Use this template to disable CI jobs
title: "DISABLED [WORKFLOW_NAME] / [PLATFORM_NAME] / [JOB_NAME]"
labels: "module: ci"
title: DISABLED [WORKFLOW_NAME] / [PLATFORM_NAME] / [JOB_NAME]
labels: 'module: ci'
assignees: ''
---
> For example, DISABLED pull / win-vs2022-cpu-py3 / test (default). Once

View File

@ -22,6 +22,9 @@ self-hosted-runner:
- linux.arm64.m7g.4xlarge
- linux.arm64.m7g.4xlarge.ephemeral
- linux.arm64.r7g.12xlarge.memory
- linux.aws.h100
- linux.aws.h100.4
- linux.aws.h100.8
- linux.4xlarge.nvidia.gpu
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu

View File

@ -23,9 +23,6 @@ runs:
run: |
.github\scripts\kill_active_ssh_sessions.ps1
- name: Clean up leftover processes on non-ephemeral Windows runner
uses: pytorch/test-infra/.github/actions/cleanup-runner@main
# Cleaning up Windows workspace sometimes fails flakily with device or resource busy
# error, meaning one or more processes haven't stopped completely yet. So trying to
# retry this step several time similar to how checkout-pytorch GHA does

View File

@ -1 +1 @@
1983609239caaab24ab1ed2bfa2aa92e8c76c1b1
0ad9951c416d33c5da4f7a504fb162cbe62386f5

View File

@ -1 +1 @@
c77852e117bdf056c8e9a087e51d6f65cf6ba53d
2a9138a26ee257fef05310ad3fecf7c55fe80d73

View File

@ -202,7 +202,7 @@ ARG max_jobs=16
ENV MAX_JOBS=${max_jobs}
ARG nvcc_threads=4
ENV NVCC_THREADS=$nvcc_threads
ARG torch_cuda_arch_list='8.0;8.6;8.9;9.0'
ARG torch_cuda_arch_list='8.0 8.6 8.9 9.0'
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
ARG USE_SCCACHE
@ -297,16 +297,28 @@ RUN echo "[INFO] Listing current directory before torch install step:" && \
echo "[INFO] Showing torch_build_versions.txt content:" && \
cat torch_build_versions.txt
# Install build and runtime dependencies, this is needed for flashinfer install
COPY requirements/build.txt requirements/build.txt
COPY use_existing_torch.py use_existing_torch.py
RUN python3 use_existing_torch.py
RUN cat requirements/build.txt
# Install uv for faster pip installs if not existed
RUN --mount=type=cache,target=/root/.cache/uv \
if ! python3 -m uv --version > /dev/null 2>&1; then \
python3 -m pip install uv==0.8.4; \
fi
ENV UV_HTTP_TIMEOUT=500
ENV UV_INDEX_STRATEGY="unsafe-best-match"
# Use copy mode to avoid hardlink failures with Docker cache mounts
ENV UV_LINK_MODE=copy
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system -r requirements/build.txt
# Default mount file as placeholder, this just avoid the mount error
ARG TORCH_WHEELS_PATH="./requirements"
# Install torch, torchaudio and torchvision
@ -332,13 +344,11 @@ RUN --mount=type=cache,target=/root/.cache/uv \
# Install xformers wheel from previous stage
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system /wheels/xformers/*.whl --verbose
# Build flashinfer from source.
ARG torch_cuda_arch_list='8.0;8.9;9.0a;10.0a;12.0'
# install package for build flashinfer
# see issue: https://github.com/flashinfer-ai/flashinfer/issues/738
RUN pip install build==1.3.0
RUN pip freeze | grep -E 'setuptools|packaging|build'
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}

View File

@ -1,9 +1,14 @@
import glob
import os
requires_files = glob.glob("requirements/*.txt")
requires_files += ["pyproject.toml"]
for file in requires_files:
if not os.path.exists(file):
print(f"!!! skipping missing {file}")
continue
print(f">>> cleaning {file}")
with open(file) as f:
lines = f.readlines()

View File

@ -1,43 +1,44 @@
tracking_issue: 24422
ciflow_tracking_issue: 64124
ciflow_push_tags:
- ciflow/b200
- ciflow/b200-symm-mem
- ciflow/binaries
- ciflow/binaries_libtorch
- ciflow/binaries_wheel
- ciflow/triton_binaries
- ciflow/h100
- ciflow/h100-cutlass-backend
- ciflow/h100-distributed
- ciflow/h100-symm-mem
- ciflow/inductor
- ciflow/inductor-periodic
- ciflow/inductor-rocm
- ciflow/inductor-perf-test-nightly-rocm
- ciflow/inductor-perf-compare
- ciflow/inductor-cu126
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
- ciflow/inductor-perf-compare
- ciflow/inductor-perf-test-nightly-rocm
- ciflow/inductor-perf-test-nightly-x86-zen
- ciflow/inductor-cu126
- ciflow/inductor-periodic
- ciflow/inductor-rocm
- ciflow/linux-aarch64
- ciflow/mps
- ciflow/nightly
- ciflow/op-benchmark
- ciflow/periodic
- ciflow/periodic-rocm-mi300
- ciflow/pull
- ciflow/quantization-periodic
- ciflow/riscv64
- ciflow/rocm
- ciflow/rocm-mi300
- ciflow/s390
- ciflow/riscv64
- ciflow/slow
- ciflow/torchbench
- ciflow/triton_binaries
- ciflow/trunk
- ciflow/unstable
- ciflow/xpu
- ciflow/vllm
- ciflow/torchbench
- ciflow/op-benchmark
- ciflow/pull
- ciflow/h100
- ciflow/h100-distributed
- ciflow/win-arm64
- ciflow/h100-symm-mem
- ciflow/h100-cutlass-backend
- ciflow/b200
- ciflow/xpu
retryable_workflows:
- pull
- trunk
@ -46,4 +47,4 @@ retryable_workflows:
- inductor-A100-perf-nightly
labeler_config: labeler.yml
label_to_label_config: label_to_label.yml
mergebot: True
mergebot: true

View File

@ -1,36 +0,0 @@
boto3==1.35.42
cmake==3.27.*
expecttest==0.3.0
fbscribelogger==0.1.7
filelock==3.18.0
hypothesis==6.56.4
librosa>=0.6.2
mpmath==1.3.0
networkx==2.8.7
ninja==1.10.2.4
numba==0.59.0
numpy==1.26.4
opt-einsum>=3.3
optree==0.13.0
packaging==23.1
parameterized==0.8.1
pillow==10.3.0
protobuf==5.29.5
psutil==5.9.8
pygments==2.15.0
pytest-cpp==2.3.0
pytest-flakefinder==1.1.0
pytest-rerunfailures==10.3
pytest-subtests==0.13.1
pytest-xdist==3.3.1
pytest==7.3.2
pyyaml==6.0.2
scipy==1.12.0
setuptools==78.1.1
sympy==1.13.3
tlparse==0.4.0
tensorboard==2.13.0
typing-extensions==4.12.2
unittest-xml-reporting<=3.2.0,>=2.0.0
xdoctest==1.1.0
z3-solver==4.15.1.0

View File

@ -502,6 +502,7 @@ def perform_misc_tasks(
job_name: str,
pr_body: str,
branch: Optional[str] = None,
tag: Optional[str] = None,
) -> None:
"""
In addition to apply the filter logic, the script also does the following
@ -509,7 +510,9 @@ def perform_misc_tasks(
"""
set_output(
"keep-going",
branch == MAIN_BRANCH or check_for_setting(labels, pr_body, "keep-going"),
branch == MAIN_BRANCH
or bool(tag and re.match(r"^trunk/[a-f0-9]{40}$", tag))
or check_for_setting(labels, pr_body, "keep-going"),
)
set_output(
"ci-verbose-test-logs",
@ -634,6 +637,7 @@ def main() -> None:
job_name=args.job_name,
pr_body=pr_body if pr_body else "",
branch=args.branch,
tag=tag,
)
# Set the filtered test matrix as the output

View File

@ -30,7 +30,7 @@ CUDA_ARCHES_CUDNN_VERSION = {
}
# NOTE: Please also update the ROCm sources in `PIP_SOURCES` in tools/nightly.py when changing this
ROCM_ARCHES = ["6.3", "6.4"]
ROCM_ARCHES = ["6.4", "7.0"]
XPU_ARCHES = ["xpu"]

View File

@ -127,53 +127,6 @@ LINUX_BINARY_BUILD_WORFKLOWS = [
),
]
ROCM_SMOKE_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_variant="rocm",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["6.4"],
python_versions=["3.10"],
),
ciflow_config=CIFlowConfig(
labels={
LABEL_CIFLOW_BINARIES,
LABEL_CIFLOW_BINARIES_WHEEL,
LABEL_CIFLOW_ROCM,
},
isolated_workflow=True,
),
branches="main",
),
]
LINUX_BINARY_SMOKE_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["13.0"],
python_versions=["3.12"],
),
branches="main",
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="libtorch",
build_variant=generate_binary_build_matrix.RELEASE,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.LINUX,
generate_binary_build_matrix.RELEASE,
arches=["cpu"],
libtorch_variants=["shared-with-deps"],
),
branches="main",
),
]
WINDOWS_BINARY_BUILD_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.WINDOWS,
@ -259,39 +212,6 @@ WINDOWS_BINARY_BUILD_WORKFLOWS = [
),
]
WINDOWS_BINARY_SMOKE_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.WINDOWS,
package_type="libtorch",
build_variant=generate_binary_build_matrix.RELEASE,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.WINDOWS,
generate_binary_build_matrix.RELEASE,
arches=["cpu"],
libtorch_variants=["shared-with-deps"],
),
branches="main",
ciflow_config=CIFlowConfig(
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.WINDOWS,
package_type="libtorch",
build_variant=generate_binary_build_matrix.DEBUG,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.WINDOWS,
generate_binary_build_matrix.DEBUG,
arches=["cpu"],
libtorch_variants=["shared-with-deps"],
),
branches="main",
ciflow_config=CIFlowConfig(
isolated_workflow=True,
),
),
]
MACOS_BINARY_BUILD_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.MACOS_ARM64,
@ -372,23 +292,10 @@ def main() -> None:
jinja_env.get_template("linux_binary_build_workflow.yml.j2"),
S390X_BINARY_BUILD_WORKFLOWS,
),
(
# Give rocm it's own workflow file
jinja_env.get_template("linux_binary_build_workflow.yml.j2"),
ROCM_SMOKE_WORKFLOWS,
),
(
jinja_env.get_template("linux_binary_build_workflow.yml.j2"),
LINUX_BINARY_SMOKE_WORKFLOWS,
),
(
jinja_env.get_template("windows_binary_build_workflow.yml.j2"),
WINDOWS_BINARY_BUILD_WORKFLOWS,
),
(
jinja_env.get_template("windows_binary_build_workflow.yml.j2"),
WINDOWS_BINARY_SMOKE_WORKFLOWS,
),
(
jinja_env.get_template("macos_binary_build_workflow.yml.j2"),
MACOS_BINARY_BUILD_WORKFLOWS,

View File

@ -67,7 +67,7 @@ jobs:
# an OOM issue when running the job, so this upgrades the runner from 4xlarge
# to the next available tier of 12xlarge. So much memory just to generate cpp
# doc
runner: ${{ inputs.runner_prefix }}linux.12xlarge
runner: ${{ inputs.runner_prefix }}linux.12xlarge.memory
# TODO: Nightly cpp docs take longer and longer to finish (more than 3h now)
# Let's try to figure out how this can be improved
timeout-minutes: 360

View File

@ -40,6 +40,15 @@ jobs:
# Use gh CLI to get changed files in the PR with explicit repo
CHANGED_FILES=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/files --paginate --jq '.[] | select(.status != "removed") | .filename' | tr '\n' ' ' | sed 's/ $//')
# See https://github.com/pytorch/pytorch/pull/134215#issuecomment-2332128790
PYI_FILES_TO_ADD=""
for file in ${CHANGED_FILES}; do
if [[ "${file}" == *".pyi.in" ]]; then
PYI_FILES_TO_ADD="${PYI_FILES_TO_ADD} ${file//.in/}"
fi
done
CHANGED_FILES="${CHANGED_FILES}${PYI_FILES_TO_ADD}"
if [ -z "$CHANGED_FILES" ]; then
echo "No changed files found, setting to '*'"
CHANGED_FILES="*"

View File

@ -0,0 +1,255 @@
# The point of this workflow is to test that a FA3 wheel that was built based off the
# stable ABI as of torch nightly 20250830 can still run on the newer torch.
#
# This workflow is very similar to the _linux-test.yml workflow, with the following
# differences:
# 1. It is simpler (there is no test matrix)
# 2. It pulls flash-attention as a secondary repository in order to access the tests.
# Note that it does not BUILD anything from flash-attention, as we have a prebuilt
# wheel. We pull flash-attention only to run a few tests.
# 3. It runs only FA3 tests. No PyTorch tests are run.
name: linux-test-stable-fa3
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
docker-image:
required: true
type: string
description: Docker image to run in.
timeout-minutes:
required: false
type: number
default: 30
description: |
Set the maximum (in minutes) how long the workflow should take to finish
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
VLLM_TEST_HUGGING_FACE_TOKEN:
required: false
description: |
HF Auth token to test vllm
SCRIBE_GRAPHQL_ACCESS_TOKEN:
required: false
description: |
FB app token to write to scribe endpoint
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
jobs:
test:
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on: linux.aws.h100
timeout-minutes: ${{ inputs.timeout-minutes || 30 }}
permissions:
id-token: write
contents: read
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
no-sudo: true
- name: Checkout flash-attention as a secondary repository
uses: actions/checkout@v4
with:
repository: Dao-AILab/flash-attention
path: flash-attention
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: ${{ inputs.docker-image }}
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*:}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a container runner
shell: bash
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Setup GPU_FLAG for docker run
id: setup-gpu-flag
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
id: setup-sscache-port-flag
run: echo "SCCACHE_SERVER_PORT_DOCKER_FLAG=-e SCCACHE_SERVER_PORT=$((RUNNER_UID + 4226))" >> "${GITHUB_ENV}"
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
- name: Download build artifacts
uses: ./.github/actions/download-build-artifacts
with:
name: ${{ inputs.build-environment }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Parse ref
id: parse-ref
run: .github/scripts/parse_ref.py
- name: Set Test step time
id: test-timeout
shell: bash
env:
JOB_TIMEOUT: ${{ inputs.timeout-minutes }}
run: |
echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}"
- name: Preserve github env variables for use in docker
shell: bash
run: |
env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
- name: Test
id: test
timeout-minutes: ${{ fromJson(steps.test-timeout.outputs.timeout) }}
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
PR_NUMBER: ${{ github.event.pull_request.number }}
GITHUB_REPOSITORY: ${{ github.repository }}
GITHUB_WORKFLOW: ${{ github.workflow }}
GITHUB_JOB: ${{ github.job }}
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
GITHUB_RUN_ATTEMPT: ${{ github.run_attempt }}
JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
BASE_SHA: ${{ github.event.pull_request.base.sha || github.sha }}
SHM_SIZE: '2g'
DOCKER_IMAGE: ${{ inputs.docker-image }}
VLLM_TEST_HUGGING_FACE_TOKEN: ${{ secrets.VLLM_TEST_HUGGING_FACE_TOKEN }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
ARTIFACTS_FILE_SUFFIX: ${{ github.job }}-${{ steps.get-job-id.outputs.job-id }}
run: |
set -x
TEST_COMMAND=.ci/pytorch/test_fa3_abi_stable.sh
# Leaving 1GB for the runner and other things
TOTAL_AVAILABLE_MEMORY_IN_GB=$(awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo)
# https://docs.docker.com/engine/containers/resource_constraints/#--memory-swap-details, the 3GB swap
# comes from https://github.com/pytorch/test-infra/pull/6058
TOTAL_MEMORY_WITH_SWAP=$(("${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}" + 3))
SHM_OPTS="--shm-size=${SHM_SIZE}"
JENKINS_USER="--user jenkins"
DOCKER_SHELL_CMD=
# detached container should get cleaned up by teardown_ec2_linux
# TODO: Stop building test binaries as part of the build phase
# Used for GPU_FLAG, SHM_OPTS, JENKINS_USER and DOCKER_SHELL_CMD since that doesn't play nice
# shellcheck disable=SC2086,SC2090
container_name=$(docker run \
${GPU_FLAG:-} \
${SCCACHE_SERVER_PORT_DOCKER_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e PR_NUMBER \
-e GITHUB_ACTIONS \
-e GITHUB_REPOSITORY \
-e GITHUB_WORKFLOW \
-e GITHUB_JOB \
-e GITHUB_RUN_ID \
-e GITHUB_RUN_NUMBER \
-e GITHUB_RUN_ATTEMPT \
-e JOB_ID \
-e JOB_NAME \
-e BASE_SHA \
-e BRANCH \
-e SHA1 \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e HUGGING_FACE_HUB_TOKEN \
-e VLLM_TEST_HUGGING_FACE_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e ARTIFACTS_FILE_SUFFIX \
--memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \
--memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--ipc=host \
${SHM_OPTS} \
--tty \
--detach \
--name="${container_name}" \
${JENKINS_USER} \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}" \
${DOCKER_SHELL_CMD}
)
echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}"
docker exec -t "${container_name}" sh -c "python3 -m pip install $(echo dist/*.whl)[opt-einsum] && ${TEST_COMMAND}"
- name: Collect backtraces from coredumps (if any)
if: always()
run: |
# shellcheck disable=SC2156
find . -iname "core.[1-9]*" -exec docker exec "${DOCKER_CONTAINER_ID}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \;
- name: Store Core dumps on S3
uses: seemethere/upload-artifact-s3@baba72d0712b404f646cebe0730933554ebce96a # v5.1.0
if: failure()
with:
name: coredumps-fa3-stable-abi-smoke-tests
retention-days: 14
if-no-files-found: ignore
path: ./**/core.[1-9]*
- name: Upload utilization stats
if: ${{ always() && steps.test.conclusion && steps.test.conclusion != 'skipped' }}
continue-on-error: true
uses: ./.github/actions/upload-utilization-stats
with:
job_id: ${{ steps.get-job-id.outputs.job-id }}
job_name: ${{ steps.get-job-id.outputs.job-name }}
workflow_name: ${{ github.workflow }}
workflow_run_id: ${{github.run_id}}
workflow_attempt: ${{github.run_attempt}}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always() && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false'

View File

@ -273,6 +273,8 @@ jobs:
TEST_CONFIG: ${{ matrix.config }}
SHARD_NUMBER: ${{ matrix.shard }}
NUM_TEST_SHARDS: ${{ matrix.num_shards }}
EXTRA_FLAGS: ${{ matrix.extra_flags || '' }}
OP_BENCHMARK_TESTS: ${{ matrix.op_benchmark_tests }}
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}

View File

@ -85,7 +85,7 @@ jobs:
uses: pytorch/test-infra/.github/actions/setup-python@main
with:
python-version: ${{ inputs.python-version }}
pip-requirements-file: .github/requirements/pip-requirements-macOS.txt
pip-requirements-file: .ci/docker/requirements-ci.txt
- name: Install sccache (only for non-forked PRs, and pushes to trunk)
uses: nick-fields/retry@7152eba30c6575329ac0576536151aca5a72780e # v3.0.0

View File

@ -122,7 +122,7 @@ jobs:
uses: pytorch/test-infra/.github/actions/setup-python@main
with:
python-version: ${{ inputs.python-version }}
pip-requirements-file: .github/requirements/pip-requirements-macOS.txt
pip-requirements-file: .ci/docker/requirements-ci.txt
- name: Start monitoring script
id: monitor-script

View File

@ -84,9 +84,6 @@ jobs:
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
- name: Clean up leftover processes on non-ephemeral Windows runner
uses: pytorch/test-infra/.github/actions/cleanup-runner@main
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
with:

View File

@ -77,9 +77,6 @@ jobs:
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
- name: Clean up leftover processes on non-ephemeral Windows runner
uses: pytorch/test-infra/.github/actions/cleanup-runner@main
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
with:
@ -106,18 +103,6 @@ jobs:
with:
cuda-version: ${{ inputs.cuda-version }}
# TODO: Move to a requirements.txt file for windows
- name: Install pip dependencies
uses: nick-fields/retry@7152eba30c6575329ac0576536151aca5a72780e # v3.0.0
with:
shell: bash
timeout_minutes: 5
max_attempts: 5
retry_wait_seconds: 30
command: |
set -eu
python3 -m pip install 'xdoctest>=1.1.0'
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
@ -272,15 +257,6 @@ jobs:
shell: bash
run: python3 .github/scripts/parse_ref.py
- name: Uninstall PyTorch
if: always()
continue-on-error: true
shell: bash
run: |
# This step removes PyTorch installed by the test to give a clean slate
# to the next job
python3 -mpip uninstall -y torch
- name: Teardown Windows
uses: ./.github/actions/teardown-win
if: always()

60
.github/workflows/b200-symm-mem.yml vendored Normal file
View File

@ -0,0 +1,60 @@
name: Limited CI for symmetric memory tests on B200
on:
pull_request:
paths:
- .github/workflows/b200-symm-mem.yml
workflow_dispatch:
push:
tags:
- ciflow/b200-symm-mem/*
schedule:
- cron: 22 8 * * * # about 1:22am PDT
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-cuda12_8-py3_10-gcc11-sm100-build-symm:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'
test-matrix: |
{ include: [
{ config: "b200-symm-mem", shard: 1, num_shards: 1, runner: "linux.dgx.b200.8" },
]}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-sm100-test:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda12_8-py3_10-gcc11-sm100-build-symm
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm100-build-symm.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm100-build-symm.outputs.test-matrix }}
aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
secrets: inherit

View File

@ -36,7 +36,7 @@ jobs:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
tag: ["cuda12.6", "cuda12.8", "cuda12.9", "cuda13.0", "rocm6.3", "rocm6.4", "cpu"]
tag: ["cuda12.6", "cuda12.8", "cuda12.9", "cuda13.0", "rocm6.4", "rocm7.0", "cpu"]
steps:
- name: Build docker image
uses: pytorch/pytorch/.github/actions/binary-docker-build@main

View File

@ -52,8 +52,8 @@ jobs:
{ tag: "cuda12.9" },
{ tag: "cuda12.8" },
{ tag: "cuda12.6" },
{ tag: "rocm6.3" },
{ tag: "rocm6.4" },
{ tag: "rocm7.0" },
{ tag: "cpu" },
]
steps:

View File

@ -34,7 +34,7 @@ jobs:
id-token: write
strategy:
matrix:
rocm_version: ["64", "63"]
rocm_version: ["70", "64"]
steps:
- name: Checkout PyTorch
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -52,11 +52,10 @@ jobs:
{ name: "manylinuxaarch64-builder", tag: "cuda13.0", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.8", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.6", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm6.3", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm6.4", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm7.0", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cpu", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28_aarch64-builder", tag: "cpu-aarch64", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxcxx11-abi-builder", tag: "cpu-cxx11-abi", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "xpu", runner: "linux.9xlarge.ephemeral" },
]
runs-on: ${{ needs.get-label-type.outputs.label-type }}${{ matrix.runner }}

View File

@ -55,7 +55,7 @@ jobs:
docker-image: ["pytorch/manylinux2_28-builder:cpu"]
include:
- device: "rocm"
rocm_version: "6.4"
rocm_version: "7.0"
runs_on: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge"
- device: "cuda"
rocm_version: ""

View File

@ -59,7 +59,6 @@ jobs:
pytorch-linux-jammy-py3.13-clang12,
pytorch-linux-jammy-rocm-n-py3,
pytorch-linux-noble-rocm-n-py3,
pytorch-linux-noble-rocm-alpha-py3,
pytorch-linux-jammy-rocm-n-py3-benchmarks,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-clang12,
pytorch-linux-jammy-py3.10-gcc11,

View File

@ -316,121 +316,6 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_3-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.3
GPU_ARCH_VERSION: "6.3"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.3
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
timeout-minutes: 300
build_name: libtorch-rocm6_3-shared-with-deps-release
build_environment: linux-binary-libtorch
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm6_3-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-rocm6_3-shared-with-deps-release-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.3
GPU_ARCH_VERSION: "6.3"
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.3
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-rocm6_3-shared-with-deps-release
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: libtorch-cxx11-builder
custom-tag-prefix: rocm6.3
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm6_3-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm6_3-shared-with-deps-release-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.3
GPU_ARCH_VERSION: "6.3"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.3
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-rocm6_3-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_4-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -545,3 +430,118 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm7_0-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm7.0
GPU_ARCH_VERSION: "7.0"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
timeout-minutes: 300
build_name: libtorch-rocm7_0-shared-with-deps-release
build_environment: linux-binary-libtorch
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm7_0-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-rocm7_0-shared-with-deps-release-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm7.0
GPU_ARCH_VERSION: "7.0"
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-rocm7_0-shared-with-deps-release
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: libtorch-cxx11-builder
custom-tag-prefix: rocm7.0
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm7_0-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm7_0-shared-with-deps-release-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm7.0
GPU_ARCH_VERSION: "7.0"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-rocm7_0-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml

View File

@ -1,87 +0,0 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-binary-libtorch-release
on:
push:
branches:
- main
tags:
- 'ciflow/trunk/*'
workflow_dispatch:
permissions:
id-token: write
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-binary-libtorch-release
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-libtorch-release-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
libtorch-cpu-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cpu
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: libtorch-cpu-shared-with-deps-release
build_environment: linux-binary-libtorch-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-cpu-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cpu-shared-with-deps-release-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cpu
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-cpu-shared-with-deps-release
build_environment: linux-binary-libtorch-release
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}

View File

@ -1,88 +0,0 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-binary-manywheel
on:
push:
branches:
- main
tags:
- 'ciflow/trunk/*'
workflow_dispatch:
permissions:
id-token: write
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-binary-manywheel
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-manywheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
manywheel-py3_12-cuda13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu130
GPU_ARCH_VERSION: "13.0"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda13.0
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda13_0-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_12-cuda13_0-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu130
GPU_ARCH_VERSION: "13.0"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda13.0
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda13_0
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8+ builds need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}

File diff suppressed because it is too large Load Diff

View File

@ -1,136 +0,0 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-binary-manywheel-rocm
on:
push:
branches:
- main
tags:
- 'ciflow/binaries/*'
- 'ciflow/binaries_wheel/*'
- 'ciflow/rocm/*'
workflow_dispatch:
permissions:
id-token: write
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-binary-manywheel-rocm
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-manywheel-rocm-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
manywheel-py3_10-rocm6_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: "6.4"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
timeout-minutes: 300
build_name: manywheel-py3_10-rocm6_4
build_environment: linux-binary-manywheel-rocm
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-rocm6_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_10-rocm6_4-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: "6.4"
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.10"
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: manywheel-py3_10-rocm6_4
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: manylinux2_28-builder
custom-tag-prefix: rocm6.4
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm

View File

@ -1,261 +0,0 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/windows_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: windows-binary-libtorch-debug
on:
push:
branches:
- main
workflow_dispatch:
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
AWS_DEFAULT_REGION: us-east-1
BUILD_ENVIRONMENT: windows-binary-libtorch-debug
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
OS: windows
concurrency:
group: windows-binary-libtorch-debug-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
libtorch-cpu-shared-with-deps-debug-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
timeout-minutes: 360
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v4.4.0
if: always()
with:
name: libtorch-cpu-shared-with-deps-debug
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cpu-shared-with-deps-debug-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cpu-shared-with-deps-debug-build
- get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
timeout-minutes: 360
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-cpu-shared-with-deps-debug
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1

View File

@ -1,261 +0,0 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/windows_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: windows-binary-libtorch-release
on:
push:
branches:
- main
workflow_dispatch:
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
AWS_DEFAULT_REGION: us-east-1
BUILD_ENVIRONMENT: windows-binary-libtorch-release
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
OS: windows
concurrency:
group: windows-binary-libtorch-release-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
libtorch-cpu-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
timeout-minutes: 360
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v4.4.0
if: always()
with:
name: libtorch-cpu-shared-with-deps-release
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cpu-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cpu-shared-with-deps-release-build
- get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
timeout-minutes: 360
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-cpu-shared-with-deps-release
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1

View File

@ -106,6 +106,16 @@ jobs:
{ config: "dynamic_aot_eager_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_aot_eager_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_aot_eager_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
]}
secrets: inherit

View File

@ -0,0 +1,100 @@
name: operator_microbenchmark
on:
push:
tags:
- ciflow/op-benchmark/*
workflow_dispatch:
schedule:
# Run at 06:00 UTC everyday
- cron: 0 6 * * *
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
# H100 A100 runners
opmicrobenchmark-build:
if: github.repository_owner == 'pytorch'
name: opmicrobenchmark-build
uses: ./.github/workflows/_linux-build.yml
with:
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '8.0 9.0'
test-matrix: |
{ include: [
{ config: "operator_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.aws.h100" },
{ config: "operator_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.aws.a100" },
]}
secrets: inherit
opmicrobenchmark-test:
name: opmicrobenchmark-test
uses: ./.github/workflows/_linux-test.yml
needs: opmicrobenchmark-build
with:
timeout-minutes: 500
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image: ${{ needs.opmicrobenchmark-build.outputs.docker-image }}
test-matrix: ${{ needs.opmicrobenchmark-build.outputs.test-matrix }}
secrets: inherit
# B200 runner
opmicrobenchmark-build-b200:
if: github.repository_owner == 'pytorch'
name: opmicrobenchmark-build-b200
uses: ./.github/workflows/_linux-build.yml
with:
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm100
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'
test-matrix: |
{ include: [
{ config: "operator_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.dgx.b200" },
]}
secrets: inherit
opmicrobenchmark-test-b200:
name: opmicrobenchmark-test-b200
uses: ./.github/workflows/_linux-test.yml
needs: opmicrobenchmark-build-b200
with:
timeout-minutes: 500
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm100
docker-image: ${{ needs.opmicrobenchmark-build-b200.outputs.docker-image }}
test-matrix: ${{ needs.opmicrobenchmark-build-b200.outputs.test-matrix }}
aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
secrets: inherit
# ROCM MI300 runner
opmicrobenchmark-build-rocm:
if: github.repository_owner == 'pytorch'
name: opmicrobenchmark-build-rocm
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-rocm-py3_10
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3-benchmarks
test-matrix: |
{ include: [
{ config: "operator_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.1" },
]}
secrets: inherit
opmicrobenchmark-test-rocm:
name: opmicrobenchmark-test-rocm
uses: ./.github/workflows/_rocm-test.yml
needs: opmicrobenchmark-build-rocm
with:
timeout-minutes: 500
build-environment: linux-jammy-rocm-py3_10
docker-image: ${{ needs.opmicrobenchmark-build-rocm.outputs.docker-image }}
test-matrix: ${{ needs.opmicrobenchmark-build-rocm.outputs.test-matrix }}
secrets: inherit

View File

@ -59,13 +59,14 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda12.4-py3.10-gcc11
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11
cuda-arch-list: 7.5
test-matrix: |
{ include: [
{ config: "legacy_nvidia_driver", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
]}
secrets: inherit
@ -112,13 +113,13 @@ jobs:
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda12_8-py3_9-gcc9-build:
name: linux-jammy-cuda12.8-py3.9-gcc9
linux-jammy-cuda12_8-py3_10-gcc9-build:
name: linux-jammy-cuda12.8-py3.10-gcc9
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-py3.9-gcc9
build-environment: linux-jammy-cuda12.8-py3.10-gcc9
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
@ -128,14 +129,14 @@ jobs:
]}
secrets: inherit
linux-jammy-cuda12_8-py3_9-gcc9-test:
name: linux-jammy-cuda12.8-py3.9-gcc9
linux-jammy-cuda12_8-py3_10-gcc9-test:
name: linux-jammy-cuda12.8-py3.10-gcc9
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-cuda12_8-py3_9-gcc9-build
needs: linux-jammy-cuda12_8-py3_10-gcc9-build
with:
build-environment: linux-jammy-cuda12.8-py3.9-gcc9
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_9-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_9-gcc9-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.8-py3.10-gcc9
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc9-debug-build:
@ -212,9 +213,9 @@ jobs:
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
test-matrix: |
{ include: [
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.rocm.gpu.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.rocm.gpu.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.rocm.gpu.mi250.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.rocm.gpu.mi250.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu.mi250.4", owners: ["module:rocm", "oncall:distributed"] },
]}
secrets: inherit

View File

@ -127,8 +127,6 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build with asan
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan
@ -343,14 +341,14 @@ jobs:
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-inductor-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-xpu-n-py3_9-build:
name: linux-jammy-xpu-n-py3.9
linux-jammy-xpu-n-py3_10-build:
name: linux-jammy-xpu-n-py3.10
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
sync-tag: linux-xpu-n-build
runner_prefix: ${{ needs.get-label-type.outputs.label-type }}
build-environment: linux-jammy-xpu-n-py3.9
build-environment: linux-jammy-xpu-n-py3.10
docker-image-name: ci-image:pytorch-linux-jammy-xpu-n-py3
test-matrix: |
{ include: [

View File

@ -38,7 +38,7 @@ jobs:
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-noble-rocm-py3.12-mi355
docker-image-name: ci-image:pytorch-linux-noble-rocm-alpha-py3
docker-image-name: ci-image:pytorch-linux-noble-rocm-n-py3
sync-tag: rocm-build
test-matrix: |
{ include: [

View File

@ -140,8 +140,6 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build with asan
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan

View File

@ -61,3 +61,15 @@ jobs:
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-sm90-FA3-ABI-stable-test:
name: linux-jammy-cuda12_8-py3_10-gcc11-sm90-FA3-ABI-stable-test
uses: ./.github/workflows/_linux-test-stable-fa3.yml
needs:
- linux-jammy-cuda12_8-py3_10-gcc11-sm90-build
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build.outputs.docker-image }}
timeout-minutes: 30
s3-bucket: gha-artifacts
secrets: inherit

View File

@ -160,9 +160,10 @@ jobs:
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
]}
secrets: inherit
@ -189,41 +190,6 @@ jobs:
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
secrets: inherit
linux-jammy-rocm-py3_10-build:
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/trunk') }}
name: linux-jammy-rocm-py3.10
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-rocm-py3.10
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
sync-tag: rocm-build
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "default", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "distributed", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.4" },
]}
secrets: inherit
linux-jammy-rocm-py3_10-test:
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/trunk') }}
permissions:
id-token: write
contents: read
name: linux-jammy-rocm-py3.10
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-jammy-rocm-py3_10-build
- target-determination
with:
build-environment: linux-jammy-rocm-py3.10
docker-image: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.test-matrix }}
tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor distributed/test_c10d_common distributed/test_c10d_nccl"
secrets: inherit
inductor-build:
name: inductor-build
uses: ./.github/workflows/_linux-build.yml

View File

@ -23,7 +23,7 @@ jobs:
with:
repository: pytorch/pytorch
stable-branch: viable/strict
requires: '[\"pull\", \"trunk\", \"lint\", \"^linux-binary-manywheel$\", \"^linux-binary-libtorch-release$\", \"linux-aarch64\"]'
requires: '[\"pull\", \"trunk\", \"lint\", \"linux-aarch64\"]'
secret-bot-token: ${{ secrets.MERGEBOT_TOKEN }}
clickhouse-url: ${{ secrets.CLICKHOUSE_URL }}
clickhouse-username: ${{ secrets.CLICKHOUSE_VIABLESTRICT_USERNAME }}
@ -48,4 +48,7 @@ jobs:
echo "{\"sha\": \"${LATEST_SHA}\", \"repository\":\"pytorch/pytorch\", \"timestamp\": ${TIME}}" > "/tmp/${LATEST_SHA}.json"
pip install awscli==1.29.40
aws s3 cp "/tmp/${LATEST_SHA}.json" "s3://ossci-raw-job-status/stable_pushes/pytorch/pytorch/${LATEST_SHA}.json"
# Push new viable/strict tag
cd pytorch/pytorch
git push origin "${LATEST_SHA}:refs/tags/viable/strict/${TIME}"
fi

View File

@ -42,7 +42,7 @@ jobs:
build-external-packages: "vllm"
build-environment: linux-jammy-cuda12.8-py3.12-gcc11
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm
cuda-arch-list: '8.0;8.9;9.0'
cuda-arch-list: '8.0 8.9 9.0'
runner: linux.24xlarge.memory
test-matrix: |
{ include: [

View File

@ -18,6 +18,7 @@ exclude_patterns = [
'torch/_inductor/autoheuristic/artifacts/**',
'scripts/**',
'test/generated_type_hints_smoketest.py',
'test/test_torchfuzz_repros.py',
# CPython tests
'test/dynamo/cpython/**',
# Tests from the NumPy test suite
@ -27,6 +28,7 @@ exclude_patterns = [
'torch/lib/**',
'venv/**',
'**/*.pyi',
"tools/experimental/dynamic_shapes/torchfuzz/**",
'tools/test/test_selective_build.py',
]
command = [
@ -1260,6 +1262,7 @@ exclude_patterns = [
'test/test_masked.py',
'test/test_maskedtensor.py',
'test/test_matmul_cuda.py',
'test/test_scaled_matmul_cuda.py',
'test/test_meta.py',
'test/test_metal.py',
'test/test_mkl_verbose.py',
@ -1453,7 +1456,7 @@ init_command = [
'--dry-run={{DRYRUN}}',
'usort==1.0.8.post1',
'isort==6.0.1',
'ruff==0.12.9', # sync with RUFF
'ruff==0.13.1', # sync with RUFF
]
is_formatter = true
@ -1570,6 +1573,7 @@ exclude_patterns = [
'torch/_inductor/fx_passes/serialized_patterns/**',
'torch/_inductor/autoheuristic/artifacts/**',
'test/dynamo/cpython/**',
'test/test_torchfuzz_repros.py',
'scripts/**',
'third_party/**',
'fb/**',
@ -1587,7 +1591,7 @@ init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'ruff==0.12.9', # sync with PYFMT
'ruff==0.13.1', # sync with PYFMT
]
is_formatter = true

View File

@ -442,7 +442,7 @@ if(WIN32)
message(
WARNING
"Libuv is not installed in current conda env. Set USE_DISTRIBUTED to OFF. "
"Please run command 'conda install -c conda-forge libuv=1.39' to install libuv."
"Please run command 'conda install -c conda-forge libuv=1.51' to install libuv."
)
else()
set(ENV{libuv_ROOT} ${libuv_tmp_LIBRARY}/../../)
@ -888,23 +888,28 @@ cmake_dependent_option(
"(USE_CUDA AND NOT MSVC) OR USE_ROCM"
OFF)
IF(USE_ROCM AND "gfx942" IN_LIST PYTORCH_ROCM_ARCH)
message(WARNING "Setting USE_FBGEMM_GENAI for gfx942 to ON by default, doing ROCM build")
set(USE_FBGEMM_GENAI_DEFAULT ON)
elseif(USE_CUDA AND "$ENV{TORCH_CUDA_ARCH_LIST}" MATCHES "10.0" AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8 AND NOT WIN32)
message(STATUS "Setting USE_FBGEMM_GENAI to ON by default , doing CUDA build for SM100a")
set(USE_FBGEMM_GENAI_DEFAULT ON)
else()
set(USE_FBGEMM_GENAI_DEFAULT OFF)
endif()
cmake_dependent_option(
USE_FBGEMM_GENAI
"Whether to build FBGEMM GenAI quantized GEMM kernels.\
Will be disabled if not supported by the platform"
ON
"USE_ROCM"
${USE_FBGEMM_GENAI_DEFAULT}
"(USE_CUDA AND NOT MSVC) OR USE_ROCM"
OFF)
IF(USE_FBGEMM_GENAI AND USE_ROCM AND NOT "gfx942" IN_LIST PYTORCH_ROCM_ARCH)
message(WARNING "Unsupported ROCM arch for FBGEMM GenAI, will set USE_FBGEMM_GENAI to OFF")
set(USE_FBGEMM_GENAI off)
endif()
# Set USE_FBGEMM_GENAI to ON for CUDA build on SM100.
if(USE_CUDA AND "$ENV{TORCH_CUDA_ARCH_LIST}" MATCHES "10.0" AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8 AND NOT WIN32)
message(STATUS "Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a")
set(USE_FBGEMM_GENAI ON)
endif()
# CAVEAT: Again, Flash Attention2 will error while building for sm52 while Mem

View File

@ -181,15 +181,15 @@ caffe2/utils/hip @jeffdaily @jithunnair-amd
/torch/csrc/jit/python/init.cpp @mikaylagawarecki
# CUDA and CUDA math libraries
aten/src/ATen/cuda/ @eqy @syed-ahmed
aten/src/ATen/cudnn/ @eqy @syed-ahmed
aten/src/ATen/native/cuda/ @eqy @syed-ahmed
aten/src/ATen/native/cudnn/ @eqy @syed-ahmed
c10/cuda @eqy @syed-ahmed
torch/cuda/ @eqy @syed-ahmed
torch/csrc/cuda/ @eqy @syed-ahmed
torch/backends/cuda/ @eqy @syed-ahmed
torch/backends/cudnn/ @eqy @syed-ahmed
aten/src/ATen/cuda/ @eqy @syed-ahmed @Aidyn-A
aten/src/ATen/cudnn/ @eqy @syed-ahmed @Aidyn-A
aten/src/ATen/native/cuda/ @eqy @syed-ahmed @Aidyn-A
aten/src/ATen/native/cudnn/ @eqy @syed-ahmed @Aidyn-A
c10/cuda @eqy @syed-ahmed @Aidyn-A
torch/cuda/ @eqy @syed-ahmed @Aidyn-A
torch/csrc/cuda/ @eqy @syed-ahmed @Aidyn-A
torch/backends/cuda/ @eqy @syed-ahmed @Aidyn-A
torch/backends/cudnn/ @eqy @syed-ahmed @Aidyn-A
# PyTree utilities
/torch/utils/_pytree.py @XuehaiPan

View File

@ -81,7 +81,7 @@ git remote add upstream git@github.com:pytorch/pytorch.git
make setup-env
# Or run `make setup-env-cuda` for pre-built CUDA binaries
# Or run `make setup-env-rocm` for pre-built ROCm binaries
source venv/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
### Tips and Debugging
@ -182,28 +182,36 @@ You can use this script to check out a new nightly branch with the following:
```bash
./tools/nightly.py checkout -b my-nightly-branch
source venv/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
To install the nightly binaries built with CUDA, you can pass in the flag `--cuda`:
```bash
./tools/nightly.py checkout -b my-nightly-branch --cuda
source venv/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
To install the nightly binaries built with ROCm, you can pass in the flag `--rocm`:
```bash
./tools/nightly.py checkout -b my-nightly-branch --rocm
source venv/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
You can also use this tool to pull the nightly commits into the current branch:
```bash
./tools/nightly.py pull -p my-env
source my-env/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
./tools/nightly.py pull
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
To create the virtual environment with a specific Python interpreter, you can
pass in the `--python` argument:
```bash
./tools/nightly.py --python /path/to/python3.12
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
Pulling will recreate a fresh virtual environment and reinstall the development

View File

@ -50,11 +50,10 @@ RUN git submodule update --init --recursive
FROM conda as conda-installs
ARG PYTHON_VERSION=3.11
ARG CUDA_PATH=cu121
ARG CUDA_CHANNEL=nvidia
ARG INSTALL_CHANNEL=whl/nightly
# Automatically set by buildx
RUN /opt/conda/bin/conda update -y -n base -c defaults conda
RUN /opt/conda/bin/conda install -y python=${PYTHON_VERSION}
# pinning version of conda here see: https://github.com/pytorch/pytorch/issues/164574
RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -y python=${PYTHON_VERSION} conda=25.7.0
ARG TARGETPLATFORM

Some files were not shown because too many files have changed in this diff Show More