Compare commits

..

425 Commits

Author SHA1 Message Date
0f49e915a9 rebase 2025-05-30 14:30:12 -07:00
2f1217f944 benchmarking 2025-05-30 14:27:37 -07:00
e0bf01e87b Script for consolidation of sharded safetensor files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154743

Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)
ghstack-source-id: 287291639
2025-05-30 14:18:51 -07:00
3b5ae0e9fc Support re-sharding for safetensors checkpoints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154519

This change will add the ability to support re-sharding for hf safetensors checkpoints.
This is done by adding more metadata when saving each file. This metadata captures the size and offset of the saved shard. This can be used to re-shard on load by using this information to create the chunks belonging to TensorStorageMetadata class.

Differential Revision: [D75226344](https://our.internmc.facebook.com/intern/diff/D75226344/)
ghstack-source-id: 286572125
2025-05-30 10:40:32 -07:00
5f5f654a3e Updates to HFStorageReader to use TensorStorageMetadata instead of BytesStorageMetadata
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154518

As we prepare to support re-sharding, the current approach of using BytesStorageMetadata to read safetenstors won't work anymore. Before, we didn't need to read the metadata of the safetensors file from its header because we were just loading the contents of the file directly into tensors with safetensor.load() that would handle the metadata and deserialization. But now, in preparation of handling re-sharding, we need to read the metadata directly from the header of the safetensors file and store it directly in TensorStorageMetadata objects so that we can perform re-sharding. Re-sharding won't currently work, as we need extra metadata to be stored on each save, so that will be added in a subsequent PR.
In addition this PR adds an integration test in addition to the unit tests.
It also removes the HfFileSystem import because that's only needed if users are using HfFileSystem, but we want to support any backend.
ghstack-source-id: 286649070
@exported-using-ghexport

Differential Revision: [D74891998](https://our.internmc.facebook.com/intern/diff/D74891998/)
2025-05-30 10:40:30 -07:00
21931cbbc6 Changes to HFStorageWriter to support saving shards of tensors
As we move towards supporting saving partial tensors natively with HFStorageWriter, there are some simple changes that need to be made to make this happen.
- The current approach for distributed writes is that every rank has full tensors, but we split up the writing of these full tensors across all available ranks. We're removing this logic that was in the HFSavePlanner and instead assuming that every rank has a shard and saving every rank's local state
    -  as a result we can probably remove the HFSavePlanner, but keeping it as a placeholder for now

- the current naming of files doesn't support shards as its in the format "model-00001-of-00004.safetensors", but if every rank is writing the same file names they will overwrite eachother, so this adds a shard-00001 prefix, so that the rank files don't overwrite eachother
- don't save the metadata file models.safetensors.index.json if sharding is enabled. This file expects a 1 to 1 ratio between tensor and filename, but this doesn't make sense in the sharded saving approach, so we can just get rid of this file
- make the "fqn_to_file_index" map optional. This is to describe which files to save which tensors in, but if users don't want to provide this, we can just save all the tensors to one file. If they run into issues, they can choose how to split up their tensors to be more friendly with 5GB HF remote storage file size soft limit.

Differential Revision: [D75099862](https://our.internmc.facebook.com/intern/diff/D75099862/)

ghstack-source-id: 286648122
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154742
2025-05-30 10:40:28 -07:00
ef4d57329b [CAG] Support for call_module at copy paste aot bwd graph (#153827)
Support for `call_module` in `copy_paste_aot_backward_graph` added recently with PT2.7

Problem is being observed with HPU backend in example repro due to creating fused modules.

```
import torch

device = 'cpu' #'hpu'
backend = 'inductor' #'hpu_backend'

def fn(t1):
    t1 = t1 * 1
    t1_grad = torch.ones_like(t1, device=device)
    t1.backward(t1_grad, retain_graph=True)
    return t1

t1 = torch.ones(1, requires_grad=True, device=device) #.squeeze()
compiled_fn = torch.compile(fn, backend=backend)
result = compiled_fn(t1)

with torch._dynamo.compiled_autograd._enable(torch.compile(backend=backend)):
    result_grad = torch.ones_like(result, device=device)
    result.backward(result_grad)

print(f'{result_grad=}')
print(f'{t1.grad=}')
```

With this change I'm getting same results like on CPU, however I'm facing below problem when running with scalar (t1 tensor after squeeze):
`torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function getitem>(*(FakeTensor(..., device='hpu:0', size=()), 0), **{}): got IndexError('invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number')`

While on CPU there's following warning and None returned:
`repro.py:23: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at pytorch/build/aten/src/ATen/core/TensorBody.h:489.)
  print(f'{t1.grad=}')
t1.grad=None`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153827
Approved by: https://github.com/xmfan
2025-05-28 22:52:40 +00:00
d62a33c002 [ez] add docblock for _expandsums (#154397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154397
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400, #154398, #154396, #154399
2025-05-28 22:43:26 +00:00
0c00e32632 [ez] add docblock for _eval_is_non_overlapping_and_dense (#154399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154399
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400, #154398, #154396
2025-05-28 22:40:03 +00:00
0f56318152 [precompile] Add Exception type PackageError for unsupported precompile features. (#154430)
Summary:
Today when guard serialization fails, dynamo will raise an internal error like:

```
torch._dynamo.exc.InternalTorchDynamoError: RuntimeError: CLOSURE_MATCH guard cannot be serialized.
```

Adding a dedicated PackageError type to surface the error more clearly.

Test Plan: CI

Differential Revision: D75452124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154430
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-05-28 22:34:51 +00:00
11129d9317 Add new ops in fallback ops (#154251)
Fixes #ISSUE_NUMBER

## Background

Task: [T222738229](https://www.internalfb.com/intern/tasks/?t=222738229)

It's the first starter task on the project **_Enabling TorchNative Standalone on Whisper_**.  We are using cshim to create a layer of abstraction between _**libtorch**_ and **_AOTInductor generated artifacts_**.

So we needed to add an entry in the cshim for every API surface in libtorch. And we only care about operators that AOTInductor does not handle. And for this task, we only wanted to add it for the following ops.

## What I've done?

4 new fallback ops are added that show up in the Whisper model. (torchgen/aoti/fallback_ops.py)

- aten.permute (default)
- aten.squueze (dim)
- aten.abs (default)
- aten.hann_window (default)

Then I ran the below command to generate new header C shim header files. As it says [here](7e86a7c015/torchgen/gen.py (L2424-L2436%20for%20details))
`python torchgen/gen.py --update-aoti-c-shim`

Then, `python setup.py develop` to rebuild PyTorch

## Testing

Also 4 new tests have been added on test/inductor/test_aot_inductor.py

- test_proxy_executor_permute
- test_proxy_executor_abs
- test_proxy_executor_squeeze
- test_proxy_executor_hann

I ran these commands to test it (inside local pytorch root folder):

`python test/inductor/test_aot_inductor.py -k test_proxy_executor_permute`
`python test/inductor/test_aot_inductor.py -k test_proxy_executor_abs`
`python test/inductor/test_aot_inductor.py -k test_proxy_executor_squeeze`
`python test/inductor/test_aot_inductor.py -k test_proxy_executor_hann`

## NOTE:
I didn't see any order between the tests inside _test/inductor/test_aot_inductor.py_. That's why, I added new tests just after the test given in the example.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154251
Approved by: https://github.com/angelayi
2025-05-28 22:11:07 +00:00
d2f506cae8 [ca] disable ca for functorch grad and run all HOO tests (#154147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154147
Approved by: https://github.com/zou3519
ghstack dependencies: #154133
2025-05-28 22:06:13 +00:00
857f21631d [ca] fix hop_db tests (#154133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154133
Approved by: https://github.com/zou3519
2025-05-28 22:06:13 +00:00
ed348e7026 Add docblock for TrackedFake (#154396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154396
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400, #154398
2025-05-28 21:19:49 +00:00
d311b79c12 add docblock for _fast_expand (#154398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154398
Approved by: https://github.com/laithsakka
ghstack dependencies: #154400
2025-05-28 21:16:47 +00:00
e7318b863d [ez] add docblock to cast_symbool_to_symint_guardless (#154400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154400
Approved by: https://github.com/laithsakka
2025-05-28 21:11:53 +00:00
f6dcc45c44 [Kineto x Insight] Add device to activity type map in pytorch (#154253)
Summary: Update the device to ActivityType Map in pytorch. Need to be exported to github

Test Plan:
Run the ondemand e2e test and insight profiler is triggered during profiling
P1819539581: https://www.internalfb.com/intern/paste/P1819539581/
{F1978519960}

Insight profiler is not enabled when mtia_insight not specifying in config
{F1978527200}

Reviewed By: fenypatel99

Differential Revision: D75246621

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154253
Approved by: https://github.com/Skylion007
2025-05-28 20:36:19 +00:00
e25074d462 [c10d][CI] Change expected return code in Sandcastle for Nan tests (#154441)
Fixing internal error caused by #153167.

`skip_but_pass_in_sandcastle_if` returns exit code 0. But `test_nan_assert` expects exit code -6.
So we'd need to set expected return code conditional on `IS_SANDCASTLE`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154441
Approved by: https://github.com/fduwjj, https://github.com/nWEIdia
ghstack dependencies: #153167
2025-05-28 20:35:52 +00:00
c381103fd7 Fix the logic of set_cpu_affinity (#154503)
While investigating https://github.com/pytorch/pytorch/issues/152566, I found two issues with how the cpu affinity is set in benchmark job:

* The current logic doesn't work with cgroups slice, the mechanism behind multi-tenant runner:
    * Using `lscpu` returns all CPUs and not the available ones from cgroups.  On the other hand, `nproc` works correctly.  For example, on H100, `lscpu` returns 192 CPUs while `nproc` returns 24 (192 / 8)
    * Setting `taskset -c 0-N` blindly is wrong because CPU 0 is only available to the the first tenant, aka alice.  For example, running `taskset -c 0 ls` on any other tenants will fail. To fix this, the ID of available CPUs can be fetched by calling `os.sched_getaffinity(0)`.
* The last bug is `taskset` works with logical CPUs https://www.man7.org/linux/man-pages/man1/taskset.1.html, so using the result from `test_inductor_get_core_number` is also wrong because that function returns the number of physical CPUs.

### Testing

CPU benchmark jobs look ok

* [aarch64 torch.compile benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2021%20May%202025%2016%3A40%3A28%20GMT&stopTime=Wed%2C%2028%20May%202025%2016%3A40%3A28%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=fix-cpu-affinity-cgroups&lCommit=9a6288e083d650c470623f5fe136b1060824021c&rBranch=main&rCommit=dec5ab8d984b8a608140911351d877b9ddb141c2)
* [x86 micro benchmark](https://hud.pytorch.org/benchmark/llms?startTime=Wed%2C%2021%20May%202025%2016%3A41%3A26%20GMT&stopTime=Wed%2C%2028%20May%202025%2016%3A41%3A26%20GMT&granularity=day&lBranch=main&lCommit=c1b7dbc52aaa49f4cd147bbe5935110a4a10e3e3&rBranch=refs/tags/ciflow/inductor-micro-benchmark-cpu-x86/154503&rCommit=9a6288e083d650c470623f5fe136b1060824021c&repoName=pytorch%2Fpytorch&benchmarkName=&modelName=All%20Models&backendName=All%20Backends&modeName=All%20Modes&dtypeName=All%20DType&deviceName=cpu%20(x86_64)&archName=All%20Platforms)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154503
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-28 19:38:20 +00:00
66f53889d5 [nativert] port semaphore to c10 util (#153504)
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff adds a simple semaphore interface into c10 until c++20 where we get counting_semaphore

gonna need a oss build export to take a look at this...

Test Plan: CI

Differential Revision: D73882656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153504
Approved by: https://github.com/zhxchen17
2025-05-28 19:17:30 +00:00
24980d2641 [ROCm][CI] Update build-environment for mi300 workflows (#153134)
so their test times are tracked separately in https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/test-times.json. Currently, both MI200 and MI300 test times get combined into the same key `linux-focal-rocm-py3.10`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153134
Approved by: https://github.com/huydhn
2025-05-28 19:04:53 +00:00
d4ab8e74f3 Revert "Fix the Problems About Defining Static Variable in Inline Function (#147095)"
This reverts commit c6fc11af760d4ad1f01cc699a3c6488ab5f41770.

Reverted https://github.com/pytorch/pytorch/pull/147095 on behalf of https://github.com/izaitsevfb due to still fails to link internally at meta ([comment](https://github.com/pytorch/pytorch/pull/147095#issuecomment-2917221575))
2025-05-28 18:22:39 +00:00
1c7a70b483 [AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155)
Differential Revision: [D75253009](https://our.internmc.facebook.com/intern/diff/D75253009/)

In general, we want to cache the cutlass kernels.

Also saw an error saying .o not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154155
Approved by: https://github.com/chenyang78
2025-05-28 17:35:19 +00:00
66ac724b56 pyfmt lint torch/_export/passes/replace_view_ops_with_view_copy_ops_pass.py (#154488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154488
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483, #154484, #154485, #154487
2025-05-28 17:07:15 +00:00
dfe0f48123 pyfmt lint torch/_export/serde/schema.py (#154487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154487
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483, #154484, #154485
2025-05-28 17:07:15 +00:00
92cebed1bd pyfmt lint torch/_export/serde/serialize.py (#154485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154485
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483, #154484
2025-05-28 17:07:07 +00:00
b4fe5ca58a pymft lint torch/utils/weak.py (#154484)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154484
Approved by: https://github.com/Skylion007
ghstack dependencies: #154483
2025-05-28 17:06:58 +00:00
4de1b25df7 Remove empty files from execlude lint rule (#154483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154483
Approved by: https://github.com/Skylion007
2025-05-28 17:06:50 +00:00
70539308ac [dynamo] updating gb_type names for uniqueness (#154452)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154452
Approved by: https://github.com/williamwen42
2025-05-28 16:54:10 +00:00
e313152a33 SDPA fix memory efficient attention for large batch dim (#154029)
Fixes #146704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154029
Approved by: https://github.com/ngimel
2025-05-28 16:53:53 +00:00
3b38989b5f Remove MemPoolContext (#154042)
Removes MemPoolContext from custom user mempools. The ground truth for which pool should be used is in graph_pools active pool, and MemPoolContext just introduced an opportunity for the pool pointed to by MemPoolContext and active pool in graph_pools to go out of sync (see all the asserts in the code to make sure that happens, and yet it still could happen in a multithread scenario, see my recent PRs (#153990).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154042
Approved by: https://github.com/albanD, https://github.com/syed-ahmed
2025-05-28 16:35:48 +00:00
d23aa7e182 Add deprecation warning for torch.ao.quantization (#153892)
Summary:
att

Test Plan:
(ao) $ PYTHONWARNINGS='default' python
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.ao.quantization.quantizer.xnnpack_quantizer import XNNPACKQuantizer
printing warning
*/anaconda3/envs/ao/lib/python3.10/site-packages/torch/ao/quantization/__init__.py:36: DeprecationWarning: torch.ao.quantization is deprecated. Plan is to
1. Remove eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead
2. Remove fx graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx, torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e)
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e)
see https://dev-discuss.pytorch.org/t/torch-ao-quantization-migration-plan/2810 for more details
  warnings.warn(
>>> a = XNNPACKQuantizer()
*/anaconda3/envs/ao/lib/python3.10/site-packages/torch/ao/quantization/quantizer/xnnpack_quantizer.py:281: DeprecationWarning: XNNPACKQuantizer is deprecated! Please use xnnpack quantizer in ExecuTorch (https://github.com/pytorch/executorch/tree/main/backends/xnnpack/quantizer) instead
  warnings.warn(f"{self.__class__.__name__} is deprecated! Please use xnnpack quantizer in ExecuTorch (https://github.com/pytorch/executorch/tree/main/backends/xnnpack/quantizer) instead", DeprecationWarning)
>>>

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153892
Approved by: https://github.com/Skylion007
2025-05-28 16:25:30 +00:00
5bf74753f6 [precompile] Prune local scope variables for guard serialization. (#154431)
Summary: Prune unused local objects from serialized local scope if they are not used in guard reconstruction. This is helpful when a user program takes things like local callable functions or the function call is recursive.

Test Plan:
test/dynamo/test_guard_serialization.py -k test_function_locals

Before pruning locals:
```
state = GuardsState(output_graph=OutputGraphGuardsState(local_scope={'x': tensor([ 0.0461,  0.4024, -1.0115]), 'g': <function ...aints=None, _guards=<torch._guards.GuardsSet object at 0x7fbccc7e9fc0>, _aotautograd_guards=[]), shape_code_parts=None)

    def pickle_guards_state(state: GuardsState) -> bytes:
        buf = io.BytesIO()
        pickler = GuardsStatePickler(buf)
        try:
            pickler.dump(state)
        except AttributeError as e:
>           raise torch._dynamo.exc.PackageError(str(e)) from e
E           torch._dynamo.exc.PackageError: Can't pickle local object 'TestGuardSerialization.test_function_locals.<locals>.foo'
```
After the diff
```
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D75452123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154431
Approved by: https://github.com/jansel
2025-05-28 16:03:02 +00:00
9db7bcb3fe [Dynamo] Introduce hook receiving list of traced code objects (#153622)
This PR:
* Expands `Hooks` with a new, optional `frame_traced_fn` field. It should be a callable receiving the list of traced code objects
* Maintains a list of `traced_code` objects in the `TracingContext` of an `OutputGraph`
    *  Whenever an `inline_call()` is encountered, the corresponding code object is added to this set
    * `OutputGraph`'s associated `f_code` is added to the list just before the hook is called

I believe use of this hook should enable the source code hashing that vLLM does in a better way than monkey-patching `inline_call()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153622
Approved by: https://github.com/jansel
2025-05-28 15:40:09 +00:00
476e0a643a [ez] add docblock for ShapeGuardPythonPrinter (#154403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154403
Approved by: https://github.com/jingsh
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383, #154384, #154385, #154402
2025-05-28 14:17:17 +00:00
473a93eb58 [ez] add docblock for _ShapeGuardPrinter (#154402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154402
Approved by: https://github.com/jingsh
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383, #154384, #154385
2025-05-28 14:13:22 +00:00
35a473e364 [ez] add docblock for guard_scalar (#154385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154385
Approved by: https://github.com/jingsh
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383, #154384
2025-05-28 14:10:07 +00:00
ee4f433963 [ez] add docblock for _guard_or (#154384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154384
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381, #154383
2025-05-28 14:06:29 +00:00
e9b97d19b1 [ez] Make SymNodeImpl comments less misleading (#154480)
As discussed in DS workchat, it's easy for users to get confused by
guarding for these supposedly non-guarding methods. The TL;DR is in the
case of non pythonic compilers like XLA, we actually do guard. I've
updated the comments accordingly to reduce confusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154480
Approved by: https://github.com/pianpwk, https://github.com/Skylion007
2025-05-28 14:04:32 +00:00
a75e3a02be Revert "[dynamo, nested graph breaks] small fixes to resume function generation (#151056)"
This reverts commit 28e7aa21c522e92ea01a62dfdc5e3b74e398d8f0.

Reverted https://github.com/pytorch/pytorch/pull/151056 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
9603d6382d Revert "[dynamo, nested graph breaks] refactor codegen to minimize NULL codegen'ing (#153510)"
This reverts commit 1fe98429222a8ba5e16dd9381f50a8fb90edcf0e.

Reverted https://github.com/pytorch/pytorch/pull/153510 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
5fd7004dc9 Revert "[dynamo, nested graph breaks] remove block stack graph break in output_graph (#153772)"
This reverts commit 9a66c30bdc563c62375e5030c4103b67515b8dac.

Reverted https://github.com/pytorch/pytorch/pull/153772 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
e86439ed5b Revert "[dynamo, nested graph breaks] add skip_frame debugging function (#153773)"
This reverts commit aadf9eae63c4793e1107a3b21ede30e5289eeaca.

Reverted https://github.com/pytorch/pytorch/pull/153773 on behalf of https://github.com/malfet due to Not sure which one, but it broke test_error_messages, see 203b0efd63/1 ([comment](https://github.com/pytorch/pytorch/pull/151056#issuecomment-2916437433))
2025-05-28 13:53:50 +00:00
203b0efd63 [PP] Allow unused kwargs in ZB path (#153498)
This is a fix when an unused kwarg is in the PP stage forward, we try to call `torch.autograd.grad()` and update its gradients when it shouldn't have gradients. Leading to this error:

```
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/stage.py", line 613, in
[rank3]:[rank3]: return lambda: stage_backward_input(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/distributed/pipelining/_backward.py", line 199, in stage_backward_input
[rank3]:[rank3]: dinputs = torch.autograd.grad(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/init.py", line 503, in grad
[rank3]:[rank3]: result = _engine_run_backward(
[rank3]:[rank3]: File "/data/users/howardhuang/pytorch/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank3]:[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]:[rank3]: RuntimeError: One of the differentiated Tensors does not require grad
```

related issues: https://github.com/pytorch/torchtitan/issues/1188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153498
Approved by: https://github.com/kwen2501
2025-05-28 13:34:04 +00:00
cf7451f279 Fix signature of torch.sparse_coo_tensor() (#152681)
Fixes #145371

@pearu Searched all and find these codes, wondering whether is the root cause of the issue, could you have a review? Thanks a lot!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152681
Approved by: https://github.com/Skylion007, https://github.com/pearu, https://github.com/nikitaved
2025-05-28 13:16:41 +00:00
f58143b945 [Typing] Refactor torch.types.Device in torch/cuda/__init__.py (#153447)
Part of: #152952
Follow up: #153027

Here is the definition of `torch.types.Device`:

ab997d9ff5/torch/types.py (L74)

So `Optional[Union[Device, int]]` is equivalent to `torch.types.Device`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153447
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-28 10:09:31 +00:00
fdc339003b Revert "[AOTI] Support multi-arch when using package_cpp_only (#154414)"
This reverts commit a84d8c4a1cc515db274366537afd0b1492800c2d.

Reverted https://github.com/pytorch/pytorch/pull/154414 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm trunk job ([comment](https://github.com/pytorch/pytorch/pull/154414#issuecomment-2915597821))
2025-05-28 09:23:31 +00:00
853958f82c Fix: Replacements can cause runtime assertions to disappear and can cause invalid inductor code. (#153661)
Lets explore firs a couple of problem related to replacements and runtime assertions.

#### example problem 1
if we have a runtime assertions that u0==s0, u0 is an input coming from mark_unbacked. A replacement u0=s0 will be added, the function f(u0, s0) will become f(s0, s0), this leads to the assert  not being inserted during insert_deferred_runtime_asserts.
The reason is that insert_deferred_runtime_asserts logic insert each assertion once all its inputs are seen,  but u0 will never be seen. Same thing can happen when we defer assertion on backed i.e: s0==s2 ..etc.

#### example problem 2
Consider u0==s0, where u0 is coming from a call to .item() Imagine later on that a specialization happens to s0 to become 2. In that case s0 as input wont be seen during insert_deferred_runtime_asserts and the assertion won't be inserted in the graph. Worse, Inductor will generate some code that refers to s0 in the cpp wrapper while it does not exist, causing a failure.
internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1669766396994898/

## The solution :
Runtime assertions insertion loops depend on detecting that the symbols that are used in the runtime assertions are seen, note that those symbols are either graph inputs or generated in the graph from data dependent ops like .item().

The issues above happen when symbols are graph inputs, in order to force the symbols to exist in the graph and to be seen by the runtime assertions we do not do replacements on placeholders expressions during codegen and during runtime assertions insertion.

This should not have performance overhead, since we already optimized the graph with replacements, the only effect is not mistakenly dropping graph inputs that are used in runtime assertions.
I added extended testing. A solo unrelated follow up that I noticed, is that we might want to rename unbacked symbols in runtime assertions when we do unbacked renaming, but that's a different issue.

Other approaches that did not work :
#### ban replacements on unbacked.
1. does not work when we defer runtime assertions on backed ex: s0==s1. we could also ban such replacements
but problem 2 becomes more problematic.
2. Problem two, it affects the quality of reasoning ! in a bad way.

#### Apply specialization on runtime assertions before codegen .
1. Can fix some issues, but may lead also to runtime assertions becoming NOPs.
2. Does not fix the issue if not inserting runtime assertions during insert_deferred_runtime_asserts due to input not being detected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153661
Approved by: https://github.com/jansel
2025-05-28 09:08:05 +00:00
aadf9eae63 [dynamo, nested graph breaks] add skip_frame debugging function (#153773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153773
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510, #153772
2025-05-28 08:54:09 +00:00
9a66c30bdc [dynamo, nested graph breaks] remove block stack graph break in output_graph (#153772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153772
Approved by: https://github.com/jansel
ghstack dependencies: #151056, #153510
2025-05-28 08:54:09 +00:00
1fe9842922 [dynamo, nested graph breaks] refactor codegen to minimize NULL codegen'ing (#153510)
Stop codegening NULLs that we need to pop later. Some output_graph.py changes to prepare for nested graph break support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153510
Approved by: https://github.com/jansel
ghstack dependencies: #151056
2025-05-28 08:54:09 +00:00
28e7aa21c5 [dynamo, nested graph breaks] small fixes to resume function generation (#151056)
Old: ~pack resume function stack + locals into a list: we need to be able to pass frame stack+locals in lists to hand off to nested functions in the future, so we implement this part first.~

We are no longer doing this right now since GraphModule/guard variable naming gets messed up. Going forward, our approach will be to keep the top frame unpacked, but pack the rest of the contents of other frames in a list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151056
Approved by: https://github.com/jansel
2025-05-28 08:54:09 +00:00
cyy
9d04c0f352 Remove outdated CUDA 11 conditions (#154313)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154313
Approved by: https://github.com/eqy
2025-05-28 08:44:58 +00:00
1d9b7dd2d1 [PGO] suggest dynamic whitelist for recompilations (#154189)
suggests `TORCH_COMPILE_DYNAMIC_SOURCES` based off tensor size changes in PGO code state, including parameters.

Closing #153442 which took the dynamo guards approach.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154189
Approved by: https://github.com/bobrenjc93
2025-05-28 07:11:43 +00:00
fe760b6636 [ez] add docblock for _free_unbacked_symbols_with_path (#154383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154383
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380, #154381
2025-05-28 05:53:50 +00:00
8e25ba6963 [ez] add docblock for find_symbol_binding_fx_nodes (#154381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154381
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379, #154380
2025-05-28 05:44:26 +00:00
08c29deb5f [ez] add docblock to is_symbol_binding_fx_node (#154380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154380
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378, #154379
2025-05-28 05:41:19 +00:00
07405a6cff [ez] add docblock for free_unbacked_symbols (#154379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154379
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377, #154378
2025-05-28 05:37:25 +00:00
dcdaef5206 [ez] add docblock for free_symbols (#154378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154378
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405, #154377
2025-05-28 05:34:25 +00:00
abc3fdc7ac [ez] add docblock for _iterate_exprs (#154377)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154377
Approved by: https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404, #154405
2025-05-28 05:28:58 +00:00
ab6cb85cb0 [ez] add docblock for _remove_effect_token_unbacked_bindings (#154405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154405
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
ghstack dependencies: #154374, #154375, #154376, #154386, #154401, #154404
2025-05-28 05:16:14 +00:00
fde8f6a8b8 [ez] add docblock for _suggest_torch_checks (#154404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154404
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375, #154376, #154386, #154401
2025-05-28 04:45:55 +00:00
b82fb57b67 [ez] add docblock for RuntimeAssert (#154401)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154401
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375, #154376, #154386
2025-05-28 04:43:22 +00:00
d64b4a91dd [ez] remove unused function _constrain_symbol_range (#154386)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154386
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375, #154376
2025-05-28 04:41:00 +00:00
ef90cc18d7 use definitely_contiguous for _prim_elementwise_meta short circuit (#153441)
*
This verifies that the check short circuit is not material. https://github.com/pytorch/pytorch/pull/153431
```
import torch
from torch.export import Dim, export
class MyModel(torch.nn.Module):
    def forward(self, x, ranks):
        first_k = ranks.max().item()
        torch._check_is_size(first_k)
        narrow = x.narrow(dim = 1, start = 0, length = first_k)
        lt = narrow < narrow.size(1)
        return lt
inps = (
    torch.randn((8, 16), device="cuda"),
    torch.arange(8, device="cuda", dtype=torch.int8)
)
spec = {
    "x": (Dim.AUTO, Dim.AUTO),
    "ranks": (Dim.AUTO,),
}
traced = export(MyModel(), inps, dynamic_shapes=spec, strict=True).run_decompositions({})

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153441
Approved by: https://github.com/jansel
ghstack dependencies: #153432
2025-05-28 03:41:26 +00:00
39df901b2a introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)
when a tensor has unbacked symbols it can be general enough to represent both contiguous and non contiguous tensors.
in that case we cant really evaluate is_contiguous. In many places in the code base, we check for is_contiguous to take a fast path. but the general path usually works for both contiguous and not contiguous in that case we probably want
to use definitely _contiguous API.

This is appleid for reshape in this PR and also to  tensor meta data computation, the meta data now will have an attribute that says that its contiguous when its always contiguous. We would store that only if definitely _contiguous is true  now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-28 03:41:26 +00:00
54f1f29fed [dynamo] dynamic gb_type -> static gb_type (#154435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154435
Approved by: https://github.com/williamwen42
2025-05-28 03:14:26 +00:00
f12ce4e36b [Intel GPU] convolution fusion at XPU backend (#154202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154202
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/etaf
ghstack dependencies: #140365
2025-05-28 03:14:18 +00:00
c6fc11af76 Fix the Problems About Defining Static Variable in Inline Function (#147095)
Refer to https://github.com/pytorch/pytorch/issues/125465 for more informations

- Remove unused header files
- Move the inline function that defines the static variable to .cc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147095
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-05-28 02:47:16 +00:00
855eff8e8e Don't CSE unbacked nodes (#154387)
* #154440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154387
Approved by: https://github.com/TroyGarden
ghstack dependencies: #154440
2025-05-28 02:21:56 +00:00
919a1a17e3 [ez] Replace misleading implementations with NYI (#154440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154440
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
2025-05-28 02:21:56 +00:00
a84d8c4a1c [AOTI] Support multi-arch when using package_cpp_only (#154414)
Summary: Add support of multi_arch_kernel_binary in the package_cpp_only mode. More specifically, generate specific cmake targets to compile .ptx to .fatbin and embed them in the final shared library or binary.

Differential Revision: [D75452096](https://our.internmc.facebook.com/intern/diff/D75452096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154414
Approved by: https://github.com/angelayi
ghstack dependencies: #154412, #154413
2025-05-28 01:20:38 +00:00
cde82d25b7 [AOTI] Add a multi_arch_kernel_binary option (#154413)
Summary: CUDA can support multi-arch with the fatbin format. Add this multi_arch_kernel_binary option, so the compiled model binary can run across different GPU archs.

Differential Revision: [D75452094](https://our.internmc.facebook.com/intern/diff/D75452094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154413
Approved by: https://github.com/angelayi
ghstack dependencies: #154412
2025-05-28 01:20:38 +00:00
4d8f3d537a [AOTI][refactor] Rename embed_cubin to embed_kernel_binary (#154412)
Summary: Rename as it is not CUDA specific.

Differential Revision: [D75452095](https://our.internmc.facebook.com/intern/diff/D75452095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154412
Approved by: https://github.com/angelayi
2025-05-28 01:20:28 +00:00
e79790e14b [ez] add docblock for _sympy_from_args (#154376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154376
Approved by: https://github.com/Skylion007
ghstack dependencies: #154374, #154375
2025-05-27 23:43:13 +00:00
fe082c5ffe Move inductor workflows focal (ubuntu 20.04) -> jammy (ubuntu 22.04) (#154153)
Trying to fix: https://github.com/pytorch/pytorch/issues/154157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154153
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/nike4949, https://github.com/cyyever
2025-05-27 23:16:21 +00:00
3f10c9d8af Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be ran correctly (#153245)
Fixes #153239

Replaced custom decorator with the common one. Although the better way to skip the whole suite would be to add it to skip list in run_test.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153245
Approved by: https://github.com/jeffdaily
2025-05-27 23:10:25 +00:00
4b39832412 [CI] Update torchbench pin (#154453)
Related to https://github.com/pytorch/pytorch/issues/154446
Pins torchbench repo to a https://github.com/pytorch/benchmark/pull/2620 which pins opacus to ``1.5.3`` version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154453
Approved by: https://github.com/wdvr, https://github.com/malfet
2025-05-27 23:08:42 +00:00
247ea229ba Create issue template: Release highlight for proposed Feature (#154125)
Authors: @anitakat @atalman

This is related to: https://github.com/pytorch/pytorch/issues/152134 . Adding RFC template for feature submissions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154125
Approved by: https://github.com/anitakat, https://github.com/ZainRizvi, https://github.com/albanD
2025-05-27 22:45:21 +00:00
53affa273b [MTIA Aten Backend][1.3/n] Migrate remaining view ops, which all need explicit register in native_functions.yaml (#154337)
See context in D75266206.

This diff/PR migrates all the remaining view ops, which all need changes in `native_functions.yaml` and thus need to be exported to PR.

Ops covered by this diff:
- _reshape_alias
- unfold

internal: Also delete the entire aten_mtia_view_ops.cpp file, and update corresponding build config.

Differential Revision: [D75385411](https://our.internmc.facebook.com/intern/diff/D75385411/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154337
Approved by: https://github.com/nautsimon
ghstack dependencies: #154336
2025-05-27 22:18:12 +00:00
eaf355cb11 [BE] Clean up unused parameter input in AOTIModel (#154276)
Summary: As title

Test Plan: CI

Differential Revision: D74691763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154276
Approved by: https://github.com/Skylion007
2025-05-27 22:17:32 +00:00
241f8dc84d Revert "Remove outdated CUDA 11 conditions (#154313)"
This reverts commit 3936e6141c09dab94f21e4fdab7bea4bddf62ac2.

Reverted https://github.com/pytorch/pytorch/pull/154313 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/154313#issuecomment-2914230005))
2025-05-27 21:54:41 +00:00
6be829535f [ROCm] Improve vectorized elementwise kernel performance in MI300X (#153634)
* Use non-temporal loads to improve the vectorized elementwise kernel performance on MI300
* Use thread_work_size of 8 or 16 for vectorized elementwise kernel

Co-author: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153634
Approved by: https://github.com/jeffdaily
2025-05-27 20:49:32 +00:00
555fc05868 Revert "[Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)"
This reverts commit 6169ca0b65bcb382faa1a2287278b3717c18f127.

Reverted https://github.com/pytorch/pytorch/pull/154371 on behalf of https://github.com/benjaminglass1 due to Appears to have broken main ([comment](https://github.com/pytorch/pytorch/pull/154371#issuecomment-2913975736))
2025-05-27 20:39:09 +00:00
7359705232 Add CPython tests for unittest (#150788)
Tests:
* test_assertions.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150788
Approved by: https://github.com/williamwen42
2025-05-27 20:26:17 +00:00
12fc06d267 Add CPython complex tests (#152015)
Tests:
* test_complex.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152015
Approved by: https://github.com/williamwen42
2025-05-27 20:24:28 +00:00
3b218e56dc Add CPython tests for iter/sort (#150797)
Tests:
* test_iter.py
* test_sort.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150797
Approved by: https://github.com/williamwen42
2025-05-27 20:22:34 +00:00
4fd8a54a41 [ez] add docblock for is_accessor_node (#154375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154375
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
ghstack dependencies: #154374
2025-05-27 19:47:32 +00:00
b367e5f6a6 [ROCm][Windows] Fix building torch 2.8 wheel with ROCm (added hipblasLt and rocblas directories) (#153144)
Since rocblas.dll and hipblaslt.dll are copied to torch/lib, rocblas and hipblaslt directories are needed to be stored there too (otherwise we have an error after wheel installation while searching for files in rocblas/library and hipblaslt/library which doesn't exist). This PR fixes this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153144
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-27 19:40:28 +00:00
fa6ca59079 Revert "Move inductor workflows focal (ubuntu 20.04) -> jammy (ubuntu 22.04) (#154153)"
This reverts commit 2bd95f3a1f07132aa00f5c438c5228866d7dd1f8.

Reverted https://github.com/pytorch/pytorch/pull/154153 on behalf of https://github.com/malfet due to Broke inductor tests, see b8452e55bc/1 ([comment](https://github.com/pytorch/pytorch/pull/154153#issuecomment-2913738047))
2025-05-27 19:23:28 +00:00
6169ca0b65 [Inductor] Improve typing, and prepare for ABI-compatible AOTI C-shim dispatching (#154371)
Prepares for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:

1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions

As far as I can tell, this PR should be functionally neutral. One argument was removed from a `cpp_wrapper` public API, but that argument was unused, and only had a single callsite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
2025-05-27 19:17:41 +00:00
75bbd4989c [dynamo] Support using symint from dispatcher-style tensor subclass (#154130)
Fixes #146932.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154130
Approved by: https://github.com/laithsakka
2025-05-27 19:05:46 +00:00
8c0f07f944 Revert "[ROCm] Improve vectorized elementwise kernel performance in MI300X (#153634)"
This reverts commit 0d4de7872ac019abbd6e87b3391b2276d9d05bd4.

Reverted https://github.com/pytorch/pytorch/pull/153634 on behalf of https://github.com/malfet due to Broke inductor jobs, see b8452e55bc/1 ([comment](https://github.com/pytorch/pytorch/pull/153634#issuecomment-2913619071))
2025-05-27 19:02:59 +00:00
b8452e55bc [Kineto x Insight] Update Kineto submodule (#154426)
Summary: We add a new ActivityType::MTIA_INSIGHT in 20f652846f

Test Plan: CI

Differential Revision: D75454945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154426
Approved by: https://github.com/Skylion007
2025-05-27 18:29:29 +00:00
5075df6fee Make torch importable if compiled without TensorPipe (#154382)
By delaying the import/hiding it behind `torch.distributed.rpc.is_tensorpipe_avaiable()` check
Fixes https://github.com/pytorch/pytorch/issues/154300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154382
Approved by: https://github.com/Skylion007
ghstack dependencies: #154325
2025-05-27 18:13:38 +00:00
f472ea63bb [BE] Fix typos in SyntaxError description (#154436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154436
Approved by: https://github.com/seemethere, https://github.com/wdvr, https://github.com/ZainRizvi
2025-05-27 18:08:58 +00:00
cfbd99fdfd [Pytorch] Add option to CPU Blas GEMM to avoid output downcast (#154012)
Summary:
Dot product for a single output element consists of 3 steps (both input vectors have elements of type scalar_t):
1. elementwise vector multiply (scalar_t x scalar_t -> opmath_t)
2. vector reduction to a scalar value (opmath_t -> opmath_t)
3. optional downcast if opmath_t != out_t

The current blas kernel performs steps 1 and 2 correctly, but for step 3, it will always downcast to scalar_t even when opmath_t == output_t (and then do an upcast back to output_t), which results in precision loss. This diff fixes the precision loss in the BlasKernel

Test Plan: Attention CI passes

Differential Revision: D75023858

topic: not user facing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154012
Approved by: https://github.com/Valentine233, https://github.com/aditew01, https://github.com/CaoE, https://github.com/drisspg
2025-05-27 17:43:21 +00:00
1ca082d9a1 [ez] Rewrite comment to be more friendly to non haskellers (#151421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151421
Approved by: https://github.com/aorenste
2025-05-27 17:32:34 +00:00
70fbd5e08c [ez] Add docblock for resolve_unbacked_bindings (#154374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154374
Approved by: https://github.com/Skylion007, https://github.com/pianpwk
2025-05-27 17:05:49 +00:00
2560c1f3f0 add sticky cache pgo (#154418)
It's a reland of https://github.com/pytorch/pytorch/pull/154394 that hit some mergebot bug

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154418
Approved by: https://github.com/malfet
2025-05-27 16:40:18 +00:00
514409d032 update torchvision pin (#154255)
Fixes #153985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154255
Approved by: https://github.com/desertfire
2025-05-27 16:15:25 +00:00
0ddfd1ed43 [Intel GPU] Enable mkdnn._linear_pointwise at XPU backend (#140365)
# Motivation

This PR is intended to add post-op fusion support fo Linear. The liner-pointwise fusion is expected to be used in graph mode like torch.compile. The FusionUtils.cpp file defines a utilization APIs for generating primitive attribute. This APIs would also be used for conv-pointwise fusion, which is in #140372.

# Validation
```bash
   python test/xpu/test_fusion.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140365
Approved by: https://github.com/etaf, https://github.com/guangyey, https://github.com/EikanWang
2025-05-27 15:57:15 +00:00
0d4de7872a [ROCm] Improve vectorized elementwise kernel performance in MI300X (#153634)
* Use non-temporal loads to improve the vectorized elementwise kernel performance on MI300
* Use thread_work_size of 8 or 16 for vectorized elementwise kernel

Co-author: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153634
Approved by: https://github.com/jeffdaily
2025-05-27 15:38:43 +00:00
7ae204c3b6 [BE][CI][Easy] Run lintrunner on generated .pyi stub files (#150732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150732
Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/aorenste
2025-05-27 14:58:02 +00:00
0a7eef140b Add torch.Tensor._make_wrapper_subclass to torch/_C/__init__.pyi (#154022)
Fixes #153790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154022
Approved by: https://github.com/Skylion007
2025-05-27 14:10:00 +00:00
d88699308f [CI][MacOS] Move more dependencies to pypi (#154309)
Hopefully last step before all Mac build/tests could be switched away from conda
- Update cmake version from 3.22 to 3.25 as 3.22 from pipy seems  to be unusable with python-3.12
- Add `--plat-name macosx_11_0_arm64` to setup.py command
- Remove `codesign` for cmake workaround (that was probably never really necessary
-  Install `libpng` and `jpeg-turbo` when building torchbench and build torchaudio without OpenMP (to be fixed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154309
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-27 13:49:40 +00:00
11a51a11af Revert "introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)"
This reverts commit 5c6d7caaaa08f134c3b17ce032cb014527b53417.

Reverted https://github.com/pytorch/pytorch/pull/153432 on behalf of https://github.com/malfet due to Looks like it broke flex attention tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=g6.4xlarge&mergeEphemeralLF=true ([comment](https://github.com/pytorch/pytorch/pull/153432#issuecomment-2912562570))
2025-05-27 13:42:34 +00:00
c52a002a22 Add getDeviceProperties api to torch mtia device (#153577)
topic: not user facing

Test Plan: Internal benchmark.

Differential Revision: D74256550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153577
Approved by: https://github.com/nautsimon
2025-05-27 11:55:58 +00:00
2bd95f3a1f Move inductor workflows focal (ubuntu 20.04) -> jammy (ubuntu 22.04) (#154153)
Trying to fix: https://github.com/pytorch/pytorch/issues/154157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154153
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/nike4949, https://github.com/cyyever
2025-05-27 11:53:47 +00:00
6f86c1ce1d Add pyrefly.toml (#154144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154144
Approved by: https://github.com/Skylion007
2025-05-27 10:16:30 +00:00
5c6d7caaaa introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)
when a tensor has unbacked symbols it can be general enough to represent both contiguous and non contiguous tensors.
in that case we cant really evaluate is_contiguous. In many places in the code base, we check for is_contiguous to take a fast path. but the general path usually works for both contiguous and not contiguous in that case we probably want
to use definitely _contiguous API.

This is appleid for reshape in this PR and also to  tensor meta data computation, the meta data now will have an attribute that says that its contiguous when its always contiguous. We would store that only if definitely _contiguous is true  now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-27 08:54:31 +00:00
dec5ab8d98 [MTIA Aten Backend][1.2/n] Migrate as_strided to in-tree, and add unit tests (#154336)
See context in PR https://github.com/pytorch/pytorch/pull/153670

This diff migrate as_strided to in-tree. I found it's not covered by `test_kernel_eager_ci` so also adding unit tests.

Differential Revision: [D75385404](https://our.internmc.facebook.com/intern/diff/D75385404/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154336
Approved by: https://github.com/nautsimon
2025-05-27 06:32:38 +00:00
ef6306e1c6 Revert "[executorch hash update] update the pinned executorch hash (#153436)"
This reverts commit 8d6139b8d8a75aab5ead4262ff59d48615ebee31.

Reverted https://github.com/pytorch/pytorch/pull/153436 on behalf of https://github.com/malfet due to Broke ET sanity ([comment](https://github.com/pytorch/pytorch/pull/153436#issuecomment-2911206795))
2025-05-27 06:02:14 +00:00
870133b2a0 Use get_device_context in aoti runtime for XPU directly (#154360)
# Motivation
Reuse [c10::xpu::get_device_context](1bebe0424e/c10/xpu/XPUFunctions.h (L27)) directly to reduce overhead, as it returns a cached `sycl::context` managed by PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154360
Approved by: https://github.com/EikanWang
2025-05-27 05:55:59 +00:00
8d89cdceb6 fix a compilation issue when TORCH_XPU_ARCH_LIST is an empty string (#153604)
When `XPU_ARCH_FLAGS` is an empty string, compilation will fail on `C10_STRINGIZE(XPU_ARCH_FLAGS)` in file `torch/csrc/xpu/Module.cpp` on Windows.
This PR fixes this issue by setting `TORCH_XPU_ARCH_LIST` to `""` to avoid an empty string conversion in `C10_STRINGIZE()` when compiling without an AOT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153604
Approved by: https://github.com/guangyey, https://github.com/EikanWang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-05-27 05:26:46 +00:00
8d6139b8d8 [executorch hash update] update the pinned executorch hash (#153436)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153436
Approved by: https://github.com/pytorchbot
2025-05-27 04:54:46 +00:00
912af9b2c2 update torchbench pin (#154256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154256
Approved by: https://github.com/huydhn
2025-05-27 04:40:54 +00:00
8d319607a7 [CPU][Brgemm] add s8s8 GEMM microkernel API (#154358)
As the title. `u8s8` and `u8u8` have already been supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154358
Approved by: https://github.com/leslie-fang-intel, https://github.com/Skylion007, https://github.com/Valentine233
2025-05-27 03:47:56 +00:00
f8010e7b93 [nativert] Move file_util to pytorch core (#153162)
Summary: fbcode//sigmoid/core/common -> fbcode//caffe2/torch/nativert/common

Test Plan: Github CI

Differential Revision: D74328089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153162
Approved by: https://github.com/zhxchen17
2025-05-27 03:42:47 +00:00
70d12ccc3f [Torch] Fix error message formatting in fp8 comparison logic (#153647)
Summary: Using `\` includes all the tabs from the next line in the error message.

Test Plan: Nothing, simply error message fixing

Reviewed By: exclamaforte

Differential Revision: D74539234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153647
Approved by: https://github.com/exclamaforte
2025-05-27 02:51:05 +00:00
100ec0b34a [Inductor] Allow passing in custom lowering dict to register_lowering() (#154344)
This PR adds support for passing in custom lowering dict to `register_lowering()`, which allows systems (e.g. Helion, https://github.com/pytorch-labs/helion/pull/80) that uses Inductor to maintain their own lowering dict instead of using the Inductor global `lowerings` dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154344
Approved by: https://github.com/jansel
2025-05-27 01:35:26 +00:00
cyy
3936e6141c Remove outdated CUDA 11 conditions (#154313)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154313
Approved by: https://github.com/eqy
2025-05-27 00:30:14 +00:00
6006352ed3 [BE] Refactor manywheel build scripts (#154372)
1. Remove `CentOS Linux` cases, since its deprecated
2. Remove logic for old CUDA versions
3. Remove logic for `CUDA_VERSION=12.4` since we deprecated CUDA 12.4 support
4. Simplify setting `USE_CUFILE=1` - only supported on CUDA 12.6 and 12.8 builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154372
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-05-26 23:17:23 +00:00
b643076e4e Revert "[executorch hash update] update the pinned executorch hash (#153436)"
This reverts commit b6868f290e4882f9c895b1c9476327974288eaba.

Reverted https://github.com/pytorch/pytorch/pull/153436 on behalf of https://github.com/malfet due to Broke ET sanity ([comment](https://github.com/pytorch/pytorch/pull/153436#issuecomment-2910692163))
2025-05-26 22:09:16 +00:00
aaf5cc13d9 [EASY] use guard_or_false instead of gso in Meta converter (#154234)
this was added in https://github.com/pytorch/pytorch/pull/141659, the current change keep the same intention
"i do not want to fail here if i cant tell if the size is zero or not"
i am not familiar enough in the code to know if we need here a runtime check, but looking at current
impl it seems that guard_or_false is appropriate to match current behaviour  and have the same effect of guard_size_oblivious here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154234
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154154, #154164, #154167, #154172
2025-05-26 21:59:52 +00:00
e33feddb72 used guard_or_false instead of guard_size_oblivious inside maybe_reduce (#154172)
This was added in https://github.com/pytorch/pytorch/pull/119562
the idea in this loop seems to be the following.
```
    if (TORCH_GUARD_SIZE_OBLIVIOUS(size.sym_eq(1))) {
      // NB: we could short circuit this once needs_reduce is true but there's
      // no point since the reduction function will guard on this anyway
      if (!c10::guard_or_false(size.sym_eq(target), __FILE__, __LINE__)) {
        needs_reduce = true;
      }
    } else {
      if (!size.sym_eq(target).expect_true(__FILE__, __LINE__)) {
        fail();
      }
    }
  ```
  1. if we know size ==1
       1.1 : if we know for sure size == target --> no reduce needed.
       1.2 : we know for sure that size != target  --> we do reduction.
       1.3: we could not tell if size == target or not --> we do reduction.
  2. if we do now know if size ==1 or not
     we add a runtime assertions that size ==target and we fail at runtime if size is not equal to target.

We could have simplified 1.1 and always do reduction under 1.1, since doing 1.3 without runtime checks implies
that it is safe, but i feel the reason could be perf here? idk.

anyway using TORCH_GUARD_OR_FALSE instead of TORCH_GUARD_SIZE_OBLIVIOUS here is appropriate.
there is really no clear reason for size oblivious reasoning. or for this logic not to apply when size is not size like
size is always >=0 anyway. but bad reasoning can make us not able to infer that although we know its true here.

 python test/dynamo/test_misc.py -k test_validate_outputs_unbacked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154172
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154154, #154164, #154167
2025-05-26 21:59:52 +00:00
ab5137b048 used guard_or_false instead of guard_size_oblivious in is_int_or_symint (#154167)
This is a short circuit, that we should not fail on. Before this PR we would not fail on u0, u0+u1,
only if they are size like.  but we will fail on u0-u1.. etc for no need.
guard_or_false seems appropriate for that reason.

This was added in https://github.com/pytorch/pytorch/pull/122145 there was no unit tests for me to verify
why it was added, i could not repo using the associated issue , the example does not work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154167
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #154154, #154164
2025-05-26 21:59:45 +00:00
1da2cc52bc [EASY] remove guard_size_oblivious from is_nonzero proxy call check (#154164)
This was added in https://github.com/pytorch/pytorch/pull/149637,
torch._check can handle unbacked there is no need for size oblivious reasoning here.

Note this does not make is_nonzero unbacked friendly. but that is a different story.
I ran the test added in  https://github.com/pytorch/pytorch/pull/149637 for veirfication.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154164
Approved by: https://github.com/aorenste, https://github.com/bobrenjc93
ghstack dependencies: #154154
2025-05-26 21:59:29 +00:00
f8a2998832 [EASY] used guard_or_false instead of guard_sizes_oblivious in pointless_view (#154154)
The change is direct and clear, the optimizations removes pointless_view iff it all sizes are the same if not we want to return false, there is no need for size oblivious  reasoning.

this was added in https://github.com/pytorch/pytorch/pull/139136, run existing tests that are added in that PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154154
Approved by: https://github.com/bobrenjc93
2025-05-26 21:59:21 +00:00
e89ee1e217 Pin almalinux version to 8.10-20250519 (#154367)
This PR pins Almalinux version to latest supported 8.10

This is related to: https://github.com/pytorch/pytorch/pull/154364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154367
Approved by: https://github.com/jeanschmidt, https://github.com/wdvr, https://github.com/malfet, https://github.com/huydhn
2025-05-26 20:08:20 +00:00
839c9c6156 Use property instead of ClassVar for Uniform.arg_constraints and Wishart.arg_constraints (#154361)
Fixes #154355

For these two distributions, the constraints depend on the actual values, and so `arg_constraints` cannot be a `ClassVar`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154361
Approved by: https://github.com/Skylion007
2025-05-26 17:48:28 +00:00
3f64502c98 Revert "Re-enable FakeTensor caching for SymInts (#152662)"
This reverts commit 7d11c61c26c596076613aa0111892f7cbccae32e.

Reverted https://github.com/pytorch/pytorch/pull/152662 on behalf of https://github.com/malfet due to Looks like it broke bunch of inductor tests, see 187d38185e/1 ([comment](https://github.com/pytorch/pytorch/pull/152662#issuecomment-2910293593))
2025-05-26 17:13:22 +00:00
187d38185e [cutlass backend] Do not raise hard error when re worker has cuda compilation error (#154173)
fbcode specific

Differential Revision: D75262641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154173
Approved by: https://github.com/bertmaher
2025-05-26 17:10:36 +00:00
f55f2f42a7 Add missing docstring for sym_ite (#154201)
`sym_ite` is listed in [the reference page](https://docs.pytorch.org/docs/stable/torch.html) and has no document.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154201
Approved by: https://github.com/Skylion007
2025-05-26 15:59:21 +00:00
02445ec8f0 Almalinux image, install glibc-langpack-en (#154364)
After update to: https://hub.docker.com/layers/amd64/almalinux/8/images/sha256-4f63eb966695df3c993deeacec7c73d87728e2ea66d3b48fed4b40cb547fa7c2

Started seeing warning: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
and random Segfaults when using python like:
https://github.com/pytorch/test-infra/actions/runs/15216565225/job/42901732536
```
+++ python -c 'import torch'
./check_binary.sh: line 258:  2276 Segmentation fault      (core dumped) python -c 'import torch'
```

Installing langpack does  resolve these issues: https://github.com/pytorch/test-infra/actions/runs/15256338815/job/42904808826#step:15:2311

Almalinux Docker build without setlocale warning:
https://github.com/pytorch/pytorch/actions/runs/15030284546/job/42240978131

Almalinux Docker build with setlocale warning:
https://github.com/pytorch/pytorch/actions/runs/15246391200/job/42873875745#step:3:7180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154364
Approved by: https://github.com/Skylion007, https://github.com/jeanschmidt
2025-05-26 15:56:42 +00:00
4b0ee3f4f2 [BE] Do not templetize unnnecessarily (#154305)
`${{ os.runner }}` would always evaluate to macOS for those files
And architecutre is always ARM64
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154305
Approved by: https://github.com/atalman
2025-05-26 15:00:48 +00:00
7ab4fae62a Fix s390x vectorization compilation in inductor (#153946)
Fix s390x vectorization compilation in inductor.

One of failing tests is
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_add_complex_cpu
but it is still disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153946
Approved by: https://github.com/malfet, https://github.com/jgong5
2025-05-26 12:54:25 +00:00
1bebe0424e Fix platform detection in MKLDNN CMake file (#142067)
When building PyTorch with `USE_XPU=True` and Clang,
the user sees misleading errors related to incorrect platform
detection that assumes that all users that are not using the GNU
compilers are on Windows. We can fix this by simply using CMake's
builtin platform detection variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142067
Approved by: https://github.com/EikanWang, https://github.com/min-jean-cho, https://github.com/guangyey
2025-05-26 06:09:37 +00:00
21e42c5d62 More descriptive error message for torch.nanmean() with complex dtypes (#153252)
Fixes #153132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153252
Approved by: https://github.com/colesbury
2025-05-26 05:42:57 +00:00
b6868f290e [executorch hash update] update the pinned executorch hash (#153436)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153436
Approved by: https://github.com/pytorchbot
2025-05-26 04:43:10 +00:00
7d11c61c26 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There has been a lot of dynamic shape fixes done this year and tests pass so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-26 04:17:56 +00:00
062387fb53 [SymmMem] Speed up tests (#153677)
Use `MultiProcContinousTest` to avoid re-create ProcessGroup in each test instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153677
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/ngimel
ghstack dependencies: #153653
2025-05-26 03:39:11 +00:00
8c16d0e404 [c10d] Add support for testing SIGABRT return (#153167)
`SIGABRT` is a common return by *negative* distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc.

These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.

Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
2025-05-26 00:56:05 +00:00
b04852e404 Fix deterministic indexing with broadcast (#154296)
Fixes #79987, now for real.
Also removed thrust sort path that was needed for cuda <=11.2 because we no longer support it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154296
Approved by: https://github.com/soumith
2025-05-25 21:14:50 +00:00
c3100067ae [ONNX] Update onnx to 1.18 (#153746)
Update onnx python package to 1.18.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153746
Approved by: https://github.com/titaiwangms, https://github.com/cyyever, https://github.com/malfet
2025-05-25 20:58:47 +00:00
43b2716e89 PYFMT lint grandfathered files 1 (#154261)
lint:
-  test/test_fake_tensor.py
-  test/test_flop_counter.py
- torch/_export/verifier.py

with same rules as other files, it was a night mare for me to update tests in one of the skipped files
with not being able to lint them locally like other files with lintrunner -a.
note that those file do have active dev and not old not touched files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154261
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-05-25 17:36:14 +00:00
5677ab9aab [BE] Correctly pass exceptions raised from rpc_init to CPython (#154325)
By decorating function body with `HANDLE_TH_ERRORS`

Partially addresses https://github.com/pytorch/pytorch/issues/154300

I.e. after that change, importing torch no longer crashes but returns a readable (and actionable exception)
```
>>> import torch
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    import torch
  File "/Users/malfet/git/pytorch/pytorch/torch/__init__.py", line 2134, in <module>
    from torch import _VF as _VF, functional as functional  # usort: skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/functional.py", line 8, in <module>
    import torch.nn.functional as F
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/__init__.py", line 8, in <module>
    from torch.nn.modules import *  # usort: skip # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/modules/__init__.py", line 2, in <module>
    from .linear import Bilinear, Identity, LazyLinear, Linear  # usort: skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/modules/linear.py", line 7, in <module>
    from torch.nn import functional as F, init
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/functional.py", line 11, in <module>
    from torch._jit_internal import (
    ...<5 lines>...
    )
  File "/Users/malfet/git/pytorch/pytorch/torch/_jit_internal.py", line 42, in <module>
    import torch.distributed.rpc
  File "/Users/malfet/git/pytorch/pytorch/torch/distributed/rpc/__init__.py", line 37, in <module>
    from torch._C._distributed_rpc import (  # noqa: F401
    ...<33 lines>...
    )
ImportError: cannot import name '_DEFAULT_NUM_WORKER_THREADS' from 'torch._C._distributed_rpc' (unknown location)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154325
Approved by: https://github.com/Skylion007
2025-05-25 17:01:45 +00:00
31ae07b5e7 [CI] Do not install libuv on MacOS (#154307)
It's tensorpipe submodule and is build from source
Same for `dataclasses` as it's needed only for python-3.6
And get rid of `nidia-ml-py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154307
Approved by: https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #154304
2025-05-25 15:30:38 +00:00
6968386385 [BE] Sort requirements files alphabetically (#154304)
Using `sort` tool
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154304
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-25 15:30:38 +00:00
ed27ee8355 Bump setuptools from 70.0.0 to 78.1.1 in /tools/build/bazel (#154075)
Bumps [setuptools](https://github.com/pypa/setuptools) from 70.0.0 to 78.1.1.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pypa/setuptools/blob/main/NEWS.rst">setuptools's changelog</a>.</em></p>
<blockquote>
<h1>v78.1.1</h1>
<h2>Bugfixes</h2>
<ul>
<li>More fully sanitized the filename in PackageIndex._download. (<a href="https://redirect.github.com/pypa/setuptools/issues/4946">#4946</a>)</li>
</ul>
<h1>v78.1.0</h1>
<h2>Features</h2>
<ul>
<li>Restore access to _get_vc_env with a warning. (<a href="https://redirect.github.com/pypa/setuptools/issues/4874">#4874</a>)</li>
</ul>
<h1>v78.0.2</h1>
<h2>Bugfixes</h2>
<ul>
<li>Postponed removals of deprecated dash-separated and uppercase fields in <code>setup.cfg</code>.
All packages with deprecated configurations are advised to move before 2026. (<a href="https://redirect.github.com/pypa/setuptools/issues/4911">#4911</a>)</li>
</ul>
<h1>v78.0.1</h1>
<h2>Misc</h2>
<ul>
<li><a href="https://redirect.github.com/pypa/setuptools/issues/4909">#4909</a></li>
</ul>
<h1>v78.0.0</h1>
<h2>Bugfixes</h2>
<ul>
<li>Reverted distutils changes that broke the monkey patching of command classes. (<a href="https://redirect.github.com/pypa/setuptools/issues/4902">#4902</a>)</li>
</ul>
<h2>Deprecations and Removals</h2>
<ul>
<li>Setuptools no longer accepts options containing uppercase or dash characters in <code>setup.cfg</code>.</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="8e4868a036"><code>8e4868a</code></a> Bump version: 78.1.0 → 78.1.1</li>
<li><a href="100e9a61ad"><code>100e9a6</code></a> Merge pull request <a href="https://redirect.github.com/pypa/setuptools/issues/4951">#4951</a></li>
<li><a href="8faf1d7e0c"><code>8faf1d7</code></a> Add news fragment.</li>
<li><a href="2ca4a9fe47"><code>2ca4a9f</code></a> Rely on re.sub to perform the decision in one expression.</li>
<li><a href="e409e80029"><code>e409e80</code></a> Extract _sanitize method for sanitizing the filename.</li>
<li><a href="250a6d1797"><code>250a6d1</code></a> Add a check to ensure the name resolves relative to the tmpdir.</li>
<li><a href="d8390feaa9"><code>d8390fe</code></a> Extract _resolve_download_filename with test.</li>
<li><a href="4e1e89392d"><code>4e1e893</code></a> Merge <a href="https://github.com/jaraco/skeleton">https://github.com/jaraco/skeleton</a></li>
<li><a href="3a3144f0d2"><code>3a3144f</code></a> Fix typo: <code>pyproject.license</code> -&gt; <code>project.license</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4931">#4931</a>)</li>
<li><a href="d751068fd2"><code>d751068</code></a> Fix typo: pyproject.license -&gt; project.license</li>
<li>Additional commits viewable in <a href="https://github.com/pypa/setuptools/compare/v70.0.0...v78.1.1">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=setuptools&package-manager=pip&previous-version=70.0.0&new-version=78.1.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154075
Approved by: https://github.com/Skylion007

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-05-25 15:13:03 +00:00
c113cf5a8f [BE] Remove unused conda-env-Linux-X64 (#154303)
According to https://github.com/search?type=code&q=conda-env-++repo%3Apytorch%2Fpytorch it's not referenced anywhere and has been replaced with `conda-env-ci` a while ago
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154303
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-25 14:24:28 +00:00
d8aed0703e [BE][Ez]: Enable ruff rule PLW1507. os.environ is not copied (#154120)
Enables a RUFF rule check against copying os.environ since its' actually a proxy object, not a dict so a shallow copy will be a noop which is rarely desired behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154120
Approved by: https://github.com/malfet
2025-05-25 14:22:57 +00:00
54932d865e Revert "[c10d] Add support for testing SIGABRT return (#153167)"
This reverts commit 03e102dbe8cbffc2e42a3122b262d02f03571de7.

Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to It broke lint ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2907820789))
2025-05-25 13:17:27 +00:00
c4ef4090c5 Fix segfault on exit in CachingHostAllocator by signaling background thread to exit (#154117)
Fixes #152008

This PR fixes a segmentation fault that occurred when exiting the program due to improper background thread management in CachingHostAllocator.

Previously, the background thread continued running and called process_events() even after the allocator object was destroyed, leading to a crash on exit.

f12d8d60b1/aten/src/ATen/core/CachingHostAllocator.h (L218)

```cpp
// Launch the background thread and process events in a loop.
static bool background_thread_flag [[maybe_unused]] = [this] {
  getBackgroundThreadPool()->run([&]() {
    while (true) {
      process_events();  // <-- This line may cause segfault on exit
      std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
  });
  return true;
}();
```

The fix adds a mechanism to signal the background thread to exit before the object is destructed, ensuring the thread stops safely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154117
Approved by: https://github.com/ngimel, https://github.com/cyyever
2025-05-25 07:46:12 +00:00
9d922b55ef [Distributed][CI] Rework continuous TestCase (#153653)
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple TestClass'es in one file).

2. The child processes are now an infinite loop, monitoring test IDs passed from main process via a task queue. Reciprocally, the child processes inform the main process completion of a test via a completion queue.

3. Added a test template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
2025-05-25 03:49:29 +00:00
03e102dbe8 [c10d] Add support for testing SIGABRT return (#153167)
`SIGABRT` is a common return by *negative* distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc.

These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.

Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
2025-05-25 03:48:34 +00:00
10c51b11ff Bump protobuf version and refactor tensorboard tests (#154244)
In preparation for https://github.com/pytorch/pytorch/pull/153746, I am bumping protobuf to 5.29.4 and fixing the tensorboard tests first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154244
Approved by: https://github.com/malfet, https://github.com/cyyever
2025-05-25 00:50:07 +00:00
53ecb8159a Introduce statically_known_false (#154291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154291
Approved by: https://github.com/mengluy0125
2025-05-24 14:23:55 +00:00
2dfc0e3327 [Inductor UT] Reuse test_fused_attention.py for Intel GPU. (#154110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154110
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/EikanWang
2025-05-24 09:51:33 +00:00
cyy
8fe7ec6721 Add /Zc:preprocessor for torch libraries in MSVC builds (#147825)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147825
Approved by: https://github.com/janeyx99
2025-05-24 06:57:46 +00:00
6503b4a96e Update to using mypy 1.15 (#154054)
The BC break isn't real - mypy decided to start complaining about the way we were typing that function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154054
Approved by: https://github.com/Skylion007
2025-05-24 04:30:57 +00:00
76ed9db468 [cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)
Also enables unified workspaces by default for non-FBCODE use cases.
Default Lt workspace size is also updated to match cuBLAS logic for default, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0).

Recommended defaults are documented here:
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-24 03:43:35 +00:00
1ab2993345 Add a link to transformer_building_blocks tutorial (#154281)
Cross-link to https://docs.pytorch.org/tutorials/intermediate/transformer_building_blocks.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154281
Approved by: https://github.com/mikaylagawarecki
2025-05-24 02:50:24 +00:00
e904d01c16 Make inductor UT to be generic (#154196)
# Motivation
https://github.com/pytorch/pytorch/pull/151773 introduces UT `test_triton_template_generated_code_caching` failed on XPU;
https://github.com/pytorch/pytorch/pull/153895 introduces UT `test_mutation_rename` failed on XPU;

fix https://github.com/pytorch/pytorch/issues/154218

# Additional Context
With this PR, both failed UTs passed on local machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154196
Approved by: https://github.com/jansel
2025-05-24 02:47:46 +00:00
a19f2cdf29 [draft export] skip when no LOC found (#154190)
Couldn't repro error, but verified fix with @ColinPeppler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154190
Approved by: https://github.com/ColinPeppler
2025-05-24 02:29:34 +00:00
975bbc63db [MPS][BE] Move fmod/remainder to Metal ops (#154280)
This accomplishes following:
 - Fixes correctness problem with large integer types (though probably makes it slower, but this could not be avoided if one wants to compute accurate answer)
 - Makes op faster for floating point types (as Metal kernel invocation is faster than creating MPSGraph)
 - Eliminates need for several correctness workarounds

Fixes https://github.com/pytorch/pytorch/issues/154171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280
Approved by: https://github.com/dcci
ghstack dependencies: #154275, #154290
2025-05-24 01:45:33 +00:00
8f08bdb7f2 [MPS][BE] Code dedup (#154290)
Eliminate some copy-pasta by introducing `REGISTER_FLOAT_BINARY_OP` and `REGISTER_INTEGER_BINARY_OP` macros
Use `_METAL_310_PLUS` to guard bfloat dtype use
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154290
Approved by: https://github.com/yangw-dev, https://github.com/wdvr
ghstack dependencies: #154275
2025-05-24 01:41:31 +00:00
e5f63f4f66 [CI] Move Mac testing to 3.12 (#154177)
Prep step to completely move away from Conda during the builds..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154177
Approved by: https://github.com/huydhn, https://github.com/cyyever, https://github.com/atalman
ghstack dependencies: #154237, #154268, #154271, #154269, #154270
2025-05-24 01:41:20 +00:00
11a490f32f [CI] Reuse old whl on more workflows (#154285)
Still only on main branch, not PRs, so that we can monitor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154285
Approved by: https://github.com/malfet
2025-05-24 01:25:35 +00:00
308beeeb56 [dynamo] Use UUID for compiled function variable names. (#154148)
Summary:
We previously assign each compiled function variable a name based on in-process global counter. This works fine within the same process but when we're trying to serialize the states with precompile, we need a way to load back these compiled functions without causing collision to the existing global scope.

Changing the counter to a true global uuid seems to resolve this issue.

For example, the new variable name will look like:
```
__compiled_fn_0_7ce7d872_4fe8_4174_b8fd_2496b09b8b43
```

Test Plan: CI

Differential Revision: D75244901

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154148
Approved by: https://github.com/jansel
2025-05-24 01:08:42 +00:00
7ba6fb69e6 [Inductor][CPP] Enable vectorized fp8 E5M2 quant dequant (#153365)
**Summary**
This PR enables the vectorization codegen with Inductor CPP backend for `FP8_E5M2` `quant` from `float32` and `dequant` to `float32`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_dequant_quant_lowering_fp8_e5m2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153365
Approved by: https://github.com/jansel, https://github.com/jgong5
ghstack dependencies: #152417, #152418, #153364
2025-05-23 23:20:02 +00:00
84b657d0b5 Add Vectorized FP8 E5M2 (#153364)
**Summary**
This PR mainly adding the `Vectorized<Float8_e5m2>` class to support the vectorization of `FP8 E5M2` with methods:

- Convert to/from `Vectorized<float>`
- Common vectorized methods like: `mul`, `abs`, `eq` and etc.

**Test Plan**
```
./build/bin/vec_test_all_types_AVX512 --gtest_filter=FP8E5M2Test.*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153364
Approved by: https://github.com/jgong5, https://github.com/CaoE, https://github.com/vkuzo
ghstack dependencies: #152417, #152418
2025-05-23 23:11:25 +00:00
b77a6504fa [Inductor][CPP] Enable vectorized fp8 quant dequant (#152418)
**Summary**
This PR enables the vectorization codegen with Inductor CPP backend for `FP8_E4M3` `quant` from `float32` and `dequant` to `float32`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_dequant_quant_lowering_fp8_e4m3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152418
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/CaoE
ghstack dependencies: #152417
2025-05-23 23:05:17 +00:00
080b74ce67 Add Vectorized FP8 E4M3 (#152417)
**Summary**
This PR mainly adding the `Vectorized<Float8_e4m3fn>` class to support the vectorization of `FP8 E4M3` with methods:

- Convert to/from `Vectorized<float>`
- Common vectorized methods like: `mul`, `abs`, `eq` and etc.

**Test Plan**
```
./build/bin/vec_test_all_types_AVX512 --gtest_filter=FP8E4M3Test.*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152417
Approved by: https://github.com/mingfeima, https://github.com/CaoE, https://github.com/yanbing-j, https://github.com/jgong5, https://github.com/vkuzo
2025-05-23 22:56:56 +00:00
bab59d3c28 Upgrade to CUDA 12.8.1 for nightly binaries (#152923)
Upgrade current CUDA 12.8 builds to 12.8.1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152923
Approved by: https://github.com/atalman
2025-05-23 22:37:05 +00:00
f0b2706914 remove sleef_arm target (#154166)
Summary:
X-link: https://github.com/pytorch/executorch/pull/11082

We shouldn't need an ARM-specific variant; we have select() where we should need it.

Test Plan: CI

Reviewed By: nlutsenko

Differential Revision: D74356413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154166
Approved by: https://github.com/kimishpatel, https://github.com/malfet, https://github.com/Skylion007
2025-05-23 22:16:01 +00:00
86a160353e [BE] Don't run windows builds in pull.yml (#154264)
We already run windows builds and tests [during trunk.yml](c13eeaa718/.github/workflows/trunk.yml (L115-L130)).

Spot checking for failures of this job in pull.yml shows that the most of the times this job fails, the failure correlates with other build jobs failing as well, so it's not offering much unique signal.

Given that we'll run this job before merging the PR as part of trunk.yml anyways, the trade off of extra signal from getting a windows build signal a little earlier doesn't seem worth the infra investment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154264
Approved by: https://github.com/malfet
2025-05-23 22:03:19 +00:00
65f0cf3df5 [mergebot] Do not block on autoformat workflow (#154236)
Helps with https://github.com/pytorch/pytorch/issues/154084

Merge sometimes fails due to autoformat failing.  I believe it's because author doesn't have write perms/workflow running perms -> needs approval for workflows.  On merge, the bot adds the merge label -> triggers autoformat workflow -> needs approval (even though it will end up getting get skipped because the label doesn't match) -> merge sees and fails

So I put an ugly exception for the workflow in mergebot

Some restrictions to keep in mind:
* Need to checkout the PRs code changes to run lint/format on them -> possible security issue if someone modifies a linter/formatter
* The (third party) reusable action used in the autoformat workflow requires the trigger to be pull_request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154236
Approved by: https://github.com/malfet
2025-05-23 22:00:34 +00:00
bb17f9c98b [AOTAutogradCache] Fix CHROMIUM_EVENT_LOG being none (#154258)
It turns out if you import something that's None at import time in python, and later update the value, the one you imported stays none:

```
import torch
from torch._dynamo.utils import CHROMIUM_EVENT_LOG
class Foo:
  pass
torch._dynamo.utils.CHROMIUM_EVENT_LOG =  Foo()

print(CHROMIUM_EVENT_LOG) # None
```

This fixes teh bug so we get AOTAUtogradCache instant events again

Differential Revision: [D75305770](https://our.internmc.facebook.com/intern/diff/D75305770/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154258
Approved by: https://github.com/oulgen
2025-05-23 21:53:31 +00:00
0e4f1b8a06 [CI] Update MacOS conda requirmenets (#154270)
Pick package versions which are compatible with both 3.9 and 3.12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154270
Approved by: https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #154237, #154268, #154271, #154269
2025-05-23 21:44:50 +00:00
5db1503846 [CI] Update MacOS numba and scipy versions (#154269)
Pick versions that supported by both 3.9 and 3.12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154269
Approved by: https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #154237, #154268, #154271
2025-05-23 21:44:49 +00:00
aa3eab2ce6 Fix tcp init when using port 0 (#154156)
I hit this in tests when calling `init_process_group(init_method="tcp://localhost:0", ...)`. You can't use port 0 due to the bug in the conditional and will get error `ValueError: Error initializing torch.distributed using tcp:// rendezvous: port number missing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154156
Approved by: https://github.com/d4l3k, https://github.com/Skylion007
2025-05-23 21:41:58 +00:00
3c0b93afc5 Re-enable link linter (#153280)
And make URL linter always succeed for now.
I'll monitor the logs manually and experiment with it futher.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153280
Approved by: https://github.com/albanD
2025-05-23 20:56:25 +00:00
6f34d141ab [MPS][BE] Delete complex_div (#154275)
An absolute no-op: delete `complex_div` from `UnaryKernel.metal` and use identical one from `c10/metal/utils.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154275
Approved by: https://github.com/dcci
2025-05-23 20:53:50 +00:00
dec6a47996 [BE] Delete unused pip-requirements-iOS.txt (#154271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154271
Approved by: https://github.com/clee2000
ghstack dependencies: #154237, #154268
2025-05-23 20:08:19 +00:00
acd0873d3b [CI] Fix TestDynamoTimed.test_ir_count for 3.12 (#154268)
Python-3.12 emits the same bytecode as 3.13 for code in question
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154268
Approved by: https://github.com/clee2000, https://github.com/atalman
ghstack dependencies: #154237
2025-05-23 20:08:19 +00:00
28af44285b Revert "[c10d] Add support for testing SIGABRT return (#153167)"
This reverts commit 499a76b844bbcbc5465cb76c617b3076c1b0fd65.

Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to Broke lint, see fe784c5a2c/1 ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2905623868))
2025-05-23 19:44:08 +00:00
fe784c5a2c Fix torchbind path in AOTI package loader (#154265)
Summary: as title, fix the path in package loader and fix the test to take the additional dir into consideration.

Test Plan:
```
buck run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:torchbind
```

Reviewed By: angelayi

Differential Revision: D75308904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154265
Approved by: https://github.com/clee2000, https://github.com/malfet
2025-05-23 19:32:53 +00:00
90855835ff Revert "[AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155)"
This reverts commit 269fa8028f68b29176e21886108634f48b1eced7.

Reverted https://github.com/pytorch/pytorch/pull/154155 on behalf of https://github.com/henrylhtsang due to mistake in PR ([comment](https://github.com/pytorch/pytorch/pull/154155#issuecomment-2905514934))
2025-05-23 19:08:40 +00:00
3b21d79225 [export] Move PT2ArchiveWriter/Reader to torch/export (#153795)
Summary:
Before:
`from sigmoid.core.package.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_sigmoid_package`
After:
`from torch.export.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_pt2_package`

By merging the two PT2ArchiveReader/Writers, into using the native PytorchFileReader/Writer, the open source PT2 archive also changed to have an additional folder. However this PR still maintains support for loading an old PT2 archive which does not have the additional folder.

Before:
```
├── archive_format
├── byteorder
├── .data
│   ├── serialization_id
│   └── version
├── data
│   ├── aotinductor

```
After:
```
├── tmp
│   ├── archive_format
│   ├── byteorder
│   ├── .data
│   │   ├── serialization_id
│   │   └── version
│   ├── data
│   │   ├── aotinductor
```

Test Plan:
`buck2 test //sigmoid/...`
https://www.internalfb.com/intern/testinfra/testrun/5348024839248187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153795
Approved by: https://github.com/zhxchen17
2025-05-23 19:04:36 +00:00
499a76b844 [c10d] Add support for testing SIGABRT return (#153167)
`SIGABRT` is a common return by *negative* distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc.

These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.

Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
2025-05-23 19:04:28 +00:00
561a11aa68 Revert "Patch the _is_conv_node function (#153749)"
This reverts commit c985cec5b2545d46af682d486b18866eee5dffd5.

Reverted https://github.com/pytorch/pytorch/pull/153749 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153749#issuecomment-2905504697))
2025-05-23 19:04:20 +00:00
4ff19ecf66 Revert "[export] Move PT2ArchiveWriter/Reader to torch/export (#153795)"
This reverts commit 7e80f23516a86e18ae5bc5579d3005c1e7610102.

Reverted https://github.com/pytorch/pytorch/pull/153795 on behalf of https://github.com/malfet due to Looks like it broke lots of tests, see ec368a1903/1 ([comment](https://github.com/pytorch/pytorch/pull/153795#issuecomment-2905415496))
2025-05-23 18:29:08 +00:00
ec368a1903 Add sitemap (#154158)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154158
Approved by: https://github.com/albanD
2025-05-23 18:01:00 +00:00
0d62fd5c3c [MTIA Aten Backend][2/n] Migrate clamp ops(clamp.out/clamp_min.out/clamp_max.out) from out-of-tree to in-tree (#154015)
Summary:
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This PR
1. Migrate 3 clamp ops from out-of-tree to in-tree(had to migrate the 3 ops altogether, because clamp.out calls all 3 stubs, which are also called by the other 2 ops):
- clamp.out
- clamp_min.out
- clamp_max.out
2. Also enabled structured kernel codegen for MTIA, which is needed by clamp
3. Also introduced the `--mtia` flag to torchgen to prevent OSS from gencoding MTIA code.(Otherwise we got such link error `lib/libtorch_cpu.so: undefined reference to at::detail::empty_mtia`)

Differential Revision: D74674418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154015
Approved by: https://github.com/albanD, https://github.com/nautsimon
2025-05-23 17:59:47 +00:00
bcb2125f0a [BE][CI] Update expecttest version to 0.3.0 (#154237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154237
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/atalman
2025-05-23 17:27:41 +00:00
cae25ef4e5 [c10d] Enhance Error Logging in new_subgroups() for Non-Divisible World Sizes (#154124)
Summary: The error caused by the world size not being divisible by `group_size` is a common issue encountered by end-users when utilizing applications built on top of `new_subgroups()`. However, these applications may employ different variable names, such as `num_trainers_per_group`, which can make the current error messages less effective despite being correct. To address this, we have improved the error messages to display the actual numbers involved, thereby enhancing their clarity and usefulness.

Test Plan: contbuild & OSS CI

Differential Revision: D75226925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154124
Approved by: https://github.com/wz337
2025-05-23 17:12:43 +00:00
e927ba6dbd [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-23 17:12:25 +00:00
04a6fe7914 Update provenance tracking doc (#154062)
Summary: Update the doc to reflect the changes in https://github.com/pytorch/pytorch/pull/153584/files#diff-e0cdb58c0f84f56f20c5433339b6d83c470dcde47847e2328effea6bedd4cd27 and https://github.com/pytorch/tlparse/pull/110

Test Plan: CI

Differential Revision: D75155981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154062
Approved by: https://github.com/svekars, https://github.com/desertfire
2025-05-23 17:09:52 +00:00
7d8ea5db69 Disable cache and utilization stats uploading steps on s390x (#150297)
There are no AWS credentials available on s390x runners. These steps are failing anyway due to that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150297
Approved by: https://github.com/seemethere
2025-05-23 16:49:38 +00:00
7e80f23516 [export] Move PT2ArchiveWriter/Reader to torch/export (#153795)
Summary:
Before:
`from sigmoid.core.package.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_sigmoid_package`
After:
`from torch.export.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_pt2_package`

By merging the two PT2ArchiveReader/Writers, into using the native PytorchFileReader/Writer, the open source PT2 archive also changed to have an additional folder. However this PR still maintains support for loading an old PT2 archive which does not have the additional folder.

Before:
```
├── archive_format
├── byteorder
├── .data
│   ├── serialization_id
│   └── version
├── data
│   ├── aotinductor

```
After:
```
├── tmp
│   ├── archive_format
│   ├── byteorder
│   ├── .data
│   │   ├── serialization_id
│   │   └── version
│   ├── data
│   │   ├── aotinductor
```

Test Plan:
`buck2 test //sigmoid/...`
https://www.internalfb.com/intern/testinfra/testrun/5348024839248187

Differential Revision: D74616598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153795
Approved by: https://github.com/zhxchen17
2025-05-23 15:40:25 +00:00
214e4cef9f Fix RMSNorm doc rendering (#154205)
By removing `::func::` decorator which adds unneeded parenthesis

Test plan: Check https://docs-preview.pytorch.org/pytorch/pytorch/154205/generated/torch.nn.RMSNorm.html#rmsnorm
that now renders as
<img width="704" alt="image" src="https://github.com/user-attachments/assets/443f605d-75a6-41ef-8971-21e7dc8ef9f6" />

Fixes https://github.com/pytorch/pytorch/issues/154184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154205
Approved by: https://github.com/mikaylagawarecki
2025-05-23 15:39:29 +00:00
9e089bb5b6 change guard_or impl for better perf and simplicity (#153674)
PR time benchmarks has been showing regressions as we move to guard_or_false, reason is that prev implementation do not cache.
This new approach will propagate the fallback value to eval and return it. allowing eval to cache and reducing scamming logs and complexity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153674
Approved by: https://github.com/bobrenjc93
2025-05-23 15:24:28 +00:00
4b7abce6a4 Fix fake tensor caching when output has unbacked (#153034)
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.

This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.

However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output.  In this case we shouldn't cache at all because what would that really mean?

So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.

Added a test which checks for this case.

While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.

The latest version of this also:
1. Addresses the problem that caused #153891.
    The issue was that with caching ops are required to support `__eq__`.  Unfortunately _RecordFunction is minimalistic and doesn't support that - so in the off-chance that two keys hash to the same value the `__eq__` check would raise an exception.

    Apparently this was much more common on MacOS where memory patterns end up with more reuse (so the object IDs are the same and give you the same hash value for objects that use pointer hash).

    Tested locally on MacOS where running
```
python test/inductor/test_torchinductor.py GPUTests
```
was pretty much guaranteed to fail (at least for me) somewhere around test 100-200 and passed all 800 tests after this change.

Another way to test this is to run the inductor tests with `torch._subclasses.fake_tensor._DispatchCacheKey.__hash__` monkey-patched to return a constant (causing all values to hash-collide) but this can't really be checked-in since it causes the cache lookup to turn into an O(n) lookup which takes a crazy long time to run through all the tests...

2. Folds in #153780 to ensure that exceptions raised from the op don't include the context from the cache key bypass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
2025-05-23 15:03:31 +00:00
866142ff16 Revert "Update the heuristic for AArch64 bmm/baddbmm (#149122)"
This reverts commit d759a517af3e6b2337bf8f8e0d1734e64e470f1b.

Reverted https://github.com/pytorch/pytorch/pull/149122 on behalf of https://github.com/jeanschmidt due to breaking internal models, @malfet may you help merge this? ([comment](https://github.com/pytorch/pytorch/pull/149122#issuecomment-2904703075))
2025-05-23 14:54:54 +00:00
5859582ee4 [BE][MPS] Delete unused complex_mul_out (#154175)
It's no longer called, after `mul` has been migrated to binary op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154175
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-05-23 13:44:24 +00:00
2225231a14 Enable AArch64 CI scripts to be used for local dev (#143190)
- Allow user to specify custom ComputeLibrary directory, which is then built rather than checking out a clean copy
- Remove `setup.py clean` in build. The CI environment should be clean already, removing this enables incremental rebuilds
- Use all cores for building ComputeLibrary

Mostly a port of https://github.com/pytorch/builder/pull/2028 with the conda part removed, because aarch64_ci_setup.sh has changed and can now handle being called twice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143190
Approved by: https://github.com/aditew01, https://github.com/fadara01, https://github.com/malfet

Co-authored-by: David Svantesson-Yeung <David.Svantesson-Yeung@arm.com>
2025-05-23 12:09:59 +00:00
25149cd173 [c10d] Add more tests to prevent extra context (#154174)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Loop a bunch of sync ops and see if any of them creates extra context.
Requires nvml to check number of processes resident on a device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154174
Approved by: https://github.com/atalman
2025-05-23 09:54:01 +00:00
ba5d45d22e Add assertion to align with cuda (#153233)
Fixes #153137

Aligned batch_norm_cpu_out assertion to [batch_norm_cuda_out](a7ea115494/aten/src/ATen/native/cuda/Normalization.cu (L436)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153233
Approved by: https://github.com/malfet
2025-05-23 07:32:43 +00:00
5623d30228 [Minimizer] Gracefully exit when there is no discrepancy in block mode (#154076)
Summary:
Previously, when there is no discrepancy in results for block mode, net_min_base will throw an OOB error.

This occurs due to the block _block_traverse_impl returning an OOB after exhausting subgraphs all the way down to a single node

There is also an issue where we may get an unsound subgraph (i.e. mark an earlier node as the "end" even if the correct end is later). This is due to an incorrect check (start_idx == mid) where there can possibly be two values left before the program pre-maturely returns

Test Plan:
Buck UI: https://www.internalfb.com/buck2/52524c26-ace5-4593-8a4b-843a54eb206a
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3096224973363310
Network: Up: 0B  Down: 15MiB  (reSessionID-cd404e97-395f-49fc-8381-373e90a1378f)
Executing actions. Remaining     0/1
Command: test.
Time elapsed: 53.7s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D75143242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154076
Approved by: https://github.com/jfix71
2025-05-23 06:42:07 +00:00
8342b9371e [ROCm] Prefer hipblaslt for gfx1200, gfx1201 (#153610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153610
Approved by: https://github.com/jeffdaily, https://github.com/atalman
2025-05-23 06:01:53 +00:00
26471fc203 [aoti] Initial Metal support (#153959)
An example generated file: P1816629015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #153964
2025-05-23 05:45:35 +00:00
b33b7d5c8c [aoti] Add MPS runner and shim (#153964)
Added AOTIModelContainerRunnerMps and a shim for mps fallback ops.
I also added a mps-specific shim which contains one operator, which will be used to set arguments being passed to the Metal kernel:

```
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_mps_set_arg(
    AOTIMetalKernelFunctionHandle func,
    unsigned idx,
    AtenTensorHandle tensor);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153964
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-23 05:45:35 +00:00
269fa8028f [AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155)
Differential Revision: [D75253009](https://our.internmc.facebook.com/intern/diff/D75253009/)

In general, we want to cache the cutlass kernels.

Also saw an error saying .o not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154155
Approved by: https://github.com/chenyang78
2025-05-23 04:51:36 +00:00
5bb156a7fd [dynamo] raise observed exception for module attribute errors (#153659)
Fixes https://github.com/pytorch/pytorch/issues/153605

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153659
Approved by: https://github.com/StrongerXi
2025-05-23 03:56:26 +00:00
db1f33147b [audio hash update] update the pinned audio hash (#154001)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154001
Approved by: https://github.com/pytorchbot
2025-05-23 03:51:21 +00:00
c1055f41a6 Data dependent free reshape. (#153198)
#### change 1: if compute_strides stride fail for reshape just clone.

Lets consider the most general case, if torch compile is asked to reshape [u0, u1][u3, u4] -> [u5, u6] what shall it do?
The shape is general enough to represent both contiguous and non contiguous tensors, tensors where a clone free reshape can happen and other where a clone free cant happen.  The current algorithm will fail due to data dependent errors.

The general idea is if its impossible to tell if the reshape can happen in place, (because for some concrete inputs
it will and other not) then its ok to take the general path and clone, instead of failing or asking the user to give hints.
**Because the user want a single graph (single compilations)** and this is the only way it can be done.
Had this been a view? then the user is explicitly asking for a copy-free reshape, we would fail asking for more
information (hints in torch.checks form).

with this change reshape works as the following:
1. if we know the input is contiguous we will convert the reshape to view.
2. if compute_strides succeed we will use view. (compute_strides  was changed to not fail when when unbacked presented instead it will just return nullptr if it cant compute the strides meaning we shall use a clone).
3. if neither 1, 2 works clone and use a view.

Side note: having a view does not mean that inductor will not clone, for inductor there is a pass that converts all views back to reshapes and inductor has its logic dealing with those.

#### change 2 : skip  _reshape_view_helper and fall back to simpler logic if it fail.
We trace _reshape_view_helper when doing fake tensor tracing , but not during proxy tracing. hence such tracing wont effect the graph (only compute output shapes of several operations). We should not fail there, because it should always be possible for us to pass it in case of reshape.

i.e. when reshape_symint was called we would have either cloned, or compute_strides succeeded so the view should pass. What I did is the following: we run _reshape_view_helper, if we fail due to unbacked we call _view_simple which will succeed always for reshapes, (might fail for views when its impossible to do the view, in such case we throw the dde that was thrown by the original algorithm).

Ideally I would want to register _view_simple as the meta for view and avoid calling  _reshape_view_helper completely but I am running some issues with the dispatcher with subclasses and I do not have time to debug it. Namely one test
would end up calling some c++ view function that does not support symints during meta dispatch when i register a
python meta decompositions
```python test/dynamo/test_subclasses.py SubclassTests.test_subclass_views_dynamic_True ```
 https://github.com/pytorch/pytorch/issues/153303.I will follow up with that change in a separate PR.  cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @bdhirsh

 Two other alternatives for registering   _view_simple as meta and the try catch approach in this PR is:
 1. call _view_simple if any input is dynamic see  #153521
 2. if we make is_compiling works for framework code tracing (does not work rn) we can call _view_simple
 is if is_compiling.

#### Note:
Reshape can still fail when is_contiguous is called, Next PR will handle that by calling is_known_contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153198
Approved by: https://github.com/etaf, https://github.com/bobrenjc93
2025-05-23 01:45:16 +00:00
f74842d665 [DTensor] enable SimpleFSDP's composability with Tensor Parallel (#152286)
This PR adds support for SimpleFSDP's composability with Tensor Parallel + torch.compile.

`_StridedShard` is used in SimpleFSDP/FSDP2 to support correct distributed checkpointing when FSDP+TP is applied. Previously, `_StridedShard` is not guarded by torch.compile. This PR adds `_StridedShard` as an additional placement type to be guarded by torch.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152286
Approved by: https://github.com/bdhirsh
2025-05-23 01:40:38 +00:00
7509b150af Don't upload compiler benchmark debug info to the benchmark database (#153769)
During our debug session, @wdvr and I found out that the benchmark database is growing much faster than we expect.  After taking a closer look, the majority of them coming from TorchInductor benchmark and the top 3 are all debug information not used by any dashboard atm.  In the period of 7 days, there are close to 6 millions records ([query](https://paste.sh/GUVCBa0v#UzszFCZaWQxh7oSVsZtfZdVE))

```
Benchmark,Metric,Count
"TorchInductor","user_stack","1926014"
"TorchInductor","reason","1926014"
"TorchInductor","model","1926014"
```

Let's skip uploading them to avoid bloating the database.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153769
Approved by: https://github.com/malfet
2025-05-23 01:18:26 +00:00
768cb734ec cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-23 00:51:20 +00:00
3c0cbf4b44 Update GH action to use the correct label (#154126)
Update GH action to use the correct label for the docathon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154126
Approved by: https://github.com/AlannaBurke, https://github.com/clee2000
2025-05-23 00:29:43 +00:00
31f3ee0966 [BE][Ez]: Enable PT014 check for duplicate parameterize test cases (#154118)
Ruff rule which checks for an error [PT014](https://docs.astral.sh/ruff/rules/pytest-duplicate-parametrize-test-cases/) where a user might specify two duplicate test cases in pytest.parameterize, which is likely an error since it tests the same thing twice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154118
Approved by: https://github.com/malfet
2025-05-23 00:00:53 +00:00
7b25ff7cf2 [Inductor] Add attention pattern for model DistilBert in transformers==4.44.2. (#154091)
This PR add a attention fusion pattern that match the attention of
DistilDistilBert in transformers==4.44.2 at
953196a43d/src/transformers/models/distilbert/modeling_distilbert.py (L212)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154091
Approved by: https://github.com/jansel, https://github.com/eellison
2025-05-22 23:37:03 +00:00
59c5fff2aa Revert "[DDP] rebuilt bucket order when find_unused_parameters=true (#153404)"
This reverts commit a79e621c1c11bcef5f816b9770b751237b84f620.

Reverted https://github.com/pytorch/pytorch/pull/153404 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153404#issuecomment-2902741300))
2025-05-22 22:26:59 +00:00
f2cce45657 [libc++ readiness][caffe2] No reason to check for "ext/stdio_filebuf.h" (#154080)
Summary: There should be no reason to check for existence of this GNU C++ header here in this file. It doesn't include it. Removing this condition to make it build under libc++.

Differential Revision: D75179136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154080
Approved by: https://github.com/soumith
2025-05-22 22:23:39 +00:00
c985cec5b2 Patch the _is_conv_node function (#153749)
Summary: torch.ops.aten.conv2d.padding is also conv2d node

Differential Revision: D74898941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153749
Approved by: https://github.com/andrewor14, https://github.com/Skylion007
2025-05-22 22:17:02 +00:00
413664b3c5 catch CSE recursion depth errors (#154039)
Fixes #153777

CSE is an optimization and shouldn't block a compile if it hits recursion depth limits. Unfortunately we can't write this iteratively due to a dependency on `ast.unparse` which necessarily needs to do recursion. This PR catches opts out of CSE when we hit recursion depth errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154039
Approved by: https://github.com/Microve
2025-05-22 20:17:19 +00:00
cad0727fe1 Rename the provenance tracing artifact name for kernel <-> post_grad nodes mapping (#154046)
Summary:
Context:

Recently we've added a couple more kernel types support other than inductor generated triton kernels,

such as cpu cpp kernels, extern kernels.

The name appeared in tlparse chrome link can be confusing to users.

Rename from

`inductor_triton_kernel_to_post_grad_nodes.json`

to `inductor_generated_kernel_to_post_grad_nodes.json`

Test Plan: CI

Differential Revision: D75159042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154046
Approved by: https://github.com/yushangdi
2025-05-22 19:20:56 +00:00
4277907d02 [binary builds] Linux aarch64 CUDA builds. Make sure tag is set correctly (#154045)
1. This should set the Manylinux 2.28 tag correctly for CUDA Aarch builds.
I believe we used to have something similar in the old script:
https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/build_aarch64_wheel.py#L811

``Tag: cp311-cp311-linux_aarch64 ``-> ``Tag: cp311-cp311-manylinux_2_28_aarch64``

2. Remove section for CUDA 12.6, since we no longer building CUDA 12.6 aarch64 builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154045
Approved by: https://github.com/Camyll, https://github.com/malfet
2025-05-22 18:36:13 +00:00
788d9cb2d7 [3/n][Optimus][Auto-AC][reland] Support any fp8 quantization type and set scaling as the default" (#154057)
Summary:
This is a reland of D74910193.
We change the dtype to torch.float8_e5m2 in unit test since it is not supported.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization
```

Differential Revision: D75169792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154057
Approved by: https://github.com/Mingming-Ding
2025-05-22 18:26:34 +00:00
c2660d29a5 [ROCm] Added unit test to test the cuda_pluggable allocator (#154041)
Added unit test to include the cuda_pluggable allocator and replicate the apex setup.py to build nccl_allocator extension

This test to check if this commit https://github.com/pytorch/pytorch/pull/152179 helps to build the cuda pluggable allocator in Rocm/Apex

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154041
Approved by: https://github.com/atalman, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-05-22 18:22:15 +00:00
5b8f422561 [PT2][Optimus] Fix a typo in decompose_mm (#154048)
Summary: As titled

Differential Revision: D75160513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154048
Approved by: https://github.com/Mingming-Ding
2025-05-22 18:11:40 +00:00
633ed01145 [MPS] Add support for two more isin variants (#154010)
`isin_Tensor_Scalar_out` is just a redispatch to eq/neq
`isin_Scalar_Tensor_out` redispatches back to generic `isin` op, but needs a small tweak to handle float scalars
Make sure that `out` is resized to an expected value in `isin_Tensor_Tensor_out_mps`

Add unittests to validate that, but skip them on MacOS-13, where MPS op just returns garbage

Before this change both of those failed
```python
>>> import torch
>>> t = torch.tensor([0, 1, 2], device='mps')
>>> torch.isin(t, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Tensor_Scalar_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
>>> torch.isin(1, t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Scalar_Tensor_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154010
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/manuelcandales
ghstack dependencies: #153970, #153971, #153997
2025-05-22 17:59:35 +00:00
7421c21b5e remove unused code. (#153979)
Remove the unused cmake code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153979
Approved by: https://github.com/albanD
2025-05-22 17:50:11 +00:00
fc859077a0 [export][cond] support merging constant ints as unbacked symint (#152742)
@pianpwk points out that this will be helpful to address several data dependent issues in huggingface [models](e23705e557/src/diffusers/schedulers/scheduling_euler_ancestral_discrete.py (L332)) with the following pattern:
```python
idx = return 0 if u0 else return 1
return  x[idx]
```
We could preserve the conditional with a cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152742
Approved by: https://github.com/zou3519
2025-05-22 17:25:38 +00:00
025c5cc048 Revert "[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)"
This reverts commit d23762974eae105aad837188d5d2254ea9783b37.

Reverted https://github.com/pytorch/pytorch/pull/153335 on behalf of https://github.com/yangw-dev due to sorry the pr is failed internally [D75155648](https://www.internalfb.com/diff/D75155648) ([comment](https://github.com/pytorch/pytorch/pull/153335#issuecomment-2901916364))
2025-05-22 16:52:04 +00:00
7d3dab6b90 Revert "[BE]: Type previously untyped decorators (#153726)"
This reverts commit b7d08defe9cfe1595ff680f845b39f5e03a89555.

Reverted https://github.com/pytorch/pytorch/pull/153726 on behalf of https://github.com/yangw-dev due to sorry, it seems like your pr failed typecheck error internally, [D75155486](https://www.internalfb.com/diff/D75155486) ([comment](https://github.com/pytorch/pytorch/pull/153726#issuecomment-2901911114))
2025-05-22 16:49:08 +00:00
a15550b776 [Cutlass] Use env var for EVT flag (#154099)
Swaps out hard flag for environment variable in inductor config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154099
Approved by: https://github.com/eellison
2025-05-22 16:36:57 +00:00
a82c8891d5 Revert "[aoti] Add MPS runner and shim (#153964)"
This reverts commit 918ae5d36188f419a47f3b1315f9fb373035ed66.

Reverted https://github.com/pytorch/pytorch/pull/153964 on behalf of https://github.com/angelayi due to broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153964#issuecomment-2901876832))
2025-05-22 16:35:59 +00:00
47a01f3efb Revert "[aoti] Initial Metal support (#153959)"
This reverts commit 28bcd9eb30336b370298dbe9677b95019882f2a8.

Reverted https://github.com/pytorch/pytorch/pull/153959 on behalf of https://github.com/angelayi due to previous PR broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153959#issuecomment-2901825315))
2025-05-22 16:17:07 +00:00
f419373dd3 [inductor] lowering for fractional_max_pool3d (#148630)
also a lowering with a reduction for large window_sizes for
fractional_max_pool2d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148630
Approved by: https://github.com/eellison
2025-05-22 16:06:29 +00:00
9a8c42ff94 Get rid of unused code in linters (#154043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154043
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2025-05-22 15:24:54 +00:00
35ddad284d update mutation renames (#153895)
Thanks to @PaulZhang12 for original find. When we finalize a multi template buffer, we need to reflect mutation renaming in dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153895
Approved by: https://github.com/PaulZhang12
2025-05-22 14:54:39 +00:00
6cd9d66b7f Allow higher fp16 tolerance for phlippe_resnet on CUDA 12.8 (#154109)
After https://github.com/pytorch/pytorch/pull/154004, one of the model `phlippe_resnet` needs higher tolerance for fp16 on CUDA 12.8.  I can reproduce it locally with:

```
python benchmarks/dynamo/torchbench.py --accuracy --timing --explain --print-compilation-time --inductor --device cuda --training --amp --only phlippe_resnet

E0522 02:47:12.392000 2130213 site-packages/torch/_dynamo/utils.py:2949] RMSE (res-fp64): 0.00144, (ref-fp64): 0.00036 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000, use_larger_multiplier_for_smaller_tensor: 0
```

I'm not sure what exactly happens behind the scene, but this should help fix the CI failure.

Also remove some left over expected accuracy results for CUDA 12.4 which we are not using anymore on CI for benchmark jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154109
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-22 14:25:12 +00:00
4439255148 [aotd] Support saved tensors hooks in aot_autograd (#150032)
https://github.com/pytorch/pytorch/issues/148222

Goal:

At the moment autograd saved tensors hooks are run in eager after compiled forward.
They are executed at the same time for all saved tensors.
Hooks can be used to reduce amout of memory used for saved tensors, doing quantization or offloading to cpu.
This is suboptimal for optimization of peak memory.
Better solution will be to put the hooks in the graph, as close as possible to the last usage of the tensor.

To get user specified autograd saved tensors hooks in the graph.

Logic:

UX:
If user specifies with torch.autograd.graph.saved_tensors_hooks(pack_gm, unpack_gm).
Where pack_gm and unpack_gm are torch.fx.GraphModule.
Then AotAutograd will retrace those graph modules, doing decompositions and functionalization in aot_autograd, inlining the result graphs in forward epilogue and backward prologue.

User may want to use control logic in the hooks, for example applying quantization only for specific dtypes and sizes.

This is also possible, user can put it into torch.fx.wrap function and use symbolic trace to make a GraphModule.

In that case AotAutograd cahing will work only in case when user explicitly set to the torch.fx.wrap call_function node "user_cache_hash" metadata.

If this metadata set - then aot_autograd cache can use saved cache artifact.
If metadata is not set - then cache is bypassed.

Dynamo:
Dynamo traces pack and unpack hooks and installs them as subgraph and explicitly adds to the output_graph. (As those subgraphs are not used and will not be copied in the result by default).

The complexity here is that at this moment we do not have example of inputs for the hooks.
We trace  pack_hook with some Tensor from the inputs.
The result subgraphs are added to the hashing of AotAutograd Cache.

In AotAutograd we retrace the graph with the true saved tensors coming from partitioner.

Backwards Compatibility:
As current hooks are executed in eager mode and not all of them will be traceable - we only try to put in the graph hooks, explicitly marked by user with annotation (@_inlineable_saved_tensors_hooks).
For other hooks or if compiled autograd is enabled - keep the same logic.

Recompilations:
Hooks are guarded with lambda guard matching function id to cause recompilation if user reruns compiled function.

Aot_autograd:
After partitioner prepared forward and backward module - we trace prepared at Dynamo graphs for pack and unpack hooks and inline them in epilogue of forward and prologue of backward. Forward outputs and backward inputs are changed, transparently for user.

We do not try to put it close the last usage etc., relying on inductor to do this optimization.

```
INFO: TRACED GRAPH
 ===== Forward graph pre saved_tensors_hooks inlining 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
        return (view, add, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== Backward graph pre saved_tensors_hooks inlining 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
        return (view, add, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== saved_tensors_pack_hook add 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class pack_float8(torch.nn.Module):
    def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn);  x_1 = None
        return (torch.float32, _to_copy)

INFO: TRACED GRAPH
 ===== saved_tensors_unpack_hook add 3 =====
 <eval_with_key>.22 from /data/users/ivankobzarev/a/pytorch/torch/fx/experimental/proxy_tensor.py:1225 in wrapped class pack_float8(torch.nn.Module):
    def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn);  x_1 = None
        return (torch.float32, _to_copy)

INFO: TRACED GRAPH
 ===== Forward graph 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add, dtype = torch.float8_e4m3fn)

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2]);  add = None
        return (view, _to_copy, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== Backward graph 3 =====
 <eval_with_key>.21 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", add_packed_2: "f8e4m3fn[s0, s1][s1, 1]cuda:0", tangents_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add_packed_2, dtype = torch.float32);  add_packed_2 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        add_7: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(tangents_1, _to_copy);  tangents_1 = _to_copy = None
        return (None, None, add_7)

```

Differential Revision: [D72187044](https://our.internmc.facebook.com/intern/diff/D72187044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150032
Approved by: https://github.com/bdhirsh
2025-05-22 14:09:38 +00:00
f12d8d60b1 Add hint message when parameters is empty in clip_grad_norm_ (#151529)
Fixes #148259

## Changes

- Add print warning message when `parameters` generator exhausted

## Test Result
### print warning
```python

import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

model = SimpleModel()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(16, 10)
targets = torch.randn(16, 1)

outputs = model(inputs)
loss = criterion(outputs, targets)
optimizer.zero_grad()
loss.backward()

params_to_clip = model.parameters()

for p in params_to_clip:
    print(p.shape)

max_norm = 1.0
norm_type = 2.0
total_norm = nn.utils.clip_grad_norm_(params_to_clip, max_norm, norm_type)
print(f"total_norm: {total_norm}")
```

```bash
/home/zong/code/pytorch/torch/nn/utils/clip_grad.py:222: UserWarning: `parameters` is an empty generator, no gradient clipping will occur.
  warnings.warn(
total_norm: 0.0
```

### UT

```bash
pytest test/test_nn.py -k test_clip_grad_norm
```

![image](https://github.com/user-attachments/assets/0aa0f06c-e0a5-43cf-9a97-d7c2747c9180)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151529
Approved by: https://github.com/jbschlosser
2025-05-22 11:23:39 +00:00
40e6ca24ef Update CPU Inductor merge rules by adding more CPP Template (#152086)
**Summary**
Add more CPP Template into the CPU Inductor merge rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152086
Approved by: https://github.com/atalman
2025-05-22 09:46:26 +00:00
2f57ee579d S390x update docker image (#153619)
Add ninja-build for pytorch tests.
Switch to gcc 14 due to fix for precompiled headers and s390x vectorization interaction.
Disable -Werror when building onnxruntime.
Pin onnx version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153619
Approved by: https://github.com/huydhn
2025-05-22 09:34:46 +00:00
d7a83ab67b Fix lr_scheduler unexpectedly calls step() when init argument last_epoch is larger than -1 (#149312)
Fixes #102261

## Changes

- Use flag `_is_initial` to replace `self.last_epoch == 0` condition to judge whether `lr` should be initial value
- Add test for `ExponentialLR` checkpoint usecase

## Test Result

```python
pytest -s test/optim/test_lrscheduler.py  -vv
```

![image](https://github.com/user-attachments/assets/6fd32bcc-b4fb-4421-b891-620bd4900dc1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149312
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-05-22 08:42:37 +00:00
423fc671e9 [Cutlass] Support float8_e4m3fn GEMM (#153890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153890
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-05-22 08:37:33 +00:00
c1b7dbc52a [dynamo] unimplemented -> unimplemented_v2 in variables/dict.py (#154040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154040
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
2025-05-22 06:46:10 +00:00
a664cfdf95 Add C10_NODEPRECATED check for xpu (#153935)
# Motivation
Add `C10_NODEPRECATED` check for XPU. This doesn't allow xpu codebase to use `c10::optional`.

What's the change about torch-xpu-ops commit update?
Deprecate `c10::optional`, `c10::nullopt`, `c10::make_option`, use the counterpart in std instead.

# Additional Context
This PR depends on
https://github.com/intel/torch-xpu-ops/pull/1683
https://github.com/intel/torch-xpu-ops/pull/1690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153935
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-22 06:44:04 +00:00
482e5b6660 [inductor] Added precompilation_timeout_seconds into a config instead of hardcoded (#153788)
Fixes #153392

- Updated config.py to add the timeout as a config var to be tuned dynamically (default is 3600s).
- Passed the var as a kwarg during call on instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153788
Approved by: https://github.com/henrylhtsang
2025-05-22 06:44:02 +00:00
7128b50a65 [CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 (#151594)
This PR moves distributed cuda CI job from cuda 11.8 to cuda 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so temporarily skip them after creating the issues.

https://github.com/pytorch/pytorch/issues/153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
https://github.com/pytorch/pytorch/issues/153122 CUDA context related
https://github.com/pytorch/pytorch/issues/153517  NCCL regression, future NCCL may fix it
https://github.com/pytorch/pytorch/issues/154073 skip test_symmetric_memory for cuda 12.6 before it is fixed

See: https://github.com/pytorch/pytorch/issues/147383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever, https://github.com/huydhn, https://github.com/kwen2501
2025-05-22 06:33:29 +00:00
4bcff4af99 Move prologue_supported_inputs computations to def_kernal (#150869)
This avoid replaying load_input on a cache hit on the generate_code_cache.
the idea is that if a template have prologue_loads_all_inputs = True, it means that
all all inputs are loaded and hence no need to replay

Effect on the current benchmark on a local run on dev server.
18549985383 -> 15072230073
25697270062 -> 20738613297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150869
Approved by: https://github.com/eellison
2025-05-22 06:24:44 +00:00
4421aee558 torch.compile: Supress stdout / stderr output from subprocesses when local (#153837)
Summary:
This output is extremely noisy - i.e. on a 96 core machine, with 8 ranks, you
can get ~700 duplicate set of logs from each worker.

Differential Revision: D74907920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153837
Approved by: https://github.com/aorenste, https://github.com/masnesral
2025-05-22 05:49:43 +00:00
f2af30fee5 Add a HOP to bypass tracing of a wrapper function while tracing the wrapped function (#153487)
Usage:
```python
from torch._higher_order_ops.wrap import dynamo_bypassing_wrapper

# Your ordinary function wrapper
def my_hop_fn_impl(fn, *args, k=1, **kwargs):
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        if isinstance(out, tuple):
            return (out[0] + k,)
        return out + k

    return wrapper

# Calling `my_hop_fn` instead of the impl directly captures a HOP into the dynamo graph
def my_hop_fn(fn, *args, k=1, **kwargs):
    return dynamo_bypassing_wrapper(
        functools.partial(my_hop_fn_impl, k=k), fn, *args, **kwargs
    )
```

Notes:
- The dynamo captured graph now stashes arbitrary callable objects (the wrapper_fn) - this is equivalent to what SAC does today with policy_fn.
- The `wrapper_fn` passed to `dynamo_bypassing_wrapper ` should have signature `Callable -> Callable`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153487
Approved by: https://github.com/ydwu4
2025-05-22 04:24:38 +00:00
669b176d4c [Graph Partition] support removed arguments, NoneLayout, and mutation (#153899)
Graph partition relies on `read_writes` to collect partition inputs and outputs. There are three edge cases:

1. `NoneLayout` is not allocated so it cannot become a partition input or output.
2. Codegen may decide a buffer to be internal to a kernel (e.g., triton kernel). One example is some buffers internal to a FusedSchedulerNode. These buffers are never actually allocated as `buf_id`.
3. We should use mutation_real_name for graph partition inputs and outputs to match the behavior of other codegen.

This PR supports these 3 cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153899
Approved by: https://github.com/eellison
2025-05-22 04:24:31 +00:00
d1fe198df6 [cond] support output the same unbacked symbol from two branches (#148206)
Previously, we didn't track the unbacked symbols leaked out of true_branch and false_branch if they have the same shape expr. This cause the the fake output of cond operator itself doesn't set up its unbacked_bindings meta properly (because they're ignored).

In this PR, we also check whether there're leaked out unbacked symbols and create new unbacked symbols for it and track it as output of cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148206
Approved by: https://github.com/zou3519
2025-05-22 03:39:43 +00:00
fe285b9560 [aoti] fix corner case in unbacked replacements for atomically_apply_size_hint (#153768)
## PR
There are a few cases that my previous PR (#153220) didn't cover.
1. The LHS/RHS matters. Today, if you do `torch._check(lhs == rhs)` then it will show up as a deferred runtime assert with `Eq(lhs, rhs)`.
2. There can be transitive replacements. For example, expr1 -> expr2 -> u0. `test_size_with_unbacked_add_expr_transitive` tests for this.
3. An unbacked symint expr may not have a replacement that's purely a symbol, for instance, it could be another expression. `test_size_with_unbacked_add_and_mul_expr` tests for this.

## Device assertion msg

```
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [4,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [6,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
```

## Autotuning code setup
This is the autotuning code for a concat kernel which takes input tensors (`in_buf`) and writes them to the (`out_buf`).

It's important to note the size of `in_buf0` is the same as `in_buf1` don't match along dim=0. This is bad because all concat inputs must share the same size for each dim except for the concat dim (here that's dim=1).
```
in_buf0 = generate_example_value(size=(u1 + s0, 256))   # concrete size is (17900, 256)
in_buf1 = generate_example_value(size=(u0, 10))         # concrete size is (8192, 10)
...
out_buf = generate_example_value(size=(u1 + s0, 266))   # concrete size is (17900, 256+10)
triton_poi_fused_cat_1.run(in_buf0, in_buf1, ..., out_buf, xnumel=(u1 + s0) * 266 ...)
```

If we look into the kernel code, you'll see that `tmp9` loads `in_buf1` (our incorrectly shaped input tensor). There is also a mask to prevent OOB loads.
- `tmp6`  makes sure we're only loading with the `xindex` from 256 to 264.
- `xmask` makes sure we're only loading with the `xindex` within `xnumel`.
- `tmp6 & xmask` together is essentially checking `0 ≤ x0 < u1 + s0` and `256 ≤ x1 < 264`.

The mask logic is correct, however, `in_buf1` has the shape `[8192, 10]` this means any load where `8192 ≤ x0 < u1 + s0` will be an OOB load.
```
def triton_poi_fused_cat_1(in_buf0, in_buf1, ... out_buf, xnumel, XBLOCK):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)
    xmask = xindex < xnumel
    x0 = (xindex % 264)
    x1 = xindex // 264
    ...
    tmp6 = x0 >= tl.full([1], value=256)
    tmp9 = tl.load(in_buf1 + (x1), tmp6 & xmask)
    # device assertion is thrown here
    tl.device_assert(((0 <= tl.broadcast_to(tmp13, [XBLOCK])) & (tl.broadcast_to(tmp13, [XBLOCK]) < ks0)) | ~(xmask & tmp6), "index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153768
Approved by: https://github.com/jingsh
2025-05-22 02:05:37 +00:00
a264af8c71 Support fp8 output of _scaled_mm for CPU (#153600)
This PR is to support fp8 output of torch._scaled_mm for CPU, and create related UTs with fp8 and bf16/fp16/fp32 output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153600
Approved by: https://github.com/leslie-fang-intel, https://github.com/mingfeima, https://github.com/jansel
2025-05-22 01:15:39 +00:00
254293b777 Add flag _metrics_log_runtime to disable runtime metric logging by default (#153506)
https://github.com/pytorch/pytorch/pull/152708 expanded support of `get_estimated_runtime` to many more types of `SchedulerNodes`. This caused an increase in compile time because we're always calling `get_estimated_runtime` to populate the metrics table. This PR adds a flag for this logging, which reduces the instruction count by 8%. Long term, we should probably merge metrics.py with TORCH_LOGS/tlparse (suggestion from @xmfan).

Update: added support for TORCH_LOGS for the metrics logging.

Test Plan:
mm_loop.py and many existing tests cover.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153506
Approved by: https://github.com/eellison
2025-05-22 01:02:11 +00:00
261897734a Revert "cpp_wrapper: build non-performance-sensitive code at O1 (#148773)"
This reverts commit 3c89cfd46075e62c1725b43557612901a9cbb6fa.

Reverted https://github.com/pytorch/pytorch/pull/148773 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems that pr_time_benchmark is regressed after this land ([comment](https://github.com/pytorch/pytorch/pull/148773#issuecomment-2899545140))
2025-05-22 00:11:14 +00:00
7ef2c62fd3 [ROCm][Inductor][CK] Add ck-tile based universal gemm kernels to torch.mm autotune choices (#152341)
This PR adds code generation for CK-tile based universal gemm kernels to the CK backend for Inductor, and adds these kernels to autotune choices.

Unlike legacy-CK based kernels (which are generated by parsing the CK instances from CK library), we generate the set of instances by manually specifying the tuning parameters.

This PR introduces a new template for code generation, and compilation/autotuning is handled by the existing infrastructure.

Points of discussion:

* For simplicity and reduced coupling with CK, the instance filter checks only data type and layout, and doesn't check the alignment requirement - meaning that more instances will be compiled than necessary - while keeping the code generation independent from internal CK logic which checks the alignment validity at runtime
* CK-tile instances are enabled whenever legacy-CK instances are enabled. A config knob could be introduced to differentiate between the instance types if that's needed
* Whether gemm problem size K is ever dynamic, since whenever it's not a compile-time constant, we need to perform a runtime dispatch between several kernels

** Testing **

Use the existing tests in `test/inductor/test_ck_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152341
Approved by: https://github.com/chenyang78
2025-05-21 23:59:16 +00:00
87fc5af1f6 [c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26 (#154055)
Work around issues like #153960, #152623

NCCL 2.26 seems to introduce random hang in non-blocking API mode. This PR opts out of non-blocking mode to work around it. Previously torch turned it on by default in eager init (i.e. `device_id` passed) to avoid init overhead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154055
Approved by: https://github.com/atalman
2025-05-21 23:46:52 +00:00
fae6f6c9ca [aot] fix deepcopying of aot bwd containing real tensors (#153999)
Previously when we lower backward AOT due to symints, the post grad passes would leave the bw_module in a non-runnable state. This caused issues when compiled autograd tried to trace at runtime. So we had inductor operate on a deepcopy of bw_module.

But with https://github.com/pytorch/pytorch/issues/153993, we see that deepcopying real tensors will fail under fake mode due to the device type mismatch between the fake tensors ("meta" device) and the real tensor. So by disabling fake mode, we avoid these errors. This change is a strict improvement over current, but it does reveal that this deepcopy can theoretically cause OOMs.

FIXES https://github.com/pytorch/pytorch/issues/153993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153999
Approved by: https://github.com/jamesjwu, https://github.com/bdhirsh
2025-05-21 23:30:02 +00:00
67f9feeee7 remove TestCustomOp.test_impl_device_cpu from dynamo expected failures (#154049)
Fixes https://github.com/pytorch/pytorch/issues/153763 maybe?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154049
Approved by: https://github.com/StrongerXi
2025-05-21 23:20:30 +00:00
5ee1242310 Follow up to #152209, remove compat patch after docker image rename (#152958)
Remove compat patch that lets PRs that haven't rebased base #152209 still have docker images.

Merge ~2 weeks after the above PR was merged.  ~80% of PRs have a merge base that is <2 weeks old

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152958
Approved by: https://github.com/huydhn
2025-05-21 23:11:29 +00:00
d82610c2af docs: fix "should not to be" typo in register_buffer docstring (#153817)
Corrects a small grammatical error in `register_buffer` docstring, from "...  should not to be ..." to "...  should not be ...". Docs-only change, so no runtime behavior, tests, or APIs are affected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153817
Approved by: https://github.com/mikaylagawarecki
2025-05-21 22:46:50 +00:00
b967b7b11e Update rnn.py, fix torch.nn.RNN document error (#153620)
I found the same issue as #147490 (@jibril-b-coulibaly).

There's an equivalent in the [doc-string](https://docs.pytorch.org/docs/stable/generated/torch.nn.RNN.html#rnn) of `torch.nn.RNN`:

```python
# Efficient implementation equivalent to the following with bidirectional=False
def forward(x, hx=None):
    if batch_first:
        x = x.transpose(0, 1)
    seq_len, batch_size, _ = x.size()
    if hx is None:
        hx = torch.zeros(num_layers, batch_size, hidden_size)
    h_t_minus_1 = hx
    h_t = hx
    output = []
    for t in range(seq_len):
        for layer in range(num_layers):
            h_t[layer] = torch.tanh(
                x[t] @ weight_ih[layer].T
                + bias_ih[layer]
                + h_t_minus_1[layer] @ weight_hh[layer].T
                + bias_hh[layer]
            )
        output.append(h_t[-1])
        h_t_minus_1 = h_t
    output = torch.stack(output)
    if batch_first:
        output = output.transpose(0, 1)
    return output, h_t

```

However there's something wrong.

1. Like mentioned in #147490, line 499 is wrong

fb55bac3de/torch/nn/modules/rnn.py (L499)

The **input for RNNCell should be different** for different layers.

2. The code contains several hidden **reference-related issues** that may result in unintended modifications to tensors. For example in line 504, this causes all elements in the final output list to point to the same tensor.

fb55bac3de/torch/nn/modules/rnn.py (L504)

3. Some variable is not **defined**. Despite being a relatively minor issue in annotation, it can lead to significant confusion for those who are new to the concept. For example `weight_ih` in line 499

fb55bac3de/torch/nn/modules/rnn.py (L499)

So, i write a runnable version to make it more clear:

```python
# Efficient implementation equivalent to the following with bidirectional=False
rnn = nn.RNN(input_size, hidden_size, num_layers)
params = dict(rnn.named_parameters())
def forward(x, hx=None, batch_first=False):
    if batch_first:
        x = x.transpose(0, 1)
    seq_len, batch_size, _ = x.size()
    if hx is None:
        hx = torch.zeros(rnn.num_layers, batch_size, rnn.hidden_size)
    h_t_minus_1 = hx.clone()
    h_t = hx.clone()
    output = []
    for t in range(seq_len):
        for layer in range(rnn.num_layers):
            input_t = x[t] if layer == 0 else h_t[layer - 1]
            h_t[layer] = torch.tanh(
                input_t @ params[f"weight_ih_l{layer}"].T
                + h_t_minus_1[layer] @ params[f"weight_hh_l{layer}"].T
                + params[f"bias_hh_l{layer}"]
                + params[f"bias_ih_l{layer}"]
            )
        output.append(h_t[-1].clone())
        h_t_minus_1 = h_t.clone()
    output = torch.stack(output)
    if batch_first:
        output = output.transpose(0, 1)
    return output, h_t
```

This code can reproduce the computation of torch.nn.RNN.

For example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, num_layers = 3, 5, 2
rnn = nn.RNN(input_size, hidden_size, num_layers)
params = dict(rnn.named_parameters())
x = torch.randn(10, 4, 3)

official_imp = rnn(x)
my_imp = forward(x)

assert torch.allclose(official_imp[0], my_imp[0])
assert torch.allclose(official_imp[1], my_imp[1])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153620
Approved by: https://github.com/mikaylagawarecki
2025-05-21 22:45:28 +00:00
5b6e551c0f [AOTI][refactor] Fix an anonymous namespace issue (#154033)
Summary: Remove anonymous namespace in model_container.h to fix the following compiler warning,
```
warning: ‘torch::aot_inductor::AOTInductorModelContainer’ has a field ‘torch::aot_inductor::AOTInductorModelContainer::constant_folded_’ whose type uses the anonymous namespace [-Wsubobject-linkage]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154033
Approved by: https://github.com/chenyang78
2025-05-21 22:29:09 +00:00
d356ca2466 [map] add inductor support by lowering to while_loop (#150971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150971
Approved by: https://github.com/zou3519
ghstack dependencies: #151034
2025-05-21 22:19:47 +00:00
cf1b38a017 [map] make proxy mode re-dispatch to fake key (#151034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151034
Approved by: https://github.com/zou3519
2025-05-21 22:19:47 +00:00
c13eeaa718 Move inductor benchmark jobs to CUDA 12.8 (#154004)
For benchmark jobs, we usually want to run with the latest support CUDA version to get the best performance. This is a request coming from NVIDIA where they are running inductor benchmarks on Blackwell with CUDA 12.8 (min support version) and looking for an apples-to-apples comparison.

This also clean up references to CUDA 12.4 which have been sunset in PyTorch CI.

### Testing

- H100 benchmark https://github.com/pytorch/pytorch/actions/runs/15151424588
- Micro benchmark https://github.com/pytorch/pytorch/actions/runs/15151445957 (I just realize that this is still running on A100, @yanboliang Do you want to run on H100 now that we have capacity there?  It would also solve the problem of GPU memory)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154004
Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/seemethere
2025-05-21 22:17:10 +00:00
053ca7439a [cutlass backend] Add serializer for cutlass ops (#153894)
Differential Revision: [D74524786](https://our.internmc.facebook.com/intern/diff/D74524786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153894
Approved by: https://github.com/ColinPeppler, https://github.com/mlazos
2025-05-21 22:01:40 +00:00
401fa87ace make only current thread allocate to pool in NcclPG (#153990)
follow up to #153356 that fixes nccl allocation to pool

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153990
Approved by: https://github.com/kwen2501
2025-05-21 21:57:37 +00:00
28bcd9eb30 [aoti] Initial Metal support (#153959)
An example generated file: P1816629015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959
Approved by: https://github.com/malfet, https://github.com/desertfire
ghstack dependencies: #153964
2025-05-21 21:55:59 +00:00
918ae5d361 [aoti] Add MPS runner and shim (#153964)
Added AOTIModelContainerRunnerMps and a shim for mps fallback ops.
I also added a mps-specific shim which contains one operator, which will be used to set arguments being passed to the Metal kernel:

```
AOTI_TORCH_EXPORT AOTITorchError aoti_torch_mps_set_arg(
    AOTIMetalKernelFunctionHandle func,
    unsigned idx,
    AtenTensorHandle tensor);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153964
Approved by: https://github.com/malfet, https://github.com/desertfire
2025-05-21 21:55:59 +00:00
0b79a8c1a9 [dynamo] renamed _fn for more clarity and put a comment of user compiler user (#154026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154026
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
2025-05-21 21:12:51 +00:00
0e5f2339d0 [ROCm][Windows] Run hipcc with compatibility flags. (#153986)
See also https://github.com/ROCm/TheRock/issues/590. Including the `-Wno-ignored-attributes` flag here avoids 700MB of log warning spam while compiling and the `-fms-extensions` seems beneficial to include: https://clang.llvm.org/docs/MSVCCompatibility.html.

Co-authored-by: Aaryaman Vasishta <jem456.vasishta@gmail.com>
Co-authored-by: Scott Todd <scott.todd0@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153986
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily

Co-authored-by: Aaryaman Vasishta <jem456.vasishta@gmail.com>
2025-05-21 20:26:52 +00:00
3c89cfd460 cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-21 20:23:04 +00:00
4c6f0fe22f [dynamo] Properly handle torch.script.jit under @staticmethod (#153984)
Fixes #153607.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153984
Approved by: https://github.com/williamwen42
2025-05-21 19:45:06 +00:00
b184e3da9c [easy] Fix internal only test (#154035)
Internally static cuda launcher isn't enabled, so we need to always enable it

Differential Revision: [D75146584](https://our.internmc.facebook.com/intern/diff/D75146584/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154035
Approved by: https://github.com/Skylion007
ghstack dependencies: #153565
2025-05-21 19:00:55 +00:00
8e6e79fc1b [hop_schema] support gen_schema for invoke_subgraph (#152984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152984
Approved by: https://github.com/zou3519
ghstack dependencies: #151067, #152974
2025-05-21 18:55:46 +00:00
9c33899196 [hop_schema] add HopSchemaGenerator to make it easier to create hop schema (#152974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152974
Approved by: https://github.com/zou3519
ghstack dependencies: #151067
2025-05-21 18:55:46 +00:00
1e0f19e173 auto functionalize base_hop (#151067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151067
Approved by: https://github.com/zou3519
2025-05-21 18:55:46 +00:00
11c0ffefcd Cache code generation during triton template expansion and enable it for mm_template. (#151773)
In a model, we see ~~ 40% of the time in mm/addmm tuning. The model have 2000 mm,
many of which receives the same input shapes.

with autotune enabled, this become expensive, while we already cache auto tuning results, we
did not used to cache the generation of the python code and the loading for each config that we autotune on.

This diff handles the code generation part (template expansions) a previous diff handled the loading part.
This is expected to save 20% of the model I am working on.

How do we do the caching?
For a given configurations and input layout, the generated code is always the same. One caveat is that
some other information collected during code generation are input dependent (namely depends on inputs
names and symbol names in inputs). and not just layout. !
To handle those we use a record and replay approach, where we record the functions that are called during
code generation that effect those outputs and replay them at a cache hit.

Effect on the current benchmark on a local run on dev server.
mm_loop. 24115830838 -> 18362098019
mm_loop_dynamic 30506097176-> 25697270062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151773
Approved by: https://github.com/eellison
2025-05-21 18:55:41 +00:00
aec7cc60d7 add graph_code_verbose_log artifact for fx passes (#153775)
Fixes #153646

This PR refactors the logging behavior in the FX pass insert_deferred_runtime_asserts and runtime_assert.py to separate verbose/intermediate graph logs from the final output graph log. All verbose logs generated during the FX pass are now routed to a new artifact logger, graph_code_verbose, while only the final output graph remains logged to the original graph_code artifact.

Changes

- Added a new artifact logger: [graph_code_log = torch._logging.getArtifactLogger(__name__, "graph_code_verbose")]
- Updated all verbose/intermediate FX pass logs in [insert_deferred_runtime_asserts] to use the new graph_code_verbose artifact.
- Ensured that only the final output graph is logged to the original graph_code artifact.
- No changes to the FX pass logic or output—only logging behavior is affected.

Notes
This change is backward-compatible and does not affect the functional behavior of FX passes.
No changes to user-facing APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153775
Approved by: https://github.com/williamwen42
2025-05-21 18:31:59 +00:00
d23f4ae7b5 s390x: use qemu issue workaround for runner registration too (#154030)
s390x: use qemu issue workaround for runner registration too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154030
Approved by: https://github.com/seemethere
2025-05-21 18:30:25 +00:00
bb7e30c165 [MegaCache] Make MegaCache generic to allow external plugins registration (#152977)
Implements #152976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152977
Approved by: https://github.com/oulgen
2025-05-21 18:18:47 +00:00
c31e239910 [precompile] Add BundledAOTAutogradCacheEntry (#152840)
Finally, this PR adds BundledAOTAutogradCacheEntry. A BundledAOTAutogradCacheEntry is an AOTAutogradCacheEntry that saves the entire CompiledFxGraph directly in the entry.

This has some advantages:
- No more dependency on FxGraphCache at all
- Clearing FxGraphCache does not result in AOTAutogradCache miss
- Simpler logic, as BundledAOTAutogradCacheEntry has everything you need to load a full compiled python wrapper from a dynamo output

We plan to use BundledAOTAutogradCacheEntry for precompile. There's also a question of whether we want to use it for regular caching — the main disadvantage of this is having to save the same CompiledFxGraph twice, once in Inductor cache and once for AOTAutogradCache. With MegaCaching, this *could* be a regression in total cache size (as well as a minor cold start regression, as you have to save the same graph twice). I will import this and measure the mega cache space complexity, and if it looks good I'll enable it by default for caching as well.

On warm start, if AOTAutogradCache hits, you won't have to load inductor at all, so warm start overhead should be unaffected.

Differential Revision: [D74593304](https://our.internmc.facebook.com/intern/diff/D74593304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152840
Approved by: https://github.com/zhxchen17
2025-05-21 18:08:42 +00:00
3eb8fa081a Revert "[3/n][Optimus][Auto-AC] Support float8_e4m3fn quantization type and set scaling as the default (#153802)"
This reverts commit 32b1baa981fe53d13d77acbee509c51087abf107.

Reverted https://github.com/pytorch/pytorch/pull/153802 on behalf of https://github.com/malfet due to It breaks ROCM testing, see d23762974e/1 ([comment](https://github.com/pytorch/pytorch/pull/153802#issuecomment-2898695702))
2025-05-21 17:20:31 +00:00
d23762974e [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-21 17:12:05 +00:00
72a3c8dfa8 [AOTI][reland] Add an option to specify custom op C shim (#153968)
Summary: Reland https://github.com/pytorch/pytorch/pull/153851 after fixing a fuzzer test issue.

Add an option to tell AOTInductor codegen to generate C shim functions for certain custom ops instead of relying on ProxyExecutor. The lib that defines custom ops need to implement corresponding C shim functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153968
Approved by: https://github.com/hl475
2025-05-21 15:57:57 +00:00
b7d08defe9 [BE]: Type previously untyped decorators (#153726)
This fixes decorator typing which unmasks a lot of typing issues in the codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153726
Approved by: https://github.com/albanD
2025-05-21 15:56:19 +00:00
6c2c527cd6 [BE] Remove extra semicolons from SymmetricMemory.hpp (#154034)
Fixes
```
In file included from /Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.cpp:1:
/Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.hpp:77:4: warning: extra ';' after member function definition [-Wextra-semi]
   77 |   };
      |    ^
/Users/malfet/git/pytorch/pytorch/torch/csrc/distributed/c10d/SymmetricMemory.hpp:81:4: warning: extra ';' after member function definition [-Wextra-semi]
   81 |   };
      |    ^
2 warnings generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154034
Approved by: https://github.com/Skylion007
2025-05-21 14:33:30 +00:00
33767eb391 Add option to statically launch user defined triton kernels (#153725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153725
Approved by: https://github.com/oulgen, https://github.com/Mingming-Ding, https://github.com/jansel
ghstack dependencies: #153565
2025-05-21 14:33:15 +00:00
b73d77900e [CI] Run limited h100 tests every 6 hours (#153900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153900
Approved by: https://github.com/Skylion007
2025-05-21 13:40:03 +00:00
11f8511455 Update torch-xpu-ops commit pin (#153902)
Update the torch-xpu-ops commit to defce46ae7, includes:

- Resolve the aten::gamma accuracy gap compared to scipy
- Optimize layernom_vectorized_impl by using adaptive wg selection for small shapes
- [Intro async flag and use current stream avoid stream sync](https://github.com/intel/torch-xpu-ops/pull/1546)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153902
Approved by: https://github.com/Skylion007, https://github.com/EikanWang
2025-05-21 13:29:41 +00:00
616e20b4f1 [Intel GPU] scalar tensor case handling in addmm, baddmm (#153051)
# Motivation
This PR adds scalar tensor (`t.elem()==1 and t.sizes().empty() == true and t.dim()=0` )handling in addmm, baddmm. The issue is found during the process of oneDNN upgradation. Found that the new version of oneDNN requires the post-binary (self in addmm) has same dimension as the one of output tensor. Now we need explicitly expand the shape of `self` tensor. Former version dnnl may help us do the broadcasting inside.

This PR could fix issues in https://github.com/intel/torch-xpu-ops/issues/1612 and CI error in https://github.com/pytorch/pytorch/pull/151767.

# Implementation
We treat the scalar tensor as normal tensor by `unsqueeze` it as 1 dimension tensor. Accompanied with the existing shape handle logic, it would be further `unsqueeze` to 2D or 3D shape.

UT testing
```
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_addmm_xpu_float32
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_addmv_xpu_float32
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_baddbmm_xpu_float16
python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoXPU.test_comprehensive_baddbmm_xpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153051
Approved by: https://github.com/EikanWang, https://github.com/guangyey
2025-05-21 12:24:37 +00:00
afd7a13bca Migrate to new Windows Arm64 runners (#152099)
This PR moves the Windows Arm64 nightly jobs to the new runner image, see [arm-windows-11-image](https://github.com/actions/partner-runner-images/blob/main/images/arm-windows-11-image.md )

Fixes #151671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152099
Approved by: https://github.com/seemethere
2025-05-21 09:13:15 +00:00
ffd49d538e [BE][Ez]: Improve typing in torch/modules/container.py (#153728)
Adds some missing type annotations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153728
Approved by: https://github.com/albanD
2025-05-21 07:15:00 +00:00
a636a92ee9 [MTIA ATen Backend] Migrate "_unsafe_view" and "view" ops from out-of-tree to pytorch in-tree (#153670)
Summary:
# Context
The MTIA New Aten Backend work is essentially to move MTIA operators from pytorch out-of-tree to in-tree, with following benefits:
1. Avoid duplicate code copied from pytorch, e.g. view ops implementation, util functions.
2. Utilize TensorIterator and structured kernel codegen, avoid manual implementation of broadcasting, dtype casting, asserting, etc.
3. Eliminate MTIA's own codegen flow, which is unnecessary complexity.
4. Overall make MTIA's aten backend more pytorch native.

Differential Revision: D74672464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153670
Approved by: https://github.com/albanD, https://github.com/nautsimon
2025-05-21 05:20:45 +00:00
dcb3edd30d [AOTI][XPU] Refactor AOTInductor runtime API for Intel GPU. (#153929)
Simplify and improve code format for sycl_runtime_wrappers.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153929
Approved by: https://github.com/desertfire
ghstack dependencies: #153924
2025-05-21 03:52:54 +00:00
531d8f5fb6 Revert "[cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)"
This reverts commit 2b43d635d31f1338743885efd1a259f43bd2ee65.

Reverted https://github.com/pytorch/pytorch/pull/153556 on behalf of https://github.com/eqy due to reverting, will add disable for reduced precision reduction ([comment](https://github.com/pytorch/pytorch/pull/153556#issuecomment-2896257521))
2025-05-21 02:09:11 +00:00
1478d0185c Revert "[CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 (#151594)"
This reverts commit 8cabd23b3d357ec38a400978bb5423efcb433f2a.

Reverted https://github.com/pytorch/pytorch/pull/151594 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a distributed test in trunk ([comment](https://github.com/pytorch/pytorch/pull/151594#issuecomment-2896230131))
2025-05-21 01:45:20 +00:00
daa68e7a93 Update USE_XCCL option if USE_XPU is OFF (#153936)
# Motivation
Disable `USE_XCCL` when `USE_XPU` is turned `OFF` to ensure configuration consistency. This is required because XCCL depends on XPU functionality.
Especially, ensure that `USE_XCCL` is correctly set to `OFF` when [caffe2_update_option(USE_XPU OFF)](1075bb37d3/cmake/Dependencies.cmake (L97)) is invoked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153936
Approved by: https://github.com/Skylion007
2025-05-21 01:32:41 +00:00
cf6e5d1881 Revert "[cuBLASLt] relax addmm cuBLASLt constraint (#153675)"
This reverts commit f9bb7cf72a43bec17b5fc2ccbe865aa130e760be.

Reverted https://github.com/pytorch/pytorch/pull/153675 on behalf of https://github.com/eqy due to incorrect, cuBLASLt doesnt handle beta != 1.0 but this appears untested ([comment](https://github.com/pytorch/pytorch/pull/153675#issuecomment-2896188784))
2025-05-21 01:20:10 +00:00
fe49b11e09 Add memory reporting for XPU to Memory Profiler (#152842)
Adds support for XPU profile_memory in Pytorch Profiler.

Currently, when `profile_memory=True` is passed to `torch.profiler.profile`, there is no XPU memory reported. For example, the profiling table printed by the code below is missing any `XPU Mem` columns:

<details><summary>profiling.py</summary>
<p>

```python
import torch
import torch.nn as nn
import torch.optim as optim

from torch.profiler import profile, ProfilerActivity

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.conv1 = nn.Conv1d(20,20,15,padding="same")
        self.flatten = nn.Flatten()
        self.net1 = nn.Linear(2048, 4096)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(4096, 5)

    def forward(self, x):
        res = self.conv1(x)
        res = self.flatten(res)
        res = self.net1(res)
        return self.net2(self.relu(res))

def demo_basic():
    model = ToyModel().to("xpu")
    loss_fn = nn.MSELoss().to("xpu")
    optimizer = optim.SGD(model.parameters(), lr=0.001)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU], profile_memory=True) as prof:
        for epoch in range(10):
            optimizer.zero_grad()
            outputs = model(torch.randn(20, 2048).to("xpu"))
            labels = torch.randn(20, 5).to("xpu")
            loss_fn(outputs, labels).backward()
            optimizer.step()
    print(prof.key_averages().table(max_name_column_width=100, sort_by="xpu_time_total", row_limit=100))

if __name__ == "__main__":
    demo_basic()
```
</p>
</details>

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg       CPU Mem  Self CPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                            gemm_kernel         0.00%       0.000us         0.00%       0.000us       0.000us       1.501ms        44.73%       1.501ms      25.024us           0 b           0 b            60
    autograd::engine::evaluate_function: AddmmBackward0         0.12%       1.067ms        30.47%     260.929ms      13.046ms       0.000us         0.00%       1.009ms      50.448us           0 b           0 b            20
                                         AddmmBackward0         0.09%     744.983us        15.99%     136.944ms       6.847ms       0.000us         0.00%     784.640us      39.232us           0 b           0 b            20
                                               aten::mm        15.41%     131.956ms        15.79%     135.167ms       3.379ms     784.640us        23.37%     784.640us      19.616us           0 b           0 b            40
                                           aten::linear         0.02%     156.361us        20.58%     176.187ms       8.809ms       0.000us         0.00%     741.760us      37.088us           0 b           0 b            20
                                            aten::addmm        20.25%     173.371ms        20.52%     175.723ms       8.786ms     741.760us        22.10%     741.760us      37.088us           0 b           0 b            20
                                Optimizer.step#SGD.step         0.40%       3.429ms         5.55%      47.509ms       4.751ms       0.000us         0.00%     488.960us      48.896us           0 b           0 b            10
                                    aten::_foreach_add_         4.81%      41.162ms         5.15%      44.080ms       4.408ms     488.960us        14.57%     488.960us      48.896us           0 b           0 b            10
at::native::xpu::MultiTensorApplyKernelFunctor<at::n...         0.00%       0.000us         0.00%       0.000us       0.000us     422.880us        12.60%     422.880us      42.288us           0 b           0 b            10
autograd::engine::evaluate_function: ConvolutionBack...         0.03%     280.041us         4.36%      37.328ms       3.733ms       0.000us         0.00%     356.320us      35.632us           0 b           0 b            10
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 856.227ms
Self XPU time total: 3.357ms
```

This PR updates the XPUCachingAllocator.cpp to report allocation events to the Profiler, and causes these to be printed in the table:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg       CPU Mem  Self CPU Mem       XPU Mem  Self XPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                            gemm_kernel         0.00%       0.000us         0.00%       0.000us       0.000us       1.436ms        43.64%       1.436ms      23.939us           0 b           0 b           0 b           0 b            60
    autograd::engine::evaluate_function: AddmmBackward0         0.13%       1.186ms        29.92%     262.875ms      13.144ms       0.000us         0.00%       1.005ms      50.272us           0 b           0 b     320.94 Mb      -4.69 Mb            20
                                         AddmmBackward0         0.09%     815.288us        16.48%     144.802ms       7.240ms       0.000us         0.00%     790.720us      39.536us           0 b           0 b     325.47 Mb           0 b            20
                                               aten::mm        15.86%     139.342ms        16.26%     142.875ms       3.572ms     790.720us        24.03%     790.720us      19.768us           0 b           0 b     325.47 Mb     325.47 Mb            40
                                           aten::linear         0.02%     182.856us        20.46%     179.775ms       8.989ms       0.000us         0.00%     669.440us      33.472us           0 b           0 b       3.13 Mb           0 b            20
                                            aten::addmm        20.10%     176.607ms        20.40%     179.210ms       8.961ms     669.440us        20.34%     669.440us      33.472us           0 b           0 b       3.13 Mb       3.13 Mb            20
                                Optimizer.step#SGD.step         0.42%       3.692ms         5.61%      49.267ms       4.927ms       0.000us         0.00%     486.640us      48.664us           0 b           0 b           0 b           0 b            10
                                    aten::_foreach_add_         4.83%      42.439ms         5.19%      45.574ms       4.557ms     486.640us        14.79%     486.640us      48.664us           0 b           0 b           0 b     -20.00 Kb            10
at::native::xpu::MultiTensorApplyKernelFunctor<at::n...         0.00%       0.000us         0.00%       0.000us       0.000us     420.960us        12.79%     420.960us      42.096us           0 b           0 b           0 b           0 b            10
autograd::engine::evaluate_function: ConvolutionBack...         0.04%     310.719us         4.47%      39.279ms       3.928ms       0.000us         0.00%     339.520us      33.952us           0 b           0 b      -2.89 Mb      -3.12 Mb            10
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 878.627ms
Self XPU time total: 3.291ms
```

These XPU memory numbers match the same profiling results on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152842
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-05-21 01:19:19 +00:00
8817e5ac80 Render Example: and not Example:: in docs (#153978)
Everything here is a grep except the changes in tools/autograd/load_derivatives.py which I manually corrected.

The correct notation is:
```
Example::

    >>> ...
```

It is common and wrong to have:
```
Example::
    >>> ...
```

In the wrong example, we get these pesky double colons:
![image](https://github.com/user-attachments/assets/20ffd349-68bb-4552-966c-e23923350476)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153978
Approved by: https://github.com/soulitzer, https://github.com/malfet
2025-05-21 01:03:26 +00:00
0959869683 Bump triton pin for the release 3.3.1 of triton (#153951)
Triton is pointing to latest triton pin : https://github.com/triton-lang/triton/tree/release/3.3.x
XPU pointing to latest XPU pin: https://github.com/intel/intel-xpu-backend-for-triton/commits/release/3.3.x/

This version contains the fix for: Compilation Issue for RTX 5090 GPUs with Compute Capability = 120. https://github.com/triton-lang/triton/pull/6771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153951
Approved by: https://github.com/davidberard98
2025-05-21 00:27:39 +00:00
32b1baa981 [3/n][Optimus][Auto-AC] Support float8_e4m3fn quantization type and set scaling as the default (#153802)
Summary:
1. Customers now can test with float8_e4m3fn.
2. To play safe, we set the scaling version as the default.

Test Plan:
### unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization
```

Buck UI: https://www.internalfb.com/buck2/f679f362-8bf4-454c-87df-a85cbc2ab2a8
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549861047443
Network: Up: 16KiB  Down: 3.9MiB  (reSessionID-98badbfd-76f7-487f-ab1c-1ec4f850614d)
Analyzing targets. Remaining     0/281
Executing actions. Remaining     0/5957                                                                                                   7.3s exec time total
Command: test.     Finished 3 local, 1 remote
Time elapsed: 1:29.7s
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D74910193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153802
Approved by: https://github.com/nareshrajkumar866, https://github.com/Hahu803, https://github.com/Mingming-Ding
2025-05-21 00:21:54 +00:00
58dc80dff6 [MPSInductor] Fix indexing calculation (#153997)
By using `c10:🤘:floor_divie` primitive

Which fixes `test_flip_cat_mps` test, and makes `doctr_reco_predictor` and `doctr_det_predictor` pass accuracy checks (at least locally, scheduled a workflow dispatch to validate it in CI)

Before this change following script generated different compile and eager results
```python
import torch

def foo(unsqueeze, unsqueeze_1):
    cat_1 = torch.ops.aten.cat.default([unsqueeze, unsqueeze_1], 1)
    view = torch.ops.aten.view.default(cat_1, [4])
    slice_5 = torch.ops.aten.slice.Tensor(view, 0, 0, 3)
    rev_1 = torch.ops.aten.flip.default(slice_5, [0])
    return rev_1

if __name__ == "__main__":
    x = torch.arange(1.0, 3.0, device='mps').reshape(2, 1)
    y = torch.arange(5.0, 7.0, device='mps').reshape(2, 1)

    rc, (kernel,) = torch._inductor.utils.run_and_get_kernels(torch.compile(foo), x, y)
    print(kernel)
    print("Compile: ", rc)
    print("Eager: ", foo(x, y))
```
After this change
```
'''
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant float* in_ptr0,
        constant float* in_ptr1,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp6 = in_ptr0[1 + (c10:🤘:floor_divide((-1)*x0, 2))];
        auto tmp11 = in_ptr1[1 + (c10:🤘:floor_divide((-1)*x0, 2))];
        auto tmp0 = (2 + ((-1)*x0)) % (2);
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 0;
        auto tmp3 = tmp1 >= tmp2;
        auto tmp4 = 1;
        auto tmp5 = tmp1 < tmp4;
        auto tmp7 = tmp5 ? tmp6 : 0.0;
        auto tmp8 = tmp1 >= tmp4;
        auto tmp9 = 2;
        auto tmp10 = tmp1 < tmp9;
        auto tmp12 = tmp8 ? tmp11 : 0.0;
        auto tmp13 = tmp5 ? tmp7 : tmp12;
        out_ptr0[x0] = static_cast<float>(tmp13);
    }
'''
Compile:  tensor([2., 5., 1.], device='mps:0')
Eager:  tensor([2., 5., 1.], device='mps:0')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153997
Approved by: https://github.com/dcci
ghstack dependencies: #153970, #153971
2025-05-21 00:03:46 +00:00
fc33da410f Add torch/header_only_apis.txt and enforce they're tested (#153635)
This PR adds enforcement of testing header only APIs.

The benefit of torch/header_only_apis.txt is twofold:
1) this gives us a clear view of what we expect to be header only
2) this allows us to enforce testing

The enforcement added in this PR is very basic--we literally string match that a symbol in `torch/header_only_apis.txt` is in a cpp test. This is meant to be a first step in verifying our APIs are properly tested and can get fancier over time. For now, I've added myself as a codeowner to learn what to look out for in terms of proper tests. Over time, I anticipate we can automate more steps, but right now let's just get something out the door.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153635
Approved by: https://github.com/albanD
ghstack dependencies: #153965
2025-05-20 23:42:24 +00:00
41a9aa6564 Remove janky (though at times useful) dlclose test (#153975)
This test was never the shining star in class but it helped check that we can properly delete a stable library. But now that we are running it in CI this is not a good test to annoy people with as dlclose + parallelism is likely not the move. I will miss it locally though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153975
Approved by: https://github.com/jbschlosser
2025-05-20 23:26:42 +00:00
7b7604fdb4 Revert "[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)"
This reverts commit 0c04492e3b142854fad8356a2a4d74f12e2c6c5d.

Reverted https://github.com/pytorch/pytorch/pull/153335 on behalf of https://github.com/malfet due to Breaks lint, see 3742b7fb3a/1 ([comment](https://github.com/pytorch/pytorch/pull/153335#issuecomment-2896031661))
2025-05-20 23:12:11 +00:00
3742b7fb3a Treat dim=[] same as dim=None (#153570)
Fixes https://github.com/pytorch/pytorch/issues/153568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153570
Approved by: https://github.com/ngimel
2025-05-20 22:44:29 +00:00
f7b8eadd9d Add codeowner for merge rules (#152354)
To ensure changes to merge rights are properly reviewed
Also make the codeowner file valid by removing invalid users
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152354
Approved by: https://github.com/malfet
2025-05-20 22:24:23 +00:00
0c04492e3b [inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)
Motivation:
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335
Approved by: https://github.com/eellison
2025-05-20 22:19:02 +00:00
2c2524f74b [AOTI] Generate unique cubin file names when package_cpp_only (#153948)
Summary:
* When package_cpp_only is specified, generate kernel file names with unique kernel names to make the final packaged package files more readable. Assert on unique_kernel_names in case somehow it was explicitly set to False.
* Fix a rocm test skip, see https://github.com/pytorch/pytorch/pull/153828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153948
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-05-20 22:07:53 +00:00
8cabd23b3d [CI][CUDA] Move cu118 distributed pull jobs to cu126, move cu124-sm75 to cu126-sm75 (#151594)
This PR moves distributed cuda CI job from cuda 11.8 to cuda 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so temporarily skip them after creating the issues.

https://github.com/pytorch/pytorch/issues/153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
https://github.com/pytorch/pytorch/issues/153122 CUDA context related
https://github.com/pytorch/pytorch/issues/153517  NCCL regression, future NCCL may fix it

See: https://github.com/pytorch/pytorch/issues/147383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever
2025-05-20 21:56:47 +00:00
2b43d635d3 [cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)
Also enables unified workspaces by default for non-FBCODE use cases.
Default Lt workspace size is also updated to match cuBLAS logic for default, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0).

Recommended defaults are documented here:
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-20 21:51:49 +00:00
aeb734f519 [nativert] Move GraphSignature to pytorch core (#152969)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

Added an in-memory representation for input and output specs of a graph. The GraphSignature class models the input and output specs of an exported graph produced by torch.export, which holds the graph information deserialized from the pt2 archive package. Runtime relies on the GraphSignature for weight name lookup and weight loading.

The serialization schema is defined in torch/_export/serde/schema.py
See more at: https://docs.pytorch.org/docs/stable/export.html#torch.export.ExportGraphSignature

Test Plan: Added tests under `test/cpp/nativert/test_graph_signature.cpp`

Differential Revision: D73895378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152969
Approved by: https://github.com/swolchok
2025-05-20 21:49:56 +00:00
8f943046f8 [BE] light cleanups to linter logic (#153965)
some BE cleanup on other lint things I saw while doing the top of the this stack

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153965
Approved by: https://github.com/soulitzer
2025-05-20 21:28:48 +00:00
deaf6c2f2f Address the ignored warnings for -Wmissing-field-initializers in the file fbcode/caffe2/aten/src/ATen/native/cuda/RowwiseScaledMM.cu (#153958)
Summary:
the error message  https://www.internalfb.com/sandcastle/workflow/698057942249983018/artifact/actionlog.698057942382778255.stderr.1?selectedLines=66-66-70-148 from D74892646

When switching the host compiler to Clang, maybe we should only silence these warnings in this file.

Test Plan: sandcastle_green

Differential Revision: D75029051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153958
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-05-20 21:25:56 +00:00
6cb7e4b5a5 [EZ] Update mps xfail reason (#153971)
cummin is not implemented

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153971
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #153970
2025-05-20 21:15:14 +00:00
03859242ce [Testing] Fix test_deterministic_... on MPS (#153970)
By decorated emitted kernels with `'''` rather than `"""`

To match regex in `torch._inductor.utils.run_and_get_kernels`
This fixes `test_deterministic_codegen_mps`, `test_deterministic_codegen_on_graph_break_mps` and `test_deterministic_codegen_with_suffix_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153970
Approved by: https://github.com/dcci, https://github.com/jansel
2025-05-20 21:15:14 +00:00
3aa95b252a Fix test_side_stream_backward_overlap flakiness (#153963)
Fixes https://github.com/pytorch/pytorch/issues/153927

Although the autograd backward should always execute SideBw before MainBw, there is still a small chance the recorded events won't be in that order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153963
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
ghstack dependencies: #151079, #153412
2025-05-20 21:02:56 +00:00
500a710422 Revert "Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be ran correctly (#153245)"
This reverts commit 2e56ce097a201ff3c69610cea953a9efce17d1b1.

Reverted https://github.com/pytorch/pytorch/pull/153245 on behalf of https://github.com/yangw-dev due to tests failed internally [D75078034](https://www.internalfb.com/diff/D75078034) ([comment](https://github.com/pytorch/pytorch/pull/153245#issuecomment-2895785642))
2025-05-20 20:45:55 +00:00
179e7d8624 Fix vs2022 caused AVX512 illegal instruction issue. (#153480)
Fixes #145702

Add `/d2implyavx512upperregs-` to disable compiler over-aggressive optimization, which caused involeved AVX512 register on AVX2 machine.

Reference to: https://github.com/pytorch/pytorch/issues/145702#issuecomment-2874029459

Local test passed:
<img width="1208" alt="image" src="https://github.com/user-attachments/assets/26f4cb91-6bb5-416f-aa35-c899eb1489b2" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153480
Approved by: https://github.com/Blackhex, https://github.com/cyyever, https://github.com/atalman
2025-05-20 20:37:00 +00:00
996c4d803d Removing conda references from PyTorch Docs (#152702)
Addresses #148339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152702
Approved by: https://github.com/svekars, https://github.com/albanD, https://github.com/atalman
2025-05-20 20:33:28 +00:00
05bc78e64f [submodule] Update fbgemm pinned version (#153950)
Summary:
Update fbgemm pinned version in PyTroch.
Related update in fbgemm: D74434751

Included changes:
Update fbgemm external dependencies directory in setup.py
Add DISABLE_FBGEMM_AUTOVEC flag to disable fbgemm's autovec

Test Plan: PyTorch OSS CI

Differential Revision: D75073516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153950
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-20 20:24:27 +00:00
eqy
823a35807c [CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)
For #152816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153101
Approved by: https://github.com/Skylion007
2025-05-20 20:19:03 +00:00
e8f8baf71f set CUDA_MODULE_LOADING for older drivers only (#152695)
`CUDA_MODULE_LOADING=LAZY` is the default for all drivers shipped with CUDA >=12.2 and we should check the driver version before setting the env variable.

(the `LOG(WARNING)` has to be removed before merging)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152695
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/nWEIdia
2025-05-20 19:34:40 +00:00
7587350458 Make python_agnostic cpp extension tests standalone (#153274)
Related: #148920

This PR:
* Introduces a new file `test/cpp_extensions/python_agnostic_extension/test/test_python_agnostic.py` with testing that follows the usual python testing patterns
    * This replaces the testing for python_agnostic in `test/test_cpp_extensions_aot.py`

After this PR, it is now possible to run:
```
python test/cpp_extensions/python_agnostic_extension/test/test_python_agnostic.py
```

and the test will build the prerequisite wheel before running the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153274
Approved by: https://github.com/janeyx99, https://github.com/cyyever
ghstack dependencies: #153264
2025-05-20 19:18:09 +00:00
3ecd444004 Support independent builds for cpp extension tests + apply to libtorch_agnostic tests (#153264)
Related: #148920

This PR:
* Provides a helper `install_cpp_extension(extension_root)` for building C++ extensions. This is intended to be used in `TestMyCppExtension.setUpClass()`
    * Updates libtorch_agnostic tests to use this
* Deletes preexisting libtorch_agnostic tests from `test/test_cpp_extensions_aot.py`
    * Fixes `run_test.py` to actually run tests in `test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py` to avoid losing coverage. This wasn't being run due to logic excluding tests that start with "cpp"; this is fixed now

After this PR, it is now possible to run:
```
python test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py
```

and the test will build the `libtorch_agnostic` extension before running the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153264
Approved by: https://github.com/janeyx99
2025-05-20 19:18:09 +00:00
f1f54c197d [c10d] Simplify new_subgroups() by using new_subgroups_by_enumeration() (#153843)
Summary: The code changes in each file of the diff include removing the `subgroups` and `cur_subgroup` variables, and replacing the while loop with a call to `new_subgroups_by_enumeration()`.

Test Plan: contbuild & OSS CI

Differential Revision: D75007368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153843
Approved by: https://github.com/Skylion007, https://github.com/wz337
2025-05-20 19:15:20 +00:00
2d20106922 [inductor] Support cutlass backend with remote execution (#153844)
Meta-internal builds need to use RE to build with nvcc, since the
trainers do not have nvcc (and its attendant build toolchain) installed.

This diff enables building using an RE service (via the same code path used for
Triton)

Differential Revision: [D74907192](https://our.internmc.facebook.com/intern/diff/D74907192/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153844
Approved by: https://github.com/henrylhtsang
2025-05-20 19:05:23 +00:00
e0f8174001 [triton][fb] Move build_paths into triton_utils (#153652)
Summary: TSA, this is just a small cleanup

Test Plan: CI

Differential Revision: D74835506

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153652
Approved by: https://github.com/Skylion007
2025-05-20 18:59:50 +00:00
f9bb7cf72a [cuBLASLt] relax addmm cuBLASLt constraint (#153675)
`beta == 1.0` doesn't seem to be required anymore

https://github.com/pytorch/pytorch/issues/153590

`self.dim() == 1` restriction seems to still hold but not sure if that's due to a lack of handling on the PyTorch side or the cuBLASLt side, will investigate

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153675
Approved by: https://github.com/Skylion007
2025-05-20 18:43:38 +00:00
7c9d94e9bb Redirect mobile_optimizer.rst to executorch (#153664)
Redirect mobile_optimizer.rst to executorch

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153664
Approved by: https://github.com/byjlw, https://github.com/malfet
2025-05-20 18:13:45 +00:00
0087f5f0af [AOTI][XPU] Embed SPRI-V files into .so (#153924)
Following the design of #150739, this PR supports embed kernel SPIR-V files so AOTI is one step closer to generate a single binary.
Fixes #153829
Fixes #153830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153924
Approved by: https://github.com/desertfire
2025-05-20 17:38:53 +00:00
335c89c6f1 [Monitoring] enable local logs and add mac test monitoring (#153454)
Enable to run the upload utilzation logics using local pointer instead of reading from s3, this could be useful for rocm too,
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153454
Approved by: https://github.com/huydhn
2025-05-20 17:14:40 +00:00
b910d37ec6 [cutlass backend] Reduce log level for cutlass runtime error (#153457)
Want to make sure we always call self.cleanup_run_fn() even if we crash.

I think this is the reason why sometimes we get
```
in _dlclose
TypeError: 'NoneType' object is not callable
```

Differential Revision: [D74629230](https://our.internmc.facebook.com/intern/diff/D74629230/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153457
Approved by: https://github.com/ColinPeppler
2025-05-20 17:03:17 +00:00
6b5b69a468 [Memory Snapshot] Fix RecordFunction Callback Handling (#153839)
Fixes #153571
Summary:
1. Set annotation callback to global to include all threads
2. Only init callbacks when enable == true and callbacks are empty under mutex
3. When enable == false, check if callbacks are present and if so remove them and set handle to 0 under mutex

We don't expect memory snapshots to be called from several different threads (almost always called just from main) but we make sure to add thread safety in the off case that users do want to call it from different points of entry

Test Plan: Ran basic snapshot and saw that the callbacks were registered properly

Reviewed By: ngimel

Differential Revision: D74771491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153839
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-05-20 17:01:00 +00:00
ddfaab3b56 [aoti] Reset expr when generating cpp code (#153898)
Maybe fixes https://github.com/pytorch/pytorch/issues/153896

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153898
Approved by: https://github.com/desertfire
2025-05-20 16:31:25 +00:00
5163bf0069 [CUDA][cuBLAS][cuBLASLt] avoid polluting prefer cuBLAS/Lt setting across tests (#153655)
Some tests may not set the preferred backend, which leads to unexpected behavior when multiple tests are run vs. standalone

Tests that should exercise both backends should explicitly parametrize this setting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153655
Approved by: https://github.com/ngimel
2025-05-20 16:18:35 +00:00
a7c01d7f13 [Inductor] Subgraph check output strides (#153755)
Make sure outputs strides of subgraph consistent with original gm. Without checking strides, it was possible for subgraph to produce nans with a reinterpret tensor on the output of the subgraph output, in which itself was not contiguous.

Differential Revision: [D74691119](https://our.internmc.facebook.com/intern/diff/D74691119/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153755
Approved by: https://github.com/eellison
ghstack dependencies: #153754
2025-05-20 16:07:18 +00:00
63e5d46478 [Inductor] Subgraph support dynamic input expressions (#153754)
Support subgraph choice taking in inputs that have dynamic dimensions. Testing with decomposeK subgraph decomp

Differential Revision: [D74484741](https://our.internmc.facebook.com/intern/diff/D74484741/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153754
Approved by: https://github.com/eellison
2025-05-20 16:07:18 +00:00
2e56ce097a Fixed an issue with XPU skip so the test_decompose_mem_bound_mm.py suite can be ran correctly (#153245)
Fixes #153239

Replaced custom decorator with the common one. Although the better way to skip the whole suite would be to add it to skip list in run_test.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153245
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jeffdaily
2025-05-20 15:46:21 +00:00
ef958fa152 [cuDNN][cuDNN frontend] upgrade cuDNN frontend submodule to 1.12 (#153888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153888
Approved by: https://github.com/Skylion007
2025-05-20 15:08:37 +00:00
3102ae6798 Revert "[AOTI] Add an option to specify custom op C shim (#153851)"
This reverts commit 365ac49840105918c604a6b1c7e81c1ca59e37fb.

Reverted https://github.com/pytorch/pytorch/pull/153851 on behalf of https://github.com/malfet due to Looks like it broke fuzzer test, but I could be wrong, see c4d1ff02f8/1 ([comment](https://github.com/pytorch/pytorch/pull/153851#issuecomment-2894619773))
2025-05-20 14:23:50 +00:00
c4d1ff02f8 [Lint] Update clang-format to 19.1.4 (#153889)
All changes other than the one to `tools/linter/adapters/s3_init_config.json` are generated by newer clang-format
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153889
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-05-20 14:12:46 +00:00
d869ea11e0 [BE]: Update fmtlib submodule to 11.2.0 (#153853)
Update fmtlib to 11.2.0 with a lot of miscellaneous fixes for various compilers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153853
Approved by: https://github.com/malfet
2025-05-20 14:11:18 +00:00
4b759d98f8 Recheck autotune cache on static cuda launcher load (#153565)
When loading statically launchable triton kernels from FxGraphCache, since we don't instantiate a CachingAutotuner like we do normally, we need to recheck the autotune cache based on the existing compile results. If we get a hit, we take the compile result whose config matches the best config.

Sometimes, the best config will have been from coordinate descent tuning. In this case, FxGraphCache today does not cache the resulting triton kernel, neither with static or without static cuda launcher. This is because coordinate descent tuning happens at runtime, and if the best config happens to not be one of the precompiled configs.

Test Plan:
New unit test that failed before

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153565
Approved by: https://github.com/aorenste
2025-05-20 14:00:43 +00:00
d68d4d31f4 [Cutlass] EVT tests update (#153926)
Fixes internal EVT tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153926
Approved by: https://github.com/williamwen42
2025-05-20 10:03:10 +00:00
d44074f01a [Dynamo] Fix einops regression (#153925)
Fixes https://github.com/pytorch/pytorch/issues/153476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153925
Approved by: https://github.com/williamwen42
2025-05-20 09:52:42 +00:00
44f19c7179 Record the XPU and XCCL build settings in the compiled binary (#147161)
Fixes #ISSUE_NUMBER

Currently the XPU and XCCL build settings are not recorded in the compiled binary and are not shown using the `torch.__config__.show()` which is a quick way to check if the binary has been built with such support.

Below is the output adding them (see end of last line):

```
Python 3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:24:40) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__config__.show())
PyTorch built with:
  - GCC 13.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2025.1-Product Build 20250203 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX512
XPU backend  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=RelWithDebInfo, COMMIT_SHA=43eb39d7c832b5560f7bfa8d29cc7919ac21c0ca, CXX_COMPILER=/home/pkourdis/compilers/gcc-13.3.0/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=OFF -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-error=redundant-move -DUSE_XPU -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.7.0, USE_CUDA=0, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=1, USE_MPI=0, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=0, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=1, USE_XPU=1,
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147161
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-05-20 09:21:39 +00:00
1075bb37d3 Revert "Fix fake tensor caching when output has unbacked (#153034)"
This reverts commit cb5f31a4a164a4fa1eaa627f9b15cdc18aa95ef1.

Reverted https://github.com/pytorch/pytorch/pull/153034 on behalf of https://github.com/malfet due to Seems to have introduced flakiness in MacOS inductor tests, see https://github.com/pytorch/pytorch/issues/153891 ([comment](https://github.com/pytorch/pytorch/pull/153034#issuecomment-2893059329))
2025-05-20 06:02:38 +00:00
9849c79fa2 Revert "FakeTensorMode dispatch shouldn't include bypass in exception context (#153780)"
This reverts commit aa84c037f0f473c91a79f48a5f278b7243f64b0e.

Reverted https://github.com/pytorch/pytorch/pull/153780 on behalf of https://github.com/malfet due to Reverting to clearly revert https://github.com/pytorch/pytorch/pull/153034, that seems to have introduced flakiness in MacOS inductor tests, see https://github.com/pytorch/pytorch/issues/153891 ([comment](https://github.com/pytorch/pytorch/pull/153780#issuecomment-2893053304))
2025-05-20 05:59:42 +00:00
365ac49840 [AOTI] Add an option to specify custom op C shim (#153851)
Summary: Add an option to tell AOTInductor codegen to generate C shim functions for certain custom ops instead of relying on ProxyExecutor. The lib that defines custom ops need to implement corresponding C shim functions.

Differential Revision: [D75014177](https://our.internmc.facebook.com/intern/diff/D75014177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153851
Approved by: https://github.com/hl475
2025-05-20 05:12:09 +00:00
89ebd29fdc [Dynamo] added warning message for tracing lru_cache wrapped functions (#153744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153744
Approved by: https://github.com/williamwen42
2025-05-20 04:08:29 +00:00
5ef90e14a3 [export] Remove unused constants (#153800)
An internal test case ran into a weird issue when exporting, where the model imported a file which creates tensor constants upon importing [(code ptr)](https://fburl.com/code/xwmhxm7n). This causes the tracer to create some tensor constants even though it's not used in the model code. This PR updates the lift_constant_tensors pass to remove constant nodes that are not being used instead of lifting them as tensor constants.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153800
Approved by: https://github.com/dolpm, https://github.com/pianpwk
2025-05-20 03:15:27 +00:00
a79e621c1c [DDP] rebuilt bucket order when find_unused_parameters=true (#153404)
Differential Revision: D72437251

Enable to rebuild bucket order when find_unused_parameters=true.

It should be always better than not rebuilding bucket order when find_unused_parameters=True:

1. for cases where bucket order in the first iteration is the same as the parameter order, rebuilding bucket order will not change anything

2. for cases where bucket order in the first iteration is not the same as the parameter order, there could be two cases:
    a. bucket order will not change after 1st iteration even the graph is dynamic and there is unused parameter, in this case, rebuilding bucket order will have performance gain
    b. bucket order change after 1st iteration due to dynamic graph, in this case, both parameter order and bucket order in 1st iteration are not ideal, so rebuilding bucket order or not does not matter

it can help case 2.a if enabling to rebuild bucket order when find_unused_parameters=true. meanwhile it will not hurt other cases in 1 and 2.b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153404
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2025-05-20 02:45:01 +00:00
8b94d30b26 [Testing] Benchmark more tests for MPSInductor (#153897)
And report HF tests as HF tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153897
Approved by: https://github.com/dcci
2025-05-20 02:41:38 +00:00
1627951f24 [3.13] Remove all profiler related skips (#153857)
As underlying issue were fixed by https://github.com/pytorch/pytorch/pull/153848

Fixes https://github.com/pytorch/pytorch/issues/142166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153857
Approved by: https://github.com/williamwen42
ghstack dependencies: #153848
2025-05-20 01:19:52 +00:00
b15720118a Revert "Cache code generation during triton template expansion and enable it for mm_template. (#151773)"
This reverts commit 9180bb187c0e4c3ab3654e765fe33ad4c75a2b1a.

Reverted https://github.com/pytorch/pytorch/pull/151773 on behalf of https://github.com/malfet due to It broke ROCm, see f9aa3bae8c/1 ([comment](https://github.com/pytorch/pytorch/pull/151773#issuecomment-2892587039))
2025-05-20 00:42:53 +00:00
f9aa3bae8c [Inductor][XPU] Fallback bmm to mm when batch == 1, align with cuda. (#153770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153770
Approved by: https://github.com/NikhilAPatel, https://github.com/EikanWang, https://github.com/jansel
2025-05-19 23:56:20 +00:00
d81217be2e Revert "Improve torch.ops typing (#153558)"
This reverts commit c5cba39d469151895cd0ecf7673b98e5072b69c2.

Reverted https://github.com/pytorch/pytorch/pull/153558 on behalf of https://github.com/yangw-dev due to Your diff will not be landed to fbcode since we suspect it caused the following breakage in an internal test:[D75007157](https://www.internalfb.com/diff/D75007157) for instance: tests_gpu/lookup_gpu_index_test.py:232:8 Undefined attribute [16]: torch._ops._OpNamespace has no attribute simple_index_mm_batch ([comment](https://github.com/pytorch/pytorch/pull/153558#issuecomment-2892506789))
2025-05-19 23:32:36 +00:00
701e22112d [PT2][Optimus][Observability] Refactor the logging to avoid excessive tlparse log (#153584)
Summary: context: https://fb.workplace.com/groups/943185660584207/permalink/1215335930035844/

Test Plan:
before: aps-aps-ig_v4_2t_2_make_baseline_30batch-735703723-f735706162

tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-aps-ig_v4_2t_2_make_baseline_30batch-735703723-f735706162/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000&fbclid=IwZXh0bgNhZW0CMTEAAR575JfJZUtE7kQCqzIZVCYomv1q03JzuMFVok8qDA_FuGC8oZ6rhhb2EziSQA_aem_abITQJZQP45t51_r-J-cFw

Differential Revision: D74776025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153584
Approved by: https://github.com/jamesjwu
2025-05-19 22:57:29 +00:00
c3e14ecdcd [CachingHostAllocator] guard accesses to use_host_register by mutex (#153845)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153845
Approved by: https://github.com/mradmila, https://github.com/jeffdaily
2025-05-19 22:39:13 +00:00
41564803c2 [Docs] Mention version.txt change for patch releases (#153860)
Part of https://github.com/pytorch/pytorch/issues/151425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153860
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2025-05-19 22:35:33 +00:00
08e716fc70 [BE] Fix -Wextra-semi warning (#153887)
Introduced by https://github.com/pytorch/pytorch/pull/153645

Semicolon is not needed after closing curly bracket defining a class method.

Not sure why CI did not catch it, but my local builds are now erroring out with
```
[19/97] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/passes/dead_code_elimination.cpp.o
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/jit/passes/dead_code_elimination.cpp:4:
/Users/nshulga/git/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.h:356:64: warning: extra ';' after member function definition [-Wextra-semi]
  356 |   ValueAndMemoryLocationSet(const AliasDb* db) : aliasDb_(db){};
      |                                                                ^
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153887
Approved by: https://github.com/wdvr, https://github.com/davidberard98
2025-05-19 22:25:03 +00:00
f419067e50 [ROCm] improve sparse addmm, enable complex (#153262)
PR to:
- enable complex data types for sparse matmul on ROCm
- fix sparse addmm/baddbmm on ROCm
- fix sparse hipification for ROCm
- fix/enable sparse tests on ROCm (~40 tests total):
```
test_sparse_csr.py::TestSparseCSRCUDA::test_bmm_cuda_*
test_sparse.py::TestSparseCUDA::test_sparse_matmul_cuda_*
test_sparse_csr.py::TestSparseCSRCUDA::test_mm_cuda_float64
test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCS*
test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153262
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-05-19 22:23:18 +00:00
91cc93deae [CI] Reuse old whl (#153838)
~50% of commits on main only touch python files unrelated to the object files in the whl, meaning that we could reuse old whls and put the current commit's python files into the whl.  This PR does that in CI by identifying a previous job whose artifact and whls binaries can be reused.  See https://docs.google.com/document/d/1nQ1FNJqnJuSFRiM2HvQ27zg6Vm-77n7LECp30zYfTDk/edit?tab=t.icom2lesr6es for more details?

To reuse:
* the changed files between the whl's commit and the current commit can only be python files in test/ or torch/ and not in torch/csrc
* not on main branch or release branch
* ci-force-rebuild not on PR
* special abort issue is closed
* artifact should exist

Pros:
* build time -> 6 min whenever this can be done

Cons:
* not sure if I have the right files
* version + whl name still remains the same

Testing:
Unfortunately this PR's changed files are not on the list of acceptable changed files for reusing the whl, so I've been mangling it on other PRs to get things like https://github.com/pytorch/pytorch/actions/runs/15119214901/job/42497650394?pr=147470 (It is enabled on linux-focal-cuda12.6-py3.10-gcc11 / build and there are changes in common_utils.py to make sure the copying of python takes effect)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153838
Approved by: https://github.com/malfet
2025-05-19 21:47:33 +00:00
c0343b1539 Fix profiler on cpython-3.13 (#153848)
Per [PEP 667](https://peps.python.org/pep-0667/) `PyFrame_GetLocals` no longer returns dict, but rather instance of `PyFrameLocalsProxy_Type`, so calling `PyDict_GetItemString` is no longer valid(it will always return None) and must be replaced with `PyMapping_GetItemString`

Tested by partially reverting https://github.com/pytorch/pytorch/pull/141674 full revert will be done in the followup PR

Fixes https://github.com/pytorch/pytorch/issues/148273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153848
Approved by: https://github.com/Skylion007
2025-05-19 21:20:53 +00:00
8c40c9ffcb Revert "[CI] Reuse old whl (#153838)"
This reverts commit cc48550e6f6fa8888b7d90d030b36c4e6d6581ab.

Reverted https://github.com/pytorch/pytorch/pull/153838 on behalf of https://github.com/clee2000 due to testing on main is hard ([comment](https://github.com/pytorch/pytorch/pull/153838#issuecomment-2892272494))
2025-05-19 21:13:27 +00:00
a237831bc2 [JIT] Optimize DCE by storing a MemoryLocations for an entire set<Value*> (#153645)
Summary:
**TL;DR**: make DCE faster by replacing a Set<Value*> with a MemoryLocations sparse bitset (representing all the memory locations stored by the collection of all values in the set).

**Details**
The goal of this PR is to optimize this function from AliasDb:

```
bool AliasDb::writesToAlias(Node* n, const ValueSet& vs) const {
  const auto writtenTo = getWrites(n);
  if (writtenTo.empty()) {
    return false;
  }

  MemoryLocations locs;
  for (const auto v : vs) {
    auto it = elementMap_.find(v);
    if (it != elementMap_.end()) {
      const auto& vlocs = memoryDAG_->getMemoryLocations(it->second);
      if (writtenTo.intersects(vlocs)) {
        return true;
      }
    }
  }

  return false;
}
```

In the DCE use case, we have a ValueSet of live values, into which we insert `Value*`s; and sometimes need to check whether a node mutates any of the live values using `writesToAlias`.

Looping through all the values in the ValueSet and indexing into the elementMap_ is slow; so if we can pre-compute the MemoryLocations set, this speeds up the function. In some large model examples, I see ~15-25x speedups from this change.

**Implementation**: To avoid exposing too many details of AliasDb, I introduce a friend class `ValueAndMemoryLocationSet`, which is an insert-only set of Values, which also maintains the corresponding MemoryLocations.

Then in AliasDb, I use `ValueAndMemoryLocationSet` if we're using AliasDb for analysis, and otherwise use a `Set<Value*>` if we don't have AliasDb.

Test Plan: Rely on unit tests.

Differential Revision: D74827086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153645
Approved by: https://github.com/eellison
2025-05-19 21:04:59 +00:00
cc48550e6f [CI] Reuse old whl (#153838)
~50% of commits on main only touch python files unrelated to the object files in the whl, meaning that we could reuse old whls and put the current commit's python files into the whl.  This PR does that in CI by identifying a previous job whose artifact and whls binaries can be reused.  See https://docs.google.com/document/d/1nQ1FNJqnJuSFRiM2HvQ27zg6Vm-77n7LECp30zYfTDk/edit?tab=t.icom2lesr6es for more details?

To reuse:
* the changed files between the whl's commit and the current commit can only be python files in test/ or torch/ and not in torch/csrc
* not on main branch or release branch
* ci-force-rebuild not on PR
* special abort issue is closed
* artifact should exist

Pros:
* build time -> 6 min whenever this can be done

Cons:
* not sure if I have the right files
* version + whl name still remains the same

Testing:
Unfortunately this PR's changed files are not on the list of acceptable changed files for reusing the whl, so I've been mangling it on other PRs to get things like https://github.com/pytorch/pytorch/actions/runs/15119214901/job/42497650394?pr=147470 (It is enabled on linux-focal-cuda12.6-py3.10-gcc11 / build and there are changes in common_utils.py to make sure the copying of python takes effect)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153838
Approved by: https://github.com/malfet
2025-05-19 20:56:44 +00:00
9180bb187c Cache code generation during triton template expansion and enable it for mm_template. (#151773)
In a model, we see ~~ 40% of the time in mm/addmm tuning. The model have 2000 mm,
many of which receives the same input shapes.

with autotune enabled, this become expensive, while we already cache auto tuning results, we
did not used to cache the generation of the python code and the loading for each config that we autotune on.

This diff handles the code generation part (template expansions) a previous diff handled the loading part.
This is expected to save 20% of the model I am working on.

How do we do the caching?
For a given configurations and input layout, the generated code is always the same. One caveat is that
some other information collected during code generation are input dependent (namely depends on inputs
names and symbol names in inputs). and not just layout. !
To handle those we use a record and replay approach, where we record the functions that are called during
code generation that effect those outputs and replay them at a cache hit.

Effect on the current benchmark on a local run on dev server.
mm_loop. 24115830838 -> 18362098019
mm_loop_dynamic 30506097176-> 25697270062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151773
Approved by: https://github.com/eellison
2025-05-19 20:38:04 +00:00
1ccacc028d Revert "[CI] Reuse old whl (#153838)"
This reverts commit 0716acff3a3692daaf31d97f91ae5aee70f10f24.

Reverted https://github.com/pytorch/pytorch/pull/153838 on behalf of https://github.com/clee2000 due to forgot to comment some stuff out ([comment](https://github.com/pytorch/pytorch/pull/153838#issuecomment-2892195387))
2025-05-19 20:33:14 +00:00
6383ddcfa4 Update serialization docs (#153631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153631
Approved by: https://github.com/albanD
2025-05-19 20:22:07 +00:00
2fcbb903cb [BE][EZ] Delete unsued conda-env-IOS.txt (#153849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153849
Approved by: https://github.com/janeyx99, https://github.com/seemethere, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-05-19 20:06:38 +00:00
eqy
6ae0c42278 [CUDA][cuBLASLt] Respect allow[FP16/BF16]ReductionCuBLAS in cuBLASLt (#153095)
cuBLASLt matmuls have been silently allowing all reduction types, which meant that e.g., `allow_fp16_reduced_precision_reduction = False` had no effect.

In practice split-K with reduced precision reductions were unlikely to happen as the default `CUBLASLT_WORKSPACE_SIZE` of 1MiB tends to prevent this.

However this isn't guaranteed and we are on the path to increasing the default workspace size following #151163

This setting is effectively already tested in e.g., `test_cublas_addmm_size_100_cuda_float16` and `test_cublas_addmm_size_100_cuda_bfloat16` but the backend selection is not deterministic. Running the full `test_matmul_cuda.py` seems to exercise the Lt interface, but running a standalone test does not (apparently due to spurious alignment differences).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153095
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-05-19 20:05:37 +00:00
e581e1c0f4 [BE][Ez]: Propogate some nodiscard in RNN (#153836)
Follow up @cyyever #153805 to propagate [nodiscard] from the empty() method call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153836
Approved by: https://github.com/eqy
2025-05-19 19:55:45 +00:00
0716acff3a [CI] Reuse old whl (#153838)
~50% of commits on main only touch python files unrelated to the object files in the whl, meaning that we could reuse old whls and put the current commit's python files into the whl.  This PR does that in CI by identifying a previous job whose artifact and whls binaries can be reused.  See https://docs.google.com/document/d/1nQ1FNJqnJuSFRiM2HvQ27zg6Vm-77n7LECp30zYfTDk/edit?tab=t.icom2lesr6es for more details?

To reuse:
* the changed files between the whl's commit and the current commit can only be python files in test/ or torch/ and not in torch/csrc
* not on main branch or release branch
* ci-force-rebuild not on PR
* special abort issue is closed
* artifact should exist

Pros:
* build time -> 6 min whenever this can be done

Cons:
* not sure if I have the right files
* version + whl name still remains the same

Testing:
Unfortunately this PR's changed files are not on the list of acceptable changed files for reusing the whl, so I've been mangling it on other PRs to get things like https://github.com/pytorch/pytorch/actions/runs/15119214901/job/42497650394?pr=147470 (It is enabled on linux-focal-cuda12.6-py3.10-gcc11 / build and there are changes in common_utils.py to make sure the copying of python takes effect)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153838
Approved by: https://github.com/malfet
2025-05-19 19:26:08 +00:00
674a85cf26 Revert "[Distributed][CI] Rework continuous TestCase (#153653)"
This reverts commit 0d5c628a6e96e0a960af39d1d0de4bf04df69c39.

Reverted https://github.com/pytorch/pytorch/pull/153653 on behalf of https://github.com/kwen2501 due to More fixes needed ([comment](https://github.com/pytorch/pytorch/pull/153653#issuecomment-2891931028))
2025-05-19 18:29:27 +00:00
0d5c628a6e [Distributed][CI] Rework continuous TestCase (#153653)
1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple TestClass'es in one file).

2. The child processes are now an infinite loop, monitoring test IDs passed from main process via a task queue. Reciprocally, the child processes inform the main process completion of a test via a completion queue.

3. Added a test template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653
Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj
2025-05-19 18:20:42 +00:00
c54b9f2969 [Monitoring] Add util for linux build (#153456)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153456
Approved by: https://github.com/huydhn
2025-05-19 17:28:17 +00:00
be36bacdaa [pytorch] Delete TorchScript based Android demo app and point user to ExecuTorch (#153767)
Summary: A retry of #153656. This time start from co-dev to make sure we capture internal signals.

Test Plan: Rely on CI jobs.

Differential Revision: D74911818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153767
Approved by: https://github.com/kirklandsign, https://github.com/cyyever, https://github.com/Skylion007
2025-05-19 17:20:36 +00:00
6487ea30b3 [c10d] Fix new_subgroups(group=) bug (#153798)
Summary: The bug, introduced in https://github.com/pytorch/pytorch/pull/152765, was caused by passing the `group` parameter to the `get_rank()` function, which caused the function to return the rank of the entire group instead of the rank of the current process. The fix involves removing the `group` parameter from the `get_rank()` function call.

Test Plan: contbuild & OSS CI

Differential Revision: D74964213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153798
Approved by: https://github.com/Skylion007
2025-05-19 17:01:10 +00:00
b0e5402377 Revert "Recheck autotune cache on static cuda launcher load (#153565)"
This reverts commit 02af4e88e4e76309672dbc9b5970ae630df525c7.

Reverted https://github.com/pytorch/pytorch/pull/153565 on behalf of https://github.com/malfet due to Looks like it broke ROCM, see ee72c53c88/1 ([comment](https://github.com/pytorch/pytorch/pull/153565#issuecomment-2891673913))
2025-05-19 16:52:48 +00:00
ee72c53c88 Enable ruff check for all ipynb files (#153820)
Fixes #146411, following #148654

After test, seems this could be enabled for all ipynb file.

```bash
lintrunner --take RUFF --all-files
Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file.
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153820
Approved by: https://github.com/Skylion007
2025-05-19 16:45:26 +00:00
ed5f4a4fa8 Replace size() checks with empty() (#153805)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153805
Approved by: https://github.com/nareshrajkumar866, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-05-19 16:20:57 +00:00
0ec8fe46d7 cleanup, refactor and add missing self._dde_suppressed checks (#152657)
so two things other than cleanups and refactoring
1) do not use propagate_real_tensors to resolve eval under guard_or_true/guard_or_false .
2) do not guard for dimensions of type  DimDynamic.OBLIVIOUS_SIZE under guard_or_true/guard_or_false .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152657
Approved by: https://github.com/pianpwk
2025-05-19 16:15:14 +00:00
dccd19c2ef [Inductor] Construct subgraph with benchmarking args not example_inputs (#153753)
If the inputs to a subgraph has FlexibleLayout, the subgraph does not currently freeze the layouts here. Therefore, the `example_inputs` generated might not be consistent in layout with the `args` based in for benchmarking

Differential Revision: [D74900879](https://our.internmc.facebook.com/intern/diff/D74900879/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153753
Approved by: https://github.com/eellison
2025-05-19 15:58:40 +00:00
7a46f4bde0 Enable accelerator to perform streaming backward (#153412)
Also see https://github.com/pytorch/pytorch/pull/142097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153412
Approved by: https://github.com/albanD
ghstack dependencies: #151079
2025-05-19 15:52:42 +00:00
c5cba39d46 Improve torch.ops typing (#153558)
Fixes longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.

Decisions made along the way:

1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.

The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153558
Approved by: https://github.com/rec, https://github.com/Skylion007, https://github.com/cyyever
2025-05-19 14:52:32 +00:00
3cd5b3b1e7 [AOTI] Skip a rocm test (#153828)
Summary: Skip test_aot_inductor_package.test_compile_after_package. https://github.com/pytorch/pytorch/pull/150739 added an opt-in feature which doesn't work for rocm yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153828
Approved by: https://github.com/malfet
2025-05-19 14:13:19 +00:00
02af4e88e4 Recheck autotune cache on static cuda launcher load (#153565)
When loading statically launchable triton kernels from FxGraphCache, since we don't instantiate a CachingAutotuner like we do normally, we need to recheck the autotune cache based on the existing compile results. If we get a hit, we take the compile result whose config matches the best config.

Sometimes, the best config will have been from coordinate descent tuning. In this case, FxGraphCache today does not cache the resulting triton kernel, neither with static or without static cuda launcher. This is because coordinate descent tuning happens at runtime, and if the best config happens to not be one of the precompiled configs.

Test Plan:
New unit test that failed before

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153565
Approved by: https://github.com/aorenste
2025-05-19 12:50:22 +00:00
c45515c2ed Update slow tests (#153815)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153815
Approved by: https://github.com/pytorchbot
2025-05-19 11:15:25 +00:00
4f1a52fba4 [xla hash update] update the pinned xla hash (#153816)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153816
Approved by: https://github.com/pytorchbot
2025-05-19 11:05:51 +00:00
f3daedb263 [BE]: Remove redundant copy (#153629)
Add typing and remove redundant copy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153629
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-05-19 08:25:20 +00:00
5506baa4ed Refactoring FSDP2 (_composable/fsdp) test cases to be device agnostic (#149848)
The motivation for this PR is refactor existing test cases in the folder test/distributed/_composable/fsdp/ or fsdp2(as referred to in torch titan) to be device agnostic such that any accelerator type is supported (for eg. CUDA, HPU, XPU etc)

The changes are in line with previously merged changes for fsdp (present in the folder test/distributed/fsdp/ ) test cases: https://github.com/pytorch/pytorch/pull/139184/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149848
Approved by: https://github.com/kwen2501, https://github.com/guangyey
2025-05-19 05:46:51 +00:00
6f835a4769 [amd] fix tunableop gemm (#153764)
Summary: Tunableop on AMD has perf regression for a while. It turns out that the tunableop code path will first run tuned GEMM and then run heuristics GEMM (so run two GEMMs...)....

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 buck test @//mode/opt-amd-gpu -c fbcode.rocm_arch=mi300 -c fbcode.rocm_ck_rtz=true fbcode//accelerators/workloads/microbench/RE:test_emu_v1p4 -- --exact 'accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)' --run-disabled
```

Before the diff
```
  File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/ecc11ed52295855f/accelerators/workloads/microbench/RE/__test_emu_v1p4__/test_emu_v1p4#link-tree/accelerators/workloads/microbench/RE/test_emu_v1p4.py", line 47, in test_gemm
    self.assertTrue(result < AMD_GEMM_BASELINE * AMD_GEMM_THRESHOLD)

Buck UI: https://www.internalfb.com/buck2/b4b8dfca-0301-4c5d-83d6-d866d840c42d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14355223896396807
Network: Up: 10MiB  Down: 1.9GiB  (reSessionID-23b213fe-a460-4788-86c6-a52343ff10f4)
Loading targets.   Remaining      0/5144                                      93161 dirs read, 753263 targets declared
Analyzing targets. Remaining      0/70523                                     2837379 actions, 3262810 artifacts declared
Executing actions. Remaining      0/472286                                    217:26:58.1s exec time total
Command: test.     Finished 122 local, 522 remote, 199785 cache (99% hit)     211:26:30.5s exec time cached (97%)
Time elapsed: 12:50.2s
Test execution completed but the tests failed
Tests finished: Pass 0. Fail 1. Fatal 0. Skip 0. Build failure 0
1 TESTS FAILED
  ✗ accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)

Run $ fdb buck test <args> to debug accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)
      ^^^ just prefix your previous command! ($ fdb !!)
Learn more at https://fburl.com/fdb
```

After the diff
```
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Reviewed By: henryoier, henryhu6

Differential Revision: D74910115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153764
Approved by: https://github.com/yangsiyu007, https://github.com/xw285cornell
2025-05-19 04:07:48 +00:00
2ade886412 [XPU] [Windows] Auto turn on kineto XPU build when compiler version support. (#153681)
Since SYCL compiler 20250101, it will remove dependency of level zero header. We can turn on kineto XPU by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153681
Approved by: https://github.com/chuanqi129, https://github.com/cyyever, https://github.com/EikanWang
2025-05-19 03:07:14 +00:00
1bc5762495 [Intel GPU][Inductor] Fallback embedding_dense_backward on XPU (#151637)
Reopen #146888, now the modification only affects xpu device. We do not  want to decompose embedding_dense_backward for torch.compile. Current XPU devices have hardware limitations on atomic ops. Fallback to eager and we can use sort to implement this op. hf_T5 amp bf16 training in torchbench can get 2x improvement on Max 1550. ~~I also align with cuda on gelu decomposition in _addmm_activation~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151637
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang
2025-05-19 02:19:37 +00:00
74d0300804 Change unsafe_marked_cacheable_functions to a dictionary, so that you can specify a static cache key (#152486)
Fixes https://github.com/pytorch/pytorch/issues/152434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152486
Approved by: https://github.com/oulgen
2025-05-19 02:16:33 +00:00
694748dd9d [MPSInductor] Fix conv_transpose channels last (#153787)
Regardless of the input layout, transposed convolution always returns contiguous tensor on MPS
Add test to validate that
This fixes torch.compile for SegmentAnything network

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153787
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #153786
2025-05-19 02:01:48 +00:00
6fe5d9215f [EZ][MPS] Enable rsub op (#153786)
Nothing really to enable, just add it to native functions, TensorIterator abstraction takes care of the rest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153786
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/dcci
2025-05-19 02:01:48 +00:00
a2d0ef242d [AOTI] Embed cubin files into .so (#150739)
Summary: Embed cubin files so AOTI is one step closer to generate a single binary. Controlled by a flag and off as default.

Differential Revision: [D72535357](https://our.internmc.facebook.com/intern/diff/D72535357)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150739
Approved by: https://github.com/angelayi
2025-05-19 01:11:46 +00:00
cyy
a8986963da Fix some CMake issues (#153686)
These issues were discovered when trying CMake 3.27:
1. set C++ language on HIP sources.
2. add missing link to gtest_main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153686
Approved by: https://github.com/Skylion007
2025-05-19 00:31:34 +00:00
75eb2f3ff6 Revert "[Dynamo] added warning message for tracing lru_cache wrapped functions (#153744)"
This reverts commit aac30ef50366b03f0ef2d1e770f45a3465f6ea66.

Reverted https://github.com/pytorch/pytorch/pull/153744 on behalf of https://github.com/jeanschmidt due to Need to revert as it is breaking internal signals: [D74935585](https://www.internalfb.com/diff/D74935585) ([comment](https://github.com/pytorch/pytorch/pull/153744#issuecomment-2889187038))
2025-05-18 20:13:00 +00:00
cb57b19c3a [ATen-CPU] Use math.h for GeLU as well as cmath (#153742)
Summary:
## Context

See https://github.com/pytorch/pytorch/pull/149164 for more context.

Originally, this fix worked but more recently including `cmath` by itself no longer provides access to math constants on Windows platforms. I found that including `math.h` resolves this.

I'm not sure exactly what changed, but this PR updates the header to just use both includes fix the symbols not being found. It might be a bug with a recent Windows update perhaps?

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153742
Approved by: https://github.com/swolchok, https://github.com/Skylion007
2025-05-18 19:06:45 +00:00
aa84c037f0 FakeTensorMode dispatch shouldn't include bypass in exception context (#153780)
In the FakeTensor cache when we get a bypass exception while computing the cache key (call this exc_1) we need to dispatch to the original operation.

It's possible for the dispatch to the original operation to get its own exception which we want to bubble up to the caller (call this exc_2).

If we directly dispatch from within the handler for exc_1 then exc_2 will have a `__context__` of exc_1 - which can cause deviations between cached and non-cached behavior - so we need to be a bit careful when we call the dispatch.

Testing:
test_aotdispatch.py::TestAOTExport::test_aot_export_predispatch_outdtype fails before this change and passes after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153780
Approved by: https://github.com/oulgen
2025-05-18 17:21:46 +00:00
68034198e5 [HOP] Mutation and alias rework (#146658)
This PR reworks the way the input mutations and various aliases are checked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146658
Approved by: https://github.com/ydwu4
2025-05-18 08:05:22 +00:00
0e805aad7f [ONNX] Support float4 (#151069)
- Support exporting float4 models (note: currently we use IR version 10 universally in the exporter, which does not include float 4 support. Eventually when onnx runtime and the ecosystem moves to support the new IR version 11 we should bump our version to 11 in the exporter as well)
- The shape of the type is set according to https://github.com/pytorch/pytorch/pull/148791#discussion_r2038704986 (added last dim with size 2)
- Use ml_dtypes types when converting to numpy for consistency with ONNX IR

Fix https://github.com/pytorch/pytorch/issues/150202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151069
Approved by: https://github.com/titaiwangms
2025-05-18 03:19:35 +00:00
8568dbce1d [inductor] Clean typing in codegen/common.py and codecache.py (#150767)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150767
Approved by: https://github.com/aorenste
2025-05-17 13:56:50 +00:00
27f7b65a69 [BE] Ensure generated stub files by gen_pyi are properly formatted (#150730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150730
Approved by: https://github.com/aorenste
2025-05-17 12:30:40 +00:00
7ebea09986 [Cutlass] Enable fusion with FusedSchedulerNodes (#153588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153588
Approved by: https://github.com/eellison
ghstack dependencies: #152815
2025-05-17 12:29:10 +00:00
f604732e2e [Cutlass] E2E Tests for EVT (#152815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152815
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-05-17 12:29:10 +00:00
b4fb801b2d [export] Move PT2 constants to torch::_export (#153206)
Test Plan:
`buck2 test //sigmoid/...`
https://www.internalfb.com/intern/testinfra/testrun/1970325119807758

Differential Revision: D74417085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153206
Approved by: https://github.com/zhxchen17, https://github.com/dolpm
2025-05-17 08:21:59 +00:00
40339c1e99 Revert "[CUDA][cuBLAS][cuBLASLt] avoid polluting prefer cuBLAS/Lt setting across tests (#153655)"
This reverts commit 3bde364996d53571a9fb799f5951a203a352ed18.

Reverted https://github.com/pytorch/pytorch/pull/153655 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/153655#issuecomment-2888212597))
2025-05-17 08:11:54 +00:00
9b2a45ac7d Refactor torch/utils/data/datapipes/gen_pyi.py with torchgen (#150626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150626
Approved by: https://github.com/aorenste
2025-05-17 06:21:41 +00:00
eqy
e802b29ed4 [SDPA][EZ] Abate narrowing conversion warning spam in flash_api.cpp (#153643)
for messages like
```/workspace/pytorch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp:1396:38: warning: narrowing conversion of ‘(char)(& q)->at::Tensor::<anonymous>.at::TensorBase::get_device()’ from ‘char’ to ‘c10::DeviceIndex’ {aka ‘signed ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153643
Approved by: https://github.com/Skylion007
2025-05-17 02:07:35 +00:00
945 changed files with 28838 additions and 10325 deletions

View File

@ -31,33 +31,47 @@ def build_ArmComputeLibrary() -> None:
"build=native",
]
acl_install_dir = "/acl"
acl_checkout_dir = "ComputeLibrary"
os.makedirs(acl_install_dir)
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
]
)
acl_checkout_dir = os.getenv("ACL_SOURCE_DIR", "ComputeLibrary")
if os.path.isdir(acl_install_dir):
shutil.rmtree(acl_install_dir)
if not os.path.isdir(acl_checkout_dir) or not len(os.listdir(acl_checkout_dir)):
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
]
)
check_call(
["scons", "Werror=1", "-j8", f"build_dir=/{acl_install_dir}/build"]
+ acl_build_flags,
["scons", "Werror=1", f"-j{os.cpu_count()}"] + acl_build_flags,
cwd=acl_checkout_dir,
)
for d in ["arm_compute", "include", "utils", "support", "src"]:
for d in ["arm_compute", "include", "utils", "support", "src", "build"]:
shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")
def update_wheel(wheel_path, desired_cuda) -> None:
def replace_tag(filename) -> None:
with open(filename) as f:
lines = f.readlines()
for i, line in enumerate(lines):
if line.startswith("Tag:"):
lines[i] = line.replace("-linux_", "-manylinux_2_28_")
print(f"Updated tag from {line} to {lines[i]}")
break
with open(filename, "w") as f:
f.writelines(lines)
def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"""
Update the cuda wheel libraries
Package the cuda wheel libraries
"""
folder = os.path.dirname(wheel_path)
wheelname = os.path.basename(wheel_path)
@ -88,30 +102,19 @@ def update_wheel(wheel_path, desired_cuda) -> None:
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
if enable_cuda:
if "128" in desired_cuda:
libs_to_copy += [
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
if "126" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
elif "128" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
else:
libs_to_copy += [
"/opt/OpenBLAS/lib/libopenblas.so.0",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
# Copy libraries to unzipped_folder/a/lib
for lib_path in libs_to_copy:
lib_name = os.path.basename(lib_path)
@ -120,6 +123,13 @@ def update_wheel(wheel_path, desired_cuda) -> None:
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '$ORIGIN' --force-rpath {folder}/tmp/torch/lib/{lib_name}"
)
# Make sure the wheel is tagged with manylinux_2_28
for f in os.scandir(f"{folder}/tmp/"):
if f.is_dir() and f.name.endswith(".dist-info"):
replace_tag(f"{f.path}/WHEEL")
break
os.mkdir(f"{folder}/cuda_wheel")
os.system(f"cd {folder}/tmp/; zip -r {folder}/cuda_wheel/{wheelname} *")
shutil.move(
@ -194,8 +204,10 @@ if __name__ == "__main__":
).decode()
print("Building PyTorch wheel")
build_vars = "MAX_JOBS=5 CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
os.system("cd /pytorch; python setup.py clean")
build_vars = "CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
# MAX_JOB=5 is not required for CPU backend (see commit 465d98b)
if enable_cuda:
build_vars = "MAX_JOBS=5 " + build_vars
override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
desired_cuda = os.getenv("DESIRED_CUDA")
@ -242,6 +254,6 @@ if __name__ == "__main__":
print("Updating Cuda Dependency")
filename = os.listdir("/pytorch/dist/")
wheel_path = f"/pytorch/dist/{filename[0]}"
update_wheel(wheel_path, desired_cuda)
package_cuda_wheel(wheel_path, desired_cuda)
pytorch_wheel_name = complete_wheel("/pytorch/")
print(f"Build Complete. Created {pytorch_wheel_name}..")

View File

@ -1,7 +1,7 @@
ARG CUDA_VERSION=12.4
ARG BASE_TARGET=cuda${CUDA_VERSION}
ARG ROCM_IMAGE=rocm/dev-almalinux-8:6.3-complete
FROM amd64/almalinux:8 as base
FROM amd64/almalinux:8.10-20250519 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
@ -11,6 +11,8 @@ ARG DEVTOOLSET_VERSION=11
RUN yum -y update
RUN yum -y install epel-release
# install glibc-langpack-en make sure en_US.UTF-8 locale is available
RUN yum -y install glibc-langpack-en
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel openssl-devel yum-utils autoconf automake make gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'

View File

@ -109,8 +109,8 @@ case "$tag" in
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -121,30 +121,6 @@ case "$tag" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
@ -156,8 +132,8 @@ case "$tag" in
UCC_COMMIT=${_UCC_COMMIT}
TRITON=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -168,8 +144,8 @@ case "$tag" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
@ -180,8 +156,8 @@ case "$tag" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9

View File

@ -1 +1 @@
0bcc8265e677e5321606a3311bf71470f14456a8
b0e26b7359c147b8aa0af686c20510fb9b15990a

View File

@ -1 +1 @@
96316ce50fade7e209553aba4898cd9b82aab83b
c8757738a7418249896224430ce84888e8ecdd79

View File

@ -183,9 +183,9 @@ function prune_126 {
function install_128 {
CUDNN_VERSION=9.8.0.87
echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
# install CUDA 12.8.0 in the same container
install_cuda 12.8.0 cuda_12.8.0_570.86.10_linux
echo "Installing CUDA 12.8.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.6.3"
# install CUDA 12.8.1 in the same container
install_cuda 12.8.1 cuda_12.8.1_570.124.06_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 12 $CUDNN_VERSION

View File

@ -31,8 +31,7 @@ pip_install \
pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnx==1.17.0
pip_install onnxscript==0.2.2 --no-deps
pip_install onnxscript==0.2.6 --no-deps
# required by onnxscript
pip_install ml_dtypes

View File

@ -5,7 +5,9 @@ ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV LANGUAGE=C.UTF-8
ARG DEVTOOLSET_VERSION=13
# there is a bugfix in gcc >= 14 for precompiled headers and s390x vectorization interaction.
# with earlier gcc versions test/inductor/test_cpu_cpp_wrapper.py will fail.
ARG DEVTOOLSET_VERSION=14
# Installed needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
@ -58,7 +60,8 @@ RUN yum install -y \
libxslt-devel \
libxml2-devel \
openssl-devel \
valgrind
valgrind \
ninja-build
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
@ -103,9 +106,6 @@ CMD ["/bin/bash"]
# install test dependencies:
# - grpcio requires system openssl, bundled crypto fails to build
RUN dnf install -y \
protobuf-devel \
protobuf-c-devel \
protobuf-lite-devel \
hdf5-devel \
python3-h5py \
git
@ -129,6 +129,9 @@ RUN pip3 install flatbuffers && \
git clone https://github.com/microsoft/onnxruntime && \
cd onnxruntime && git checkout v1.21.0 && \
git submodule update --init --recursive && \
./build.sh --config Release --parallel 0 --enable_pybind --build_wheel --enable_training --enable_training_apis --enable_training_ops --skip_tests --allow_running_as_root && \
./build.sh --config Release --parallel 0 --enable_pybind \
--build_wheel --enable_training --enable_training_apis \
--enable_training_ops --skip_tests --allow_running_as_root \
--compile_no_warning_as_error && \
pip3 install ./build/Linux/Release/dist/onnxruntime_training-*.whl && \
cd .. && /bin/rm -rf ./onnxruntime

View File

@ -93,7 +93,7 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.14.0
mypy==1.15.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.14.0
@ -166,10 +166,10 @@ pillow==11.0.0
#Pinned versions: 10.3.0
#test that import:
protobuf==3.20.2
#Description: Googles data interchange format
#Pinned versions: 3.20.1
#test that import: test_tensorboard.py
protobuf==5.29.4
#Description: Google's data interchange format
#Pinned versions: 5.29.4
#test that import: test_tensorboard.py, test/onnx/*
psutil
#Description: information on running processes and system utilization
@ -337,12 +337,12 @@ sympy==1.13.3
#Pinned versions:
#test that import:
onnx==1.17.0
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
onnx==1.18.0
#Description: Required by onnx tests, and mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
onnxscript==0.2.2
onnxscript==0.2.6
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:

View File

@ -15,6 +15,10 @@ sphinxext-opengraph==0.9.1
#Description: This is used to generate PyTorch docs
#Pinned versions: 0.9.1
sphinx_sitemap==2.6.0
#Description: This is used to generate sitemap for PyTorch docs
#Pinned versions: 2.6.0
matplotlib==3.5.3
#Description: This is used to generate PyTorch docs
#Pinned versions: 3.5.3

View File

@ -1 +1 @@
3.3.0
3.3.1

View File

@ -18,12 +18,10 @@ retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
PLATFORM="manylinux2014_x86_64"
PLATFORM=""
# TODO move this into the Docker images
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
PLATFORM="manylinux_2_28_x86_64"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
@ -36,6 +34,9 @@ elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
retry apt-get update
retry apt-get -y install zip openssl
else
echo "Unknown OS: '$OS_NAME'"
exit 1
fi
# We use the package name to test the package by passing this to 'pip install'
@ -79,8 +80,6 @@ if [[ -e /opt/openssl ]]; then
export CMAKE_INCLUDE_PATH="/opt/openssl/include":$CMAKE_INCLUDE_PATH
fi
mkdir -p /tmp/$WHEELHOUSE_DIR
export PATCHELF_BIN=/usr/local/bin/patchelf

View File

@ -36,10 +36,8 @@ if [[ -n "$DESIRED_CUDA" ]]; then
if [[ ${DESIRED_CUDA} =~ ^[0-9]+\.[0-9]+$ ]]; then
CUDA_VERSION=${DESIRED_CUDA}
else
# cu90, cu92, cu100, cu101
if [[ ${#DESIRED_CUDA} -eq 4 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:1}.${DESIRED_CUDA:3:1}"
elif [[ ${#DESIRED_CUDA} -eq 5 ]]; then
# cu126, cu128 etc...
if [[ ${#DESIRED_CUDA} -eq 5 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4:1}"
fi
fi
@ -61,10 +59,6 @@ case ${CUDA_VERSION} in
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.4)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
11.8)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};3.7;9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
@ -91,14 +85,15 @@ fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
else
echo "Unknown OS: '$OS_NAME'"
exit 1
fi
DEPS_LIST=(
@ -108,26 +103,8 @@ DEPS_SONAME=(
"libgomp.so.1"
)
# CUDA 11.8 have to ship the libcusparseLt.so.0 with the binary
# since nvidia-cusparselt-cu11 is not available in PYPI
if [[ $USE_CUSPARSELT == "1" && $CUDA_VERSION == "11.8" ]]; then
DEPS_SONAME+=(
"libcusparseLt.so.0"
)
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcusparseLt.so.0"
)
fi
# Turn USE_CUFILE off for CUDA 11.8, 12.4 since nvidia-cufile-cu11 and 1.9.0.20 are
# not available in PYPI
if [[ $CUDA_VERSION == "11.8" || $CUDA_VERSION == "12.4" ]]; then
export USE_CUFILE=0
fi
# CUDA_VERSION 12.4, 12.6, 12.8
# CUDA_VERSION 12.6, 12.8
if [[ $CUDA_VERSION == 12* ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
@ -151,6 +128,8 @@ if [[ $CUDA_VERSION == 12* ]]; then
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/lib64/libnvrtc-builtins.so"
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
@ -168,17 +147,9 @@ if [[ $CUDA_VERSION == 12* ]]; then
"libnvToolsExt.so.1"
"libnvrtc.so.12"
"libnvrtc-builtins.so"
"libcufile.so.0"
"libcufile_rdma.so.1"
)
if [[ $USE_CUFILE == 1 ]]; then
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
)
DEPS_SONAME+=(
"libcufile.so.0"
"libcufile_rdma.so.1"
)
fi
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
@ -194,12 +165,8 @@ if [[ $CUDA_VERSION == 12* ]]; then
'$ORIGIN/../../cusparselt/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
'$ORIGIN/../../nvidia/cufile/lib'
)
if [[ $USE_CUFILE == 1 ]]; then
CUDA_RPATHS+=(
'$ORIGIN/../../nvidia/cufile/lib'
)
fi
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
@ -214,11 +181,25 @@ if [[ $CUDA_VERSION == 12* ]]; then
fi
elif [[ $CUDA_VERSION == "11.8" ]]; then
export USE_STATIC_CUDNN=0
# Turn USE_CUFILE off for CUDA 11.8 since nvidia-cufile-cu11 and 1.9.0.20 are
# not available in PYPI
export USE_CUFILE=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
# Bundle ptxas into the wheel, see https://github.com/pytorch/pytorch/pull/119750
export BUILD_BUNDLE_PTXAS=1
# CUDA 11.8 have to ship the libcusparseLt.so.0 with the binary
# since nvidia-cusparselt-cu11 is not available in PYPI
if [[ $USE_CUSPARSELT == "1" ]]; then
DEPS_SONAME+=(
"libcusparseLt.so.0"
)
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcusparseLt.so.0"
)
fi
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(

View File

@ -22,9 +22,7 @@ retry () {
# TODO move this into the Docker images
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
retry dnf install -q -y zip openssl
@ -35,6 +33,9 @@ elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
else
echo "Unknown OS: '$OS_NAME'"
exit 1
fi
# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if

View File

@ -40,7 +40,7 @@ if [[ ${BUILD_ENVIRONMENT} == *"distributed"* ]]; then
else
# Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel --plat-name macosx_11_0_arm64
fi
if which sccache > /dev/null; then
print_sccache_stats

View File

@ -20,14 +20,4 @@ print_cmake_info() {
CONDA_INSTALLATION_DIR=$(dirname "$CMAKE_EXEC")
# Print all libraries under cmake rpath for debugging
ls -la "$CONDA_INSTALLATION_DIR/../lib"
export CMAKE_EXEC
# Explicitly add conda env lib folder to cmake rpath to address the flaky issue
# where cmake dependencies couldn't be found. This seems to point to how conda
# links $CMAKE_EXEC to its package cache when cloning a new environment
install_name_tool -add_rpath @executable_path/../lib "${CMAKE_EXEC}" || true
# Adding the rpath will invalidate cmake signature, so signing it again here
# to trust the executable. EXC_BAD_ACCESS (SIGKILL (Code Signature Invalid))
# with an exit code 137 otherwise
codesign -f -s - "${CMAKE_EXEC}" || true
}

View File

@ -165,6 +165,7 @@ test_jit_hooks() {
torchbench_setup_macos() {
git clone --recursive https://github.com/pytorch/vision torchvision
git clone --recursive https://github.com/pytorch/audio torchaudio
brew install jpeg-turbo libpng
pushd torchvision
git fetch
@ -179,7 +180,8 @@ torchbench_setup_macos() {
git checkout "$(cat ../.github/ci_commit_pins/audio.txt)"
git submodule update --init --recursive
python setup.py clean
python setup.py develop
#TODO: Remove me, when figure out how to make TorchAudio find brew installed openmp
USE_OPENMP=0 python setup.py develop
popd
# Shellcheck doesn't like it when you pass no arguments to a function that can take args. See https://www.shellcheck.net/wiki/SC2120
@ -187,9 +189,8 @@ torchbench_setup_macos() {
checkout_install_torchbench
}
conda_benchmark_deps() {
conda install -y astunparse numpy scipy ninja pyyaml setuptools cmake typing-extensions requests protobuf numba cython scikit-learn
conda install -y -c conda-forge librosa
pip_benchmark_deps() {
python -mpip install --no-input astunparse requests cython scikit-learn
}
@ -197,7 +198,7 @@ test_torchbench_perf() {
print_cmake_info
echo "Launching torchbench setup"
conda_benchmark_deps
pip_benchmark_deps
torchbench_setup_macos
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -224,7 +225,7 @@ test_torchbench_smoketest() {
print_cmake_info
echo "Launching torchbench setup"
conda_benchmark_deps
pip_benchmark_deps
# shellcheck disable=SC2119,SC2120
torchbench_setup_macos
@ -232,8 +233,8 @@ test_torchbench_smoketest() {
mkdir -p "$TEST_REPORTS_DIR"
local device=mps
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo)
local hf_models=(GoogleFnet YituTechConvBert)
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor)
local hf_models=(GoogleFnet YituTechConvBert Speech2Text2ForCausalLM)
for backend in eager inductor; do
@ -258,10 +259,10 @@ test_torchbench_smoketest() {
if [ "$backend" == "inductor" ]; then
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv" || true
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_performance.csv" || true
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/huggingface.py \
--accuracy --only "$model" --backend "$backend" --inference --devices "$device" "$dtype_arg" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_accuracy.csv" || true
--output "$TEST_REPORTS_DIR/inductor_${backend}_huggingface_${dtype}_inference_${device}_accuracy.csv" || true
fi
done
done
@ -289,7 +290,7 @@ test_hf_perf() {
print_cmake_info
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
conda_benchmark_deps
pip_benchmark_deps
torchbench_setup_macos
echo "Launching HuggingFace training perf run"
@ -305,7 +306,7 @@ test_timm_perf() {
print_cmake_info
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
conda_benchmark_deps
pip_benchmark_deps
torchbench_setup_macos
echo "Launching timm training perf run"

View File

@ -820,16 +820,7 @@ test_inductor_torchbench_smoketest_perf() {
done
}
test_inductor_get_core_number() {
if [[ "${TEST_CONFIG}" == *aarch64* ]]; then
echo "$(($(lscpu | grep 'Cluster(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per cluster:' | awk '{print $4}')))"
else
echo "$(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))"
fi
}
test_inductor_set_cpu_affinity(){
#set jemalloc
JEMALLOC_LIB="$(find /usr/lib -name libjemalloc.so.2)"
export LD_PRELOAD="$JEMALLOC_LIB":"$LD_PRELOAD"
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
@ -841,14 +832,23 @@ test_inductor_set_cpu_affinity(){
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
fi
cores=$(test_inductor_get_core_number)
# Set number of cores to 16 on Aarch64 for performance runs.
# Use nproc here instead of lscpu because it takes into account cgroups slice
cpus=$(nproc)
thread_per_core=$(lscpu | grep 'Thread(s) per core:' | awk '{print $4}')
cores=$((cpus / thread_per_core))
# Set number of cores to 16 on aarch64 for performance runs
if [[ "${TEST_CONFIG}" == *aarch64* && $cores -gt 16 ]]; then
cores=16
fi
export OMP_NUM_THREADS=$cores
end_core=$((cores-1))
export TASKSET="taskset -c 0-$end_core"
# Handle cgroups slice start and end CPU
start_cpu=$(python -c 'import os; print(min(os.sched_getaffinity(0)))')
# Leaving one physical CPU for other tasks
end_cpu=$(($(python -c 'import os; print(max(os.sched_getaffinity(0)))') - thread_per_core))
export TASKSET="taskset -c $start_cpu-$end_cpu"
}
test_inductor_torchbench_cpu_smoketest_perf(){

View File

@ -38,7 +38,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
fi
# TODO: Move both of them to Windows AMI
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0 pytest-subtests==0.13.1
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver==4.12.2.0

View File

@ -7,7 +7,7 @@ if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%
if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%
:: activate visual studio
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" arm64
where cl.exe
cd %DEPENDENCIES_DIR%

View File

@ -7,7 +7,7 @@ if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%
if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%
:: activate visual studio
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" arm64
where cl.exe
:: Clone OpenBLAS

View File

@ -2,7 +2,7 @@
cd %PYTORCH_ROOT%
:: activate visual studio
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" arm64
where cl.exe
:: create virtual environment

View File

@ -21,7 +21,7 @@ if %ENABLE_APL% == 1 (
)
:: activate visual studio
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" arm64
where cl.exe
:: change to source directory

View File

@ -21,7 +21,7 @@ if %ENABLE_APL% == 1 (
)
:: activate visual studio
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" arm64
where cl.exe
:: change to source directory

View File

@ -33,7 +33,7 @@ pushd tmp
set VC_VERSION_LOWER=14
set VC_VERSION_UPPER=36
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" arm64
set install_root=%CD%
set INCLUDE=%INCLUDE%;%install_root%\include;%install_root%\include\torch\csrc\api\include

View File

@ -3,6 +3,8 @@ if "%VC_YEAR%" == "2022" powershell windows/internal/vs2022_install.ps1
set VC_VERSION_LOWER=17
set VC_VERSION_UPPER=18
:: Please don't delete VS2019 as an alternative, in case some Windows compiler issue.
:: Reference: https://github.com/pytorch/pytorch/issues/145702#issuecomment-2858693930
if "%VC_YEAR%" == "2019" (
set VC_VERSION_LOWER=16
set VC_VERSION_UPPER=17

View File

@ -9,7 +9,7 @@ if [[ "$OS" != "windows-arm64" ]]; then
export USE_SCCACHE=1
export SCCACHE_BUCKET=ossci-compiler-cache
export SCCACHE_IGNORE_SERVER_IO_ERROR=1
export VC_YEAR=2019
export VC_YEAR=2022
fi
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then

View File

@ -4,7 +4,7 @@ set -eux -o pipefail
source "${BINARY_ENV_FILE:-/c/w/env}"
export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export VC_YEAR=2019
export VC_YEAR=2022
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
export VC_YEAR=2022

View File

@ -5,7 +5,7 @@ title: "DISABLED [WORKFLOW_NAME] / [PLATFORM_NAME] / [JOB_NAME]"
labels: "module: ci"
---
> For example, DISABLED pull / win-vs2019-cpu-py3 / test (default). Once
> For example, DISABLED pull / win-vs2022-cpu-py3 / test (default). Once
> created, the job will be disabled within 15 minutes. You can check the
> list of disabled jobs at https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json

View File

@ -0,0 +1,111 @@
name: 🚀 Release highlight for proposed Feature
description: Submit a Release highlight for proposed Feature
labels: ["release-feature-request"]
body:
- type: textarea
attributes:
label: Release highlight for proposed Feature
description: >
Example: “A torch.special module, analogous to SciPy's special module.”
- type: input
id: contact
attributes:
label: Point(s) of contact
description: How can we get in touch with you if we need more info?
placeholder: ex. github username
validations:
required: false
- type: dropdown
attributes:
label: Release Mode (pytorch/pytorch features only)
description: |
If "out-of-tree", please include the GH repo name
options:
- In-tree
- Out-of-tree
validations:
required: true
- type: textarea
attributes:
label: Out-Of-Tree Repo
description: >
please include the GH repo name
validations:
required: false
- type: textarea
attributes:
label: Description and value to the user
description: >
Please provide a brief description of the feature and how it will benefit the user.
validations:
required: false
- type: textarea
attributes:
label: Link to design doc, GitHub issues, past submissions, etc
validations:
required: false
- type: textarea
attributes:
label: What feedback adopters have provided
description: >
Please list users/teams that have tried the feature and provided feedback. If that feedback motivated material changes (API, doc, etc..), a quick overview of the changes and the status (planned, in progress, implemented) would be helpful as well.
validations:
required: false
- type: dropdown
attributes:
label: Plan for documentations / tutorials
description: |
Select One of the following options
options:
- Tutorial exists
- Will submit a PR to pytorch/tutorials
- Will submit a PR to a repo
- Tutorial is not needed
validations:
required: true
- type: textarea
attributes:
label: Additional context for tutorials
description: >
Please provide a link for existing tutorial or link to a repo or context for why tutorial is not needed.
validations:
required: false
- type: dropdown
attributes:
label: Marketing/Blog Coverage
description: |
Are you requesting feature Inclusion in the release blogs?
options:
- "Yes"
- "No"
validations:
required: true
- type: textarea
attributes:
label: Are you requesting other marketing assistance with this feature?
description: >
E.g. supplementary blogs, social media amplification, etc.
validations:
required: false
- type: textarea
attributes:
label: Release Version
description: >
Please include release version for marketing coverage.
validations:
required: false
- type: textarea
attributes:
label: OS / Platform / Compute Coverage
description: >
Please list the platforms supported by the proposed feature. If the feature supports all the platforms, write "all". Goal of this section is to clearly share if this feature works in all PyTorch configurations or is it limited to only certain platforms/configurations (e.g. CPU only, GPU only, Linux only, etc...)
validations:
required: false
- type: textarea
attributes:
label: Testing Support (CI, test cases, etc..)
description: >
Please provide an overview of test coverage. This includes unit testing and integration testing, but if E2E validation testing has been done to show that the feature works for a certain set of use cases or models please mention that as well.
validations:
required: false

View File

@ -45,6 +45,7 @@ self-hosted-runner:
- windows.g5.4xlarge.nvidia.gpu
# Windows ARM64 runners
- windows-11-arm64
- windows-11-arm64-preview
# Organization-wide AMD-hosted runners
# MI2xx runners
- linux.rocm.gpu

View File

@ -0,0 +1,38 @@
name: Reuse old wheel if possible
description:
Reuse old wheel if possible
inputs:
build-environment:
description: Build environment
required: true
run-id:
description: Workflow run ID
required: true
github-token:
description: GitHub token
required: true
outputs:
reuse:
description: Whether the wheel is reused or not
value: ${{ steps.check-file-changes.outputs.reuse }}
runs:
using: composite
steps:
# Check out pytorch with fetch depth 0
- name: Check file changes
id: check-file-changes
shell: bash
continue-on-error: true
env:
GITHUB_TOKEN: ${{ inputs.github-token }}
run: |
set -x
python3 ${GITHUB_ACTION_PATH}/reuse_old_whl.py \
--build-environment "${{ inputs.build-environment }}" \
--run-id "${{ inputs.run-id }}" \
--github-ref "${{ github.ref }}"

View File

@ -0,0 +1,289 @@
import argparse
import os
import subprocess
from functools import lru_cache
from pathlib import Path
from typing import Any, cast, Optional
import requests
FORCE_REBUILD_LABEL = "ci-force-rebuild"
@lru_cache
def get_merge_base() -> str:
merge_base = subprocess.check_output(
["git", "merge-base", "HEAD", "origin/main"],
text=True,
stderr=subprocess.DEVNULL,
).strip()
# Remove this when we turn this off for the main branch
if merge_base == get_head_sha():
print("Merge base is the same as HEAD, using HEAD^")
merge_base = subprocess.check_output(
["git", "rev-parse", "HEAD^"],
text=True,
stderr=subprocess.DEVNULL,
).strip()
print(f"Merge base: {merge_base}")
return merge_base
@lru_cache
def get_head_sha() -> str:
sha = subprocess.check_output(
["git", "rev-parse", "HEAD"],
text=True,
stderr=subprocess.DEVNULL,
).strip()
return sha
def is_main_branch() -> bool:
return False
# Testing on main branch for now
# print(
# f"Checking if we are on main branch: merge base {get_merge_base()}, head {get_head_sha()}"
# )
# return get_merge_base() == get_head_sha()
def query_github_api(url: str) -> Any:
headers = {
"Accept": "application/vnd.github.v3+json",
"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
}
response = requests.get(url, headers=headers)
return response.json()
@lru_cache
def check_labels_for_pr() -> bool:
# Check if the current commit is part of a PR and if it has the
# FORCE_REBUILD_LABEL
head_sha = get_head_sha()
url = f"https://api.github.com/repos/pytorch/pytorch/commits/{head_sha}/pulls"
response = query_github_api(url)
print(
f"Found {len(response)} PRs for commit {head_sha}: {[pr['number'] for pr in response]}"
)
for pr in response:
labels = pr.get("labels", [])
for label in labels:
if label["name"] == FORCE_REBUILD_LABEL:
print(f"Found label {FORCE_REBUILD_LABEL} in PR {pr['number']}.")
return True
return False
def check_issue_open() -> bool:
# Check if issue #153759 is open. This is the config issue for quickly
# forcing everyone to build
url = "https://api.github.com/repos/pytorch/pytorch/issues/153759"
response = query_github_api(url)
if response.get("state") == "open":
print("Issue #153759 is open.")
return True
else:
print("Issue #153759 is not open.")
return False
def get_workflow_id(run_id: str) -> Optional[str]:
# Get the workflow ID that corresponds to the file for the run ID
url = f"https://api.github.com/repos/pytorch/pytorch/actions/runs/{run_id}"
response = query_github_api(url)
if "workflow_id" in response:
print(f"Found workflow ID for run ID {run_id}: {response['workflow_id']}")
return cast(str, response["workflow_id"])
else:
print("No workflow ID found.")
return None
def ok_changed_file(file: str) -> bool:
# Return true if the file is in the list of allowed files to be changed to
# reuse the old whl
if (
file.startswith("torch/")
and file.endswith(".py")
and not file.startswith("torch/csrc/")
):
return True
if file.startswith("test/") and file.endswith(".py"):
return True
return False
def check_changed_files(sha: str) -> bool:
# Return true if all the changed files are in the list of allowed files to
# be changed to reuse the old whl
changed_files = (
subprocess.check_output(
["git", "diff", "--name-only", sha, "HEAD"],
text=True,
stderr=subprocess.DEVNULL,
)
.strip()
.split()
)
print(f"Checking changed files between {sha} and HEAD:")
for file in changed_files:
if not ok_changed_file(file):
print(f" File {file} is not allowed to be changed.")
return False
else:
print(f" File {file} is allowed to be changed.")
return True
def find_old_whl(workflow_id: str, build_environment: str, sha: str) -> bool:
# Find the old whl on s3 and download it to artifacts.zip
if build_environment is None:
print("BUILD_ENVIRONMENT is not set.")
return False
print(f"SHA: {sha}, workflow_id: {workflow_id}")
workflow_runs = query_github_api(
f"https://api.github.com/repos/pytorch/pytorch/actions/workflows/{workflow_id}/runs?head_sha={sha}&branch=main&per_page=100"
)
if workflow_runs.get("total_count", 0) == 0:
print("No workflow runs found.")
return False
for run in workflow_runs.get("workflow_runs", []):
# Look in s3 for the old whl
run_id = run["id"]
try:
url = f"https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/{run_id}/{build_environment}/artifacts.zip"
print(f"Checking for old whl at {url}")
response = requests.get(
url,
)
if response.status_code == 200:
with open("artifacts.zip", "wb") as f:
f.write(response.content)
print(f"Found old whl file from s3: {url}")
return True
except requests.RequestException as e:
print(f"Error checking for old whl: {e}")
continue
return False
def unzip_artifact_and_replace_files() -> None:
# Unzip the artifact and replace files
subprocess.check_output(
["unzip", "-o", "artifacts.zip", "-d", "artifacts"],
)
os.remove("artifacts.zip")
# Rename wheel into zip
wheel_path = Path("artifacts/dist").glob("*.whl")
for path in wheel_path:
new_path = path.with_suffix(".zip")
os.rename(path, new_path)
print(f"Renamed {path} to {new_path}")
print(new_path.stem)
# Unzip the wheel
subprocess.check_output(
["unzip", "-o", new_path, "-d", f"artifacts/dist/{new_path.stem}"],
)
# Copy python files into the artifact
subprocess.check_output(
["rsync", "-avz", "torch", f"artifacts/dist/{new_path.stem}"],
)
# Zip the wheel back
subprocess.check_output(
["zip", "-r", f"{new_path.stem}.zip", "."],
cwd=f"artifacts/dist/{new_path.stem}",
)
subprocess.check_output(
[
"mv",
f"artifacts/dist/{new_path.stem}/{new_path.stem}.zip",
f"artifacts/dist/{new_path.stem}.whl",
],
)
# Remove the extracted folder
subprocess.check_output(
["rm", "-rf", f"artifacts/dist/{new_path.stem}"],
)
# Rezip the artifact
subprocess.check_output(["zip", "-r", "artifacts.zip", "."], cwd="artifacts")
subprocess.check_output(
["mv", "artifacts/artifacts.zip", "."],
)
return None
def set_output() -> None:
# Disable for now so we can monitor first
# pass
if os.getenv("GITHUB_OUTPUT"):
with open(str(os.getenv("GITHUB_OUTPUT")), "a") as env:
print("reuse=true", file=env)
else:
print("::set-output name=reuse::true")
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Check for old whl files.")
parser.add_argument("--run-id", type=str, required=True, help="Workflow ID")
parser.add_argument(
"--build-environment", type=str, required=True, help="Build environment"
)
parser.add_argument(
"--github-ref",
type=str,
)
return parser.parse_args()
def can_reuse_whl(args: argparse.Namespace) -> bool:
# if is_main_branch() or (
# args.github_ref
# and any(
# args.github_ref.startswith(x)
# for x in ["refs/heads/release", "refs/tags/v", "refs/heads/main"]
# )
# ):
# print("On main branch or release branch, rebuild whl")
# return False
if check_labels_for_pr():
print(f"Found {FORCE_REBUILD_LABEL} label on PR, rebuild whl")
return False
if check_issue_open():
print("Issue #153759 is open, rebuild whl")
return False
if not check_changed_files(get_merge_base()):
print("Cannot use old whl due to the changed files, rebuild whl")
return False
workflow_id = get_workflow_id(args.run_id)
if workflow_id is None:
print("No workflow ID found, rebuild whl")
return False
if not find_old_whl(workflow_id, args.build_environment, get_merge_base()):
print("No old whl found, rebuild whl")
# TODO: go backwards from merge base to find more runs
return False
return True
if __name__ == "__main__":
args = parse_args()
if can_reuse_whl(args):
print("Reusing old whl")
unzip_artifact_and_replace_files()
set_output()

View File

@ -1,6 +1,6 @@
name: upload-utilization-stats
description: Upload utilization stats to artifacts
description: Upload utilization stats to artifacts.
inputs:
workflow_run_id:
@ -23,6 +23,17 @@ inputs:
type: string
description: 'the job name of the test'
required: True
local_path:
type: string
description: 'the local path to the utilization stats file'
required: False
default: ''
artifact_prefix:
type: string
description: |
'the prefix of the raw utilization data, for data stored in zip file, this is the prefix of the parent zip file'
default: ""
required: False
runs:
using: composite
@ -35,6 +46,8 @@ runs:
echo "workflow_Name: ${{inputs.workflow_name}}"
echo "job_id: ${{inputs.job_id}}"
echo "job_name: ${{inputs.job_name}}"
echo "artifact_prefix: ${{inputs.artifact_prefix}}"
python3 --version
- uses: nick-fields/retry@v3.0.0
name: Setup dependencies
with:
@ -53,4 +66,6 @@ runs:
--workflow-name "${{inputs.workflow_name}}" \
--workflow-run-attempt "${{inputs.workflow_attempt}}" \
--job-id "${{inputs.job_id}}" \
--job-name "${{inputs.job_name}}"
--job-name "${{inputs.job_name}}" \
--local-path "${{inputs.local_path}}" \
--artifact-prefix "${{inputs.artifact_prefix}}"

View File

@ -1 +1 @@
ea5de17755d657508c84c4dce8970b614008adcf
1a8f6213b0b61efc6a4862bc45b853551a93dbb6

View File

@ -1 +1 @@
373ffb19dc470f4423a3176a4133f8f4b3cdb5bd
e03a63be43e33596f7f0a43b0f530353785e4a59

View File

@ -1 +1 @@
d23a6e1664d20707c11781299611436e1f0c104f
966da7e46f65d6d49df3e31214470a4fe5cc8e66

View File

@ -1 +1 @@
8d9e34b352af09c81ff8df448fd27f9c4aae1382
edc1a882d872dd7f1362e4312fd045a1d81b3355

View File

@ -398,7 +398,10 @@
- torch/_inductor/codegen/cpp_micro_gemm.py
- torch/_inductor/codegen/cpp_template_kernel.py
- torch/_inductor/codegen/cpp_template.py
- torch/_inductor/codegen/cpp_bmm_template.py
- torch/_inductor/codegen/cpp_gemm_template.py
- torch/_inductor/codegen/cpp_grouped_gemm_template.py
- torch/_inductor/codegen/cpp_flex_attention_template.py
- torch/csrc/inductor/cpp_prefix.h
- test/inductor/test_mkldnn_pattern_matcher.py
- test/inductor/test_cpu_repro.py
@ -406,6 +409,7 @@
- test/inductor/test_cpu_select_algorithm.py
- aten/src/ATen/cpu/**
- aten/src/ATen/native/quantized/cpu/**
- aten/src/ATen/test/vec_test_all_types.*
- test/quantization/core/test_quantized_op.py
- torch/ao/quantization/quantizer/x86_inductor_quantizer.py
- test/quantization/pt2e/test_x86inductor_quantizer.py
@ -413,6 +417,7 @@
- leslie-fang-intel
- jgong5
- EikanWang
- CaoE
mandatory_checks_name:
- EasyCLA
- Lint

View File

@ -11,16 +11,6 @@ jobs, but it also allows them to be cached properly to improve CI
reliability.
The list of support files are as follows:
* Conda:
* conda-env-iOS. This is used by iOS build and test jobs to setup the
conda environment
* conda-env-macOS-ARM64. This is used by MacOS (m1, arm64) build and
test jobs to setup the conda environment
* conda-env-Linux-X64. This is used by Linux buck build and test jobs
to setup the conda environment
* Pip:
* pip-requirements-iOS.txt. This is used by iOS build and test jobs to
setup the pip environment
* pip-requirements-macOS.txt. This is used by MacOS build and test jobs to
setup the pip environment

View File

@ -1,8 +0,0 @@
cmake=3.22.*
mkl=2022.1.0
mkl-include=2022.1.0
ninja=1.10.2
numpy=1.23.3
pyyaml=6.0
setuptools=72.1.0
typing-extensions=4.11.0

View File

@ -1,7 +0,0 @@
blas=1.0
cmake=3.22.1
ninja=1.10.2
numpy=1.23.3
pyyaml=6.0
setuptools=72.1.0
typing-extensions=4.11.0

View File

@ -1,22 +1,6 @@
numpy=1.22.3
pyyaml=6.0
setuptools=72.1.0
cmake=3.22.*
typing-extensions=4.11.0
dataclasses=0.8
pip=22.2.2
pillow=10.0.1
pkg-config=0.29.2
wheel=0.37.1
# NB: This is intentionally held back because anaconda main doesn't
# have updated expecttest, but you don't /need/ the updated version
# to run the tests. In the meantime I need to figure out how to
# cajole anaconda into updating, or get the package from pypi instead...
expecttest=0.1.3
# Not pinning certifi so that we can always get the latest certificates
certifi
# Cross-compiling arm64 from x86-64 picks up 1.40.0 while testing on arm64
# itself only has up to 1.39.0 from upstream conda. Both work though
libuv>=1.39.0,<=1.40.0
pip=23.2.1
pkg-config=0.29.2
setuptools=72.1.0
wheel=0.37.1

View File

@ -1,4 +0,0 @@
# iOS simulator requirements
coremltools==5.0b5
protobuf==3.20.2
optree==0.13.0

View File

@ -1,33 +1,34 @@
boto3==1.35.42
hypothesis==6.56.4
cmake==3.25.*
expecttest==0.3.0
fbscribelogger==0.1.7
filelock==3.6.0
hypothesis==6.56.4
librosa>=0.6.2
mpmath==1.3.0
networkx==2.8.7
# Use numba-0.49.1 or older on Intel Macs, but 0.56.0 on M1 machines, as older numba is not available
numba==0.56.0; platform_machine == "arm64"
numba<=0.49.1; platform_machine != "arm64"
ninja==1.10.2.4
numba==0.59.0
numpy==1.26.4
opt-einsum>=3.3
psutil==5.9.1
nvidia-ml-py==11.525.84
optree==0.13.0
packaging==23.1
parameterized==0.8.1
pillow==10.0.1
protobuf==5.29.4
psutil==5.9.1
pygments==2.15.0
pytest==7.3.2
pytest-xdist==3.3.1
pytest-rerunfailures==10.3
pytest-cpp==2.3.0
pytest-flakefinder==1.1.0
pytest-rerunfailures==10.3
pytest-subtests==0.13.1
scipy==1.10.1
pytest-xdist==3.3.1
pytest==7.3.2
pyyaml==6.0.2
scipy==1.12.0
sympy==1.13.3
tensorboard==2.13.0
typing-extensions==4.12.2
unittest-xml-reporting<=3.2.0,>=2.0.0
xdoctest==1.1.0
filelock==3.6.0
pytest-cpp==2.3.0
z3-solver==4.12.2.0
tensorboard==2.13.0
optree==0.13.0
# NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
# which the stringify metadata is wrong when escaping double quote
protobuf==3.20.2
parameterized==0.8.1

View File

@ -28,12 +28,12 @@ def main() -> None:
issue = repo.get_issue(issue_number)
issue_labels = issue.labels
docathon_label_present = any(
label.name == "docathon-h1-2024" for label in issue_labels
label.name == "docathon-h1-2025" for label in issue_labels
)
# if the issue has a docathon label, add all labels from the issue to the PR.
if not docathon_label_present:
print("The 'docathon-h1-2024' label is not present in the issue.")
print("The 'docathon-h1-2025' label is not present in the issue.")
return
pull_request_labels = pull_request.get_labels()
pull_request_label_names = [label.name for label in pull_request_labels]

View File

@ -21,7 +21,7 @@ CUDA_STABLE = "12.6"
CUDA_ARCHES_FULL_VERSION = {
"11.8": "11.8.0",
"12.6": "12.6.3",
"12.8": "12.8.0",
"12.8": "12.8.1",
}
CUDA_ARCHES_CUDNN_VERSION = {
"11.8": "9",
@ -72,20 +72,20 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.8": (
"nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'"
"nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"xpu": (
"intel-cmplr-lib-rt==2025.1.1 | "

View File

@ -31,6 +31,9 @@ python3 -m tools.pyi.gen_pyi \
--deprecated-functions-path "tools/autograd/deprecated.yaml"
python3 torch/utils/data/datapipes/gen_pyi.py
# Also check generated pyi files
find torch -name '*.pyi' -exec git add --force -- "{}" +
RC=0
# Run lintrunner on all files
if ! lintrunner --force-color --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null; then
@ -41,6 +44,9 @@ if ! lintrunner --force-color --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS}
RC=1
fi
# Unstage temporally added pyi files
find torch -name '*.pyi' -exec git restore --staged -- "{}" +
# Use jq to massage the JSON lint output into GitHub Actions workflow commands.
jq --raw-output \
'"::\(if .severity == "advice" or .severity == "disabled" then "warning" else .severity end) file=\(.path),line=\(.line),col=\(.char),title=\(.code) \(.name)::" + (.description | gsub("\\n"; "%0A"))' \

View File

@ -27,6 +27,9 @@ unset ACCESS_TOKEN
# it does one job, stops and unregisters
registration_token=$(jq --raw-output .token "$token_file")
# workaround for https://gitlab.com/qemu-project/qemu/-/issues/2600
export DOTNET_EnableWriteXorExecute=0
./config.sh \
--unattended \
--ephemeral \
@ -44,8 +47,5 @@ rm -f "$token_file"
# and it doesn't work for non-root user
source venv/bin/activate
# workaround for https://gitlab.com/qemu-project/qemu/-/issues/2600
export DOTNET_EnableWriteXorExecute=0
# Run one job.
./run.sh

View File

@ -939,6 +939,12 @@ class GitHubPR:
summary=None,
)
# Making an exception for Apply lint auggestions/autoformat because the
# bot adds a merged label -> triggers workflow -> sometimes needs
# approval -> is read as failure, which results in a blocked merge, but
# this workflow doesn't provide mergability info
self.conclusions.pop("Apply lint suggestions", None)
return self.conclusions
def get_authors(self) -> dict[str, str]:

View File

@ -76,7 +76,7 @@ jobs:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
{%- if os == "windows-arm64" %}
runs-on: "windows-11-arm64"
runs-on: "windows-11-arm64-preview"
{%- else %}
{%- if branches == "nightly" %}
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge"
@ -102,23 +102,12 @@ jobs:
run: |
mkdir "%NIGHTLIES_PYTORCH_ROOT%"
mkdir "%PYTORCH_FINAL_PACKAGE_DIR%"
- name: Enable long paths
shell: cmd
run: |
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Git
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
- name: Git checkout PyTorch - recursive
uses: actions/checkout@v4
with:
path: "pytorch"
@ -172,7 +161,7 @@ jobs:
- !{{ config["build_name"] }}-build
- get-label-type
{%- if os == "windows-arm64" %}
runs-on: "windows-11-arm64"
runs-on: "windows-11-arm64-preview"
{%- else %}
{%- if config["gpu_arch_type"] == "cuda" %}
{%- if branches == "nightly" %}
@ -198,18 +187,11 @@ jobs:
echo BINARY_ENV_FILE=%RUNNER_TEMP%/env>> %GITHUB_ENV%
echo PYTORCH_FINAL_PACKAGE_DIR=%RUNNER_TEMP%/artifacts>> %GITHUB_ENV%
echo WIN_PACKAGE_WORK_DIR=%RUNNER_TEMP%>> %GITHUB_ENV%
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Populate binary env
- name: Enable long paths
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
@ -223,10 +205,6 @@ jobs:
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_python.bat"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Rust
shell: cmd
run: |

View File

@ -7,29 +7,25 @@ on:
ref:
type: string
required: true
run-url-lint:
type: boolean
required: false
default: false
jobs:
lint-urls:
if: ${{ inputs.run-url-lint }}
if: ${{ github.event_name != 'pull_request' || !contains(github.event.pull_request.labels.*.name, 'skip-url-lint') }}
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
with:
timeout: 120
runner: ${{ inputs.runner }}linux.2xlarge
docker-image: pytorch-linux-focal-linter
docker-image: ci-image:pytorch-linux-focal-linter
fetch-depth: 0
submodules: false
ref: ${{ inputs.ref }}
script: |
./scripts/lint_urls.sh $(
{ [ "${{ github.event_name }}" = "pull_request" ] \
&& git diff --name-only "${{ github.event.pull_request.base.sha }}...${{ github.event.pull_request.head.sha }}"; } \
|| \
{ [ "${{ github.event_name }}" = "push" ] \
&& git diff --name-only "${{ github.event.before }}...${{ github.sha }}"; }
if [ "${{ github.event_name }}" = "pull_request" ]; then
echo "${{ github.event.pull_request.base.sha }}" "${{ github.event.pull_request.head.sha }}"
else
echo "${{ github.event.before }}" "${{ github.sha }}"
fi
) || {
echo
echo "URL lint failed."
@ -44,17 +40,17 @@ jobs:
with:
timeout: 60
runner: ${{ inputs.runner }}linux.2xlarge
docker-image: pytorch-linux-focal-linter
docker-image: ci-image:pytorch-linux-focal-linter
fetch-depth: 0
submodules: false
ref: ${{ inputs.ref }}
script: |
./scripts/lint_xrefs.sh $(
{ [ "${{ github.event_name }}" = "pull_request" ] \
&& git diff --name-only "${{ github.event.pull_request.base.sha }}...${{ github.event.pull_request.head.sha }}"; } \
|| \
{ [ "${{ github.event_name }}" = "push" ] \
&& git diff --name-only "${{ github.event.before }}...${{ github.sha }}"; }
if [ "${{ github.event_name }}" = "pull_request" ]; then
echo "${{ github.event.pull_request.base.sha }}" "${{ github.event.pull_request.head.sha }}"
else
echo "${{ github.event.before }}" "${{ github.sha }}"
fi
) || {
echo
echo "Xref lint failed."

View File

@ -74,6 +74,32 @@ on:
Overwrite the number of jobs to use for the build
required: false
type: string
disable-monitor:
description: |
Disable utilization monitoring for build job
required: false
type: boolean
default: false
monitor-log-interval:
description: |
Set the interval for the monitor script to log utilization.
required: false
type: number
default: 5
monitor-data-collect-interval:
description: |
Set the interval for the monitor script to collect data.
required: false
type: number
default: 1
allow-reuse-old-whl:
description: |
If set, the build try to pull an old wheel from s3 that was built on a
commit with no cpp changes from this commit
required: false
type: boolean
default: false
secrets:
HUGGING_FACE_HUB_TOKEN:
@ -132,6 +158,15 @@ jobs:
role-session-name: gha-linux-build
aws-region: us-east-1
- name: Check if can use old whl build
id: use-old-whl
uses: ./.github/actions/reuse-old-whl
if: ${{ inputs.allow-reuse-old-whl && github.event_name == 'push' }}
with:
build-environment: ${{ inputs.build-environment }}
run-id: ${{ github.run_id }}
github-token: ${{ secrets.GITHUB_TOKEN }}
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
@ -141,7 +176,7 @@ jobs:
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
if: inputs.build-environment != 'linux-s390x-binary-manywheel' && steps.use-old-whl.outputs.reuse != 'true'
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
@ -151,7 +186,7 @@ jobs:
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
if: inputs.build-environment != 'linux-s390x-binary-manywheel' && steps.use-old-whl.outputs.reuse != 'true'
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -176,17 +211,38 @@ jobs:
selected-test-configs: ${{ inputs.selected-test-configs }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Start monitoring script
id: monitor-script
if: ${{ !inputs.disable-monitor }}
shell: bash
continue-on-error: true
env:
JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
WORKFLOW_NAME: ${{ github.workflow }}
WORKFLOW_RUN_ID: ${{github.run_id}}
MONITOR_LOG_INTERVAL: ${{ inputs.monitor-log-interval }}
MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
run: |
mkdir -p ../../usage_logs
python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7
python3 -m tools.stats.monitor \
--log-interval "$MONITOR_LOG_INTERVAL" \
--data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" \
> "../../usage_logs/usage_log_build_${JOB_ID}.txt" 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"
- name: Download pytest cache
uses: ./.github/actions/pytest-cache-download
continue-on-error: true
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
if: inputs.build-environment != 'linux-s390x-binary-manywheel' && steps.use-old-whl.outputs.reuse != 'true'
with:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
s3_bucket: ${{ inputs.s3-bucket }}
- name: Build
if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''
if: (steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == '') && steps.use-old-whl.outputs.reuse != 'true'
id: build
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
@ -280,14 +336,23 @@ jobs:
END_TIME=$(date +%s)
echo "build_time=$((END_TIME - START_TIME))" >> "$GITHUB_OUTPUT"
- name: Stop monitoring script
if: ${{ always() && steps.monitor-script.outputs.monitor-script-pid }}
shell: bash
continue-on-error: true
env:
MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }}
run: |
kill "$MONITOR_SCRIPT_PID"
- name: Archive artifacts into zip
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped'
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && steps.use-old-whl.outputs.reuse != 'true'
run: |
zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .additional_ci_files
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@baba72d0712b404f646cebe0730933554ebce96a # v5.1.0
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.build-environment != 'linux-s390x-binary-manywheel'
if: inputs.build-generates-artifacts && (steps.build.outcome != 'skipped' || steps.use-old-whl.outputs.reuse == 'true') && inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
@ -297,13 +362,32 @@ jobs:
- name: Store PyTorch Build Artifacts for s390x
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.build-environment == 'linux-s390x-binary-manywheel'
if: inputs.build-generates-artifacts && (steps.build.outcome != 'skipped' || steps.use-old-whl.outputs.reuse == 'true') && inputs.build-environment == 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
if-no-files-found: error
path: artifacts.zip
- name: copy logs
shell: bash
if: ${{ always() && steps.build.outcome != 'skipped' && !inputs.disable-monitor && inputs.build-environment != 'linux-s390x-binary-manywheel'}}
continue-on-error: true
run: |
rm -f ./usage_logs
mkdir -p ./usage_logs
cp ../../usage_logs/usage_log_build_*.txt ./usage_logs/
- name: Upload raw usage log to s3
if: ${{ always() && steps.build.outcome != 'skipped' && !inputs.disable-monitor && inputs.build-environment != 'linux-s390x-binary-manywheel'}}
uses: seemethere/upload-artifact-s3@v5
with:
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14
if-no-files-found: warn
path: usage_logs/usage_log_build_*.txt
- name: Upload sccache stats
if: steps.build.outcome != 'skipped' && inputs.build-environment != 'linux-s390x-binary-manywheel'
uses: ./.github/actions/upload-sccache-stats
@ -311,6 +395,18 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
build-time: ${{ steps.build.outputs.build_time }}
- name: Upload utilization stats
if: ${{ always() && steps.build.outcome != 'skipped' && !inputs.disable-monitor && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
continue-on-error: true
uses: ./.github/actions/upload-utilization-stats
with:
job_id: ${{ steps.get-job-id.outputs.job-id }}
job_name: ${{ steps.get-job-id.outputs.job-name }}
workflow_name: ${{ github.workflow }}
workflow_run_id: ${{github.run_id}}
workflow_attempt: ${{github.run_attempt}}
artifact_prefix: usage_log_build_${{ steps.get-job-id.outputs.job-id }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always() && inputs.build-environment != 'linux-s390x-binary-manywheel'

View File

@ -376,7 +376,7 @@ jobs:
- name: Upload pytest cache if tests failed
uses: ./.github/actions/pytest-cache-upload
continue-on-error: true
if: failure() && steps.test.conclusion && steps.test.conclusion == 'failure'
if: failure() && steps.test.conclusion && steps.test.conclusion == 'failure' && inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
cache_dir: .pytest_cache
shard: ${{ matrix.shard }}
@ -431,7 +431,7 @@ jobs:
path: ./**/core.[1-9]*
- name: Upload utilization stats
if: ${{ always() && steps.test.conclusion && steps.test.conclusion != 'skipped' && !inputs.disable-monitor }}
if: ${{ always() && steps.test.conclusion && steps.test.conclusion != 'skipped' && !inputs.disable-monitor && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
continue-on-error: true
uses: ./.github/actions/upload-utilization-stats
with:

View File

@ -30,7 +30,7 @@ on:
python-version:
required: false
type: string
default: "3.9"
default: "3.12"
description: |
The python version to be used. Will be 3.9 by default
test-matrix:
@ -85,8 +85,9 @@ jobs:
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
with:
python-version: ${{ inputs.python-version }}
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
pip-requirements-file: .github/requirements/pip-requirements-${{ runner.os }}.txt
environment-file: .github/requirements/conda-env-macOS-ARM64
pip-requirements-file: .github/requirements/pip-requirements-macOS.txt
default-packages: ""
- name: Install sccache (only for non-forked PRs, and pushes to trunk)
uses: nick-fields/retry@7152eba30c6575329ac0576536151aca5a72780e # v3.0.0

View File

@ -21,7 +21,7 @@ on:
python-version:
required: false
type: string
default: "3.9"
default: "3.12"
description: |
The python version to be used. Will be 3.9 by default
timeout-minutes:
@ -144,8 +144,9 @@ jobs:
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
with:
python-version: ${{ inputs.python-version }}
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
pip-requirements-file: .github/requirements/pip-requirements-${{ runner.os }}.txt
environment-file: .github/requirements/conda-env-macOS-ARM64
pip-requirements-file: .github/requirements/pip-requirements-macOS.txt
default-packages: ""
- name: Parse ref
id: parse-ref
@ -278,6 +279,7 @@ jobs:
workflow_name: ${{ github.workflow }}
workflow_run_id: ${{github.run_id}}
workflow_attempt: ${{github.run_attempt}}
local_path: usage_log.txt
- name: Clean up disk space
if: always()

View File

@ -28,14 +28,14 @@ jobs:
repo: context.repo.repo,
issue_number: issueNumber
});
const hasLabel = issue.labels.some(label => label.name === 'docathon-h1-2024');
const hasLabel = issue.labels.some(label => label.name === 'docathon-h1-2025');
if (hasLabel) {
if (issue.assignee !== null) {
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: issueNumber,
body: "The issue is already assigned. Please pick an opened and unnasigned issue with the [docathon-h1-2024 label](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3Adocathon-h1-2024)."
body: "The issue is already assigned. Please pick an opened and unnasigned issue with the [docathon-h1-2025 label](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3Adocathon-h1-2025)."
});
} else {
await github.rest.issues.addAssignees({
@ -46,7 +46,7 @@ jobs:
});
}
} else {
const commmentMessage = "This issue does not have the correct label. Please pick an opened and unnasigned issue with the [docathon-h1-2024 label](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3Adocathon-h1-2024)."
const commmentMessage = "This issue does not have the correct label. Please pick an opened and unnasigned issue with the [docathon-h1-2025 label](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3Adocathon-h1-2025)."
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,

View File

@ -49,13 +49,11 @@ jobs:
matrix:
runner: [linux.12xlarge]
docker-image-name: [
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11,
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks,
pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9,
pytorch-linux-focal-py3.9-clang10,
pytorch-linux-focal-py3.11-clang10,
@ -110,18 +108,6 @@ jobs:
always-rebuild: true
push: true
- name: Push docker image to old name
shell: bash
env:
ECR_DOCKER_IMAGE: ${{ steps.build-docker-image.outputs.docker-image }}
run: |
# This can be deleted after people have rebased their PRs/the new name has been in main for a while
set -euox pipefail
docker_image_name="308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${{ matrix.docker-image-name }}"
foldersha=${ECR_DOCKER_IMAGE##*-}
docker tag "${ECR_DOCKER_IMAGE}" "${docker_image_name}:${foldersha}"
docker push "${docker_image_name}:${foldersha}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:

View File

@ -136,7 +136,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -252,7 +252,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -368,7 +368,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -484,7 +484,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -600,7 +600,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -716,7 +716,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}

View File

@ -155,7 +155,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_8-test: # Testing

View File

@ -269,7 +269,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_8-test: # Testing
@ -882,7 +882,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_8-test: # Testing
@ -1563,7 +1563,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_8-test: # Testing
@ -2176,7 +2176,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_8-test: # Testing
@ -2789,7 +2789,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_8-test: # Testing
@ -3402,7 +3402,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_8-test: # Testing

View File

@ -50,7 +50,7 @@ jobs:
libtorch-cpu-shared-with-deps-debug-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "windows-11-arm64"
runs-on: "windows-11-arm64-preview"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@ -77,23 +77,12 @@ jobs:
run: |
mkdir "%NIGHTLIES_PYTORCH_ROOT%"
mkdir "%PYTORCH_FINAL_PACKAGE_DIR%"
- name: Enable long paths
shell: cmd
run: |
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Git
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
- name: Git checkout PyTorch - recursive
uses: actions/checkout@v4
with:
path: "pytorch"
@ -138,7 +127,7 @@ jobs:
needs:
- libtorch-cpu-shared-with-deps-debug-build
- get-label-type
runs-on: "windows-11-arm64"
runs-on: "windows-11-arm64-preview"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@ -160,18 +149,11 @@ jobs:
echo BINARY_ENV_FILE=%RUNNER_TEMP%/env>> %GITHUB_ENV%
echo PYTORCH_FINAL_PACKAGE_DIR=%RUNNER_TEMP%/artifacts>> %GITHUB_ENV%
echo WIN_PACKAGE_WORK_DIR=%RUNNER_TEMP%>> %GITHUB_ENV%
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Populate binary env
- name: Enable long paths
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
@ -185,10 +167,6 @@ jobs:
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_python.bat"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Rust
shell: cmd
run: |

View File

@ -50,7 +50,7 @@ jobs:
libtorch-cpu-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "windows-11-arm64"
runs-on: "windows-11-arm64-preview"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@ -77,23 +77,12 @@ jobs:
run: |
mkdir "%NIGHTLIES_PYTORCH_ROOT%"
mkdir "%PYTORCH_FINAL_PACKAGE_DIR%"
- name: Enable long paths
shell: cmd
run: |
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Git
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
- name: Git checkout PyTorch - recursive
uses: actions/checkout@v4
with:
path: "pytorch"
@ -138,7 +127,7 @@ jobs:
needs:
- libtorch-cpu-shared-with-deps-release-build
- get-label-type
runs-on: "windows-11-arm64"
runs-on: "windows-11-arm64-preview"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@ -160,18 +149,11 @@ jobs:
echo BINARY_ENV_FILE=%RUNNER_TEMP%/env>> %GITHUB_ENV%
echo PYTORCH_FINAL_PACKAGE_DIR=%RUNNER_TEMP%/artifacts>> %GITHUB_ENV%
echo WIN_PACKAGE_WORK_DIR=%RUNNER_TEMP%>> %GITHUB_ENV%
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Populate binary env
- name: Enable long paths
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
@ -185,10 +167,6 @@ jobs:
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_python.bat"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Rust
shell: cmd
run: |

View File

@ -50,7 +50,7 @@ jobs:
wheel-py3_11-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "windows-11-arm64"
runs-on: "windows-11-arm64-preview"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@ -73,23 +73,12 @@ jobs:
run: |
mkdir "%NIGHTLIES_PYTORCH_ROOT%"
mkdir "%PYTORCH_FINAL_PACKAGE_DIR%"
- name: Enable long paths
shell: cmd
run: |
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Git
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
- name: Git checkout PyTorch - recursive
uses: actions/checkout@v4
with:
path: "pytorch"
@ -134,7 +123,7 @@ jobs:
needs:
- wheel-py3_11-cpu-build
- get-label-type
runs-on: "windows-11-arm64"
runs-on: "windows-11-arm64-preview"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@ -152,18 +141,11 @@ jobs:
echo BINARY_ENV_FILE=%RUNNER_TEMP%/env>> %GITHUB_ENV%
echo PYTORCH_FINAL_PACKAGE_DIR=%RUNNER_TEMP%/artifacts>> %GITHUB_ENV%
echo WIN_PACKAGE_WORK_DIR=%RUNNER_TEMP%>> %GITHUB_ENV%
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Populate binary env
- name: Enable long paths
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
@ -177,10 +159,6 @@ jobs:
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_python.bat"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Rust
shell: cmd
run: |
@ -219,7 +197,7 @@ jobs:
wheel-py3_12-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "windows-11-arm64"
runs-on: "windows-11-arm64-preview"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@ -242,23 +220,12 @@ jobs:
run: |
mkdir "%NIGHTLIES_PYTORCH_ROOT%"
mkdir "%PYTORCH_FINAL_PACKAGE_DIR%"
- name: Enable long paths
shell: cmd
run: |
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Git
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
- name: Git checkout PyTorch - recursive
uses: actions/checkout@v4
with:
path: "pytorch"
@ -303,7 +270,7 @@ jobs:
needs:
- wheel-py3_12-cpu-build
- get-label-type
runs-on: "windows-11-arm64"
runs-on: "windows-11-arm64-preview"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@ -321,18 +288,11 @@ jobs:
echo BINARY_ENV_FILE=%RUNNER_TEMP%/env>> %GITHUB_ENV%
echo PYTORCH_FINAL_PACKAGE_DIR=%RUNNER_TEMP%/artifacts>> %GITHUB_ENV%
echo WIN_PACKAGE_WORK_DIR=%RUNNER_TEMP%>> %GITHUB_ENV%
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Populate binary env
- name: Enable long paths
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
@ -346,10 +306,6 @@ jobs:
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_python.bat"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Rust
shell: cmd
run: |
@ -388,7 +344,7 @@ jobs:
wheel-py3_13-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "windows-11-arm64"
runs-on: "windows-11-arm64-preview"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@ -411,23 +367,12 @@ jobs:
run: |
mkdir "%NIGHTLIES_PYTORCH_ROOT%"
mkdir "%PYTORCH_FINAL_PACKAGE_DIR%"
- name: Enable long paths
shell: cmd
run: |
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Git
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
- name: Git checkout PyTorch - recursive
uses: actions/checkout@v4
with:
path: "pytorch"
@ -472,7 +417,7 @@ jobs:
needs:
- wheel-py3_13-cpu-build
- get-label-type
runs-on: "windows-11-arm64"
runs-on: "windows-11-arm64-preview"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
@ -490,18 +435,11 @@ jobs:
echo BINARY_ENV_FILE=%RUNNER_TEMP%/env>> %GITHUB_ENV%
echo PYTORCH_FINAL_PACKAGE_DIR=%RUNNER_TEMP%/artifacts>> %GITHUB_ENV%
echo WIN_PACKAGE_WORK_DIR=%RUNNER_TEMP%>> %GITHUB_ENV%
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Populate binary env
- name: Enable long paths
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
git config --system core.longpaths true
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
@ -515,10 +453,6 @@ jobs:
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_python.bat"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Rust
shell: cmd
run: |

View File

@ -27,15 +27,15 @@ jobs:
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
linux-focal-cuda12_6-py3_10-gcc9-inductor-micro-benchmark-build:
name: cuda12.6-py3.10-gcc9-sm80
build:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs:
- get-default-label-prefix
with:
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
@ -43,13 +43,13 @@ jobs:
]}
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc9-inductor-micro-benchmark-test:
name: cuda12.6-py3.10-gcc9-sm80
test:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_6-py3_10-gcc9-inductor-micro-benchmark-build
needs: build
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-micro-benchmark-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-micro-benchmark-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image: ${{ needs.build.outputs.docker-image }}
test-matrix: ${{ needs.build.outputs.test-matrix }}
timeout-minutes: 720
secrets: inherit

View File

@ -24,15 +24,15 @@ jobs:
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
linux-focal-cuda12_6-py3_10-gcc9-inductor-build:
name: cuda12.6-py3.10-gcc9-sm80
build:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs:
- get-default-label-prefix
with:
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
@ -43,14 +43,14 @@ jobs:
]}
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc9-inductor-test:
name: cuda12.6-py3.10-gcc9-sm80
test:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_6-py3_10-gcc9-inductor-build
needs: build
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image: ${{ needs.build.outputs.docker-image }}
test-matrix: ${{ needs.build.outputs.test-matrix }}
# disable monitor in perf tests for more investigation
disable-monitor: false
monitor-log-interval: 15

View File

@ -79,13 +79,13 @@ jobs:
# NB: Keep this in sync with trunk.yml
build:
name: cuda12.6-py3.10-gcc9-sm90
name: cuda12.8-py3.10-gcc9-sm90
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm90
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm90
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '9.0'
test-matrix: |
{ include: [
@ -111,12 +111,12 @@ jobs:
secrets: inherit
test-nightly:
name: cuda12.6-py3.10-gcc9-sm90
name: cuda12.8-py3.10-gcc9-sm90
uses: ./.github/workflows/_linux-test.yml
needs: build
if: github.event.schedule == '0 7 * * 1-6'
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm90
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm90
dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-cppwrapper-true-aotinductor-true-freezing_cudagraphs-true-cudagraphs_low_precision-true
docker-image: ${{ needs.build.outputs.docker-image }}
test-matrix: ${{ needs.build.outputs.test-matrix }}
@ -128,12 +128,12 @@ jobs:
secrets: inherit
test-weekly:
name: cuda12.6-py3.10-gcc9-sm90
name: cuda12.8-py3.10-gcc9-sm90
uses: ./.github/workflows/_linux-test.yml
needs: build
if: github.event.schedule == '0 7 * * 0'
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm90
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm90
dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-cppwrapper-true-aotinductor-true-freezing_cudagraphs-true-maxautotune-true-freeze_autotune_cudagraphs-true-cudagraphs_low_precision-true
docker-image: ${{ needs.build.outputs.docker-image }}
test-matrix: ${{ needs.build.outputs.test-matrix }}
@ -145,12 +145,12 @@ jobs:
secrets: inherit
test:
name: cuda12.6-py3.10-gcc9-sm90
name: cuda12.8-py3.10-gcc9-sm90
uses: ./.github/workflows/_linux-test.yml
needs: build
if: github.event_name == 'workflow_dispatch'
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm90
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm90
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-cudagraphs-${{ inputs.cudagraphs }}-cppwrapper-${{ inputs.cppwrapper }}-aotinductor-${{ inputs.aotinductor }}-maxautotune-${{ inputs.maxautotune }}-freezing_cudagraphs-${{ inputs.freezing_cudagraphs }}-cudagraphs_low_precision-${{ inputs.cudagraphs }}
docker-image: ${{ needs.build.outputs.docker-image }}
test-matrix: ${{ needs.build.outputs.test-matrix }}

View File

@ -42,7 +42,7 @@ jobs:
runner-type: macos-m1-stable
build-generates-artifacts: true
# To match the one pre-installed in the m1 runners
python-version: 3.9.12
python-version: 3.12.7
test-matrix: |
{ include: [
{ config: "perf_smoketest", shard: 1, num_shards: 1, runner: "macos-m2-15" },
@ -56,8 +56,10 @@ jobs:
with:
build-environment: macos-py3-arm64-distributed
# Same as the build job
python-version: 3.9.12
python-version: 3.12.7
test-matrix: ${{ needs.macos-perf-py3-arm64-build.outputs.test-matrix }}
# disable monitor in perf tests for more investigation
disable-monitor: true
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4
secrets: inherit

View File

@ -78,14 +78,14 @@ jobs:
opt_out_experiments: lf
# NB: Keep this in sync with trunk.yml
linux-focal-cuda12_6-py3_10-gcc9-inductor-build:
name: cuda12.6-py3.10-gcc9-sm80
build:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
@ -112,32 +112,32 @@ jobs:
selected-test-configs: ${{ inputs.benchmark_configs }}
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc9-inductor-test-nightly:
name: cuda12.6-py3.10-gcc9-sm80
test-nightly:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_6-py3_10-gcc9-inductor-build
needs: build
if: github.event.schedule == '0 7 * * 1-6'
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm80
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-cppwrapper-true-aotinductor-true-freezing_cudagraphs-true-cudagraphs_low_precision-true
docker-image: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.test-matrix }}
docker-image: ${{ needs.build.outputs.docker-image }}
test-matrix: ${{ needs.build.outputs.test-matrix }}
timeout-minutes: 720
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc9-inductor-test-weekly:
name: cuda12.6-py3.10-gcc9-sm80
test-weekly:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_6-py3_10-gcc9-inductor-build
needs: build
if: github.event.schedule == '0 7 * * 0'
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm80
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-cppwrapper-true-aotinductor-true-freezing_cudagraphs-true-maxautotune-true-freeze_autotune_cudagraphs-true-cudagraphs_low_precision-true
docker-image: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.test-matrix }}
docker-image: ${{ needs.build.outputs.docker-image }}
test-matrix: ${{ needs.build.outputs.test-matrix }}
timeout-minutes: 1440
# disable monitor in perf tests, next step is to enable it
disable-monitor: false
@ -145,16 +145,16 @@ jobs:
monitor-data-collect-interval: 4
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc9-inductor-test:
name: cuda12.6-py3.10-gcc9-sm80
test:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_6-py3_10-gcc9-inductor-build
needs: build
if: github.event_name == 'workflow_dispatch'
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm80
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-cudagraphs-${{ inputs.cudagraphs }}-cppwrapper-${{ inputs.cppwrapper }}-aotinductor-${{ inputs.aotinductor }}-maxautotune-${{ inputs.maxautotune }}-freezing_cudagraphs-${{ inputs.freezing_cudagraphs }}-cudagraphs_low_precision-${{ inputs.cudagraphs }}
docker-image: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.test-matrix }}
docker-image: ${{ needs.build.outputs.docker-image }}
test-matrix: ${{ needs.build.outputs.test-matrix }}
timeout-minutes: 720
disable-monitor: false
monitor-log-interval: 15

View File

@ -29,14 +29,14 @@ jobs:
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
linux-focal-cuda12_6-py3_10-gcc9-periodic-dynamo-benchmarks-build:
name: cuda12.6-py3.10-gcc9-sm86-periodic-dynamo-benchmarks
linux-jammy-cuda12_8-py3_10-gcc9-periodic-dynamo-benchmarks-build:
name: cuda12.8-py3.10-gcc9-sm86-periodic-dynamo-benchmarks
uses: ./.github/workflows/_linux-build.yml
needs: get-default-label-prefix
with:
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
@ -58,14 +58,14 @@ jobs:
]}
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc9-periodic-dynamo-benchmarks-test:
name: cuda12.6-py3.10-gcc9-sm86-periodic-dynamo-benchmarks
linux-jammy-cuda12_8-py3_10-gcc9-periodic-dynamo-benchmarks-test:
name: cuda12.8-py3.10-gcc9-sm86-periodic-dynamo-benchmarks
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_6-py3_10-gcc9-periodic-dynamo-benchmarks-build
needs: linux-jammy-cuda12_8-py3_10-gcc9-periodic-dynamo-benchmarks-build
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-periodic-dynamo-benchmarks-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-periodic-dynamo-benchmarks-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-periodic-dynamo-benchmarks-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-periodic-dynamo-benchmarks-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-rocm-py3_10-periodic-dynamo-benchmarks-build:
@ -109,15 +109,15 @@ jobs:
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-periodic-dynamo-benchmarks-build.outputs.test-matrix }}
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc9-inductor-build-gcp:
name: cuda12.6-py3.10-gcc9-sm80
linux-jammy-cuda12_8-py3_10-gcc9-inductor-smoke-build:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs:
- get-default-label-prefix
with:
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
@ -125,14 +125,14 @@ jobs:
]}
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc9-inductor-test-gcp:
name: cuda12.6-py3.10-gcc9-sm80
linux-jammy-cuda12_8-py3_10-gcc9-inductor-smoke-test:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_6-py3_10-gcc9-inductor-build-gcp
needs: linux-jammy-cuda12_8-py3_10-gcc9-inductor-smoke-build
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build-gcp.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build-gcp.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-inductor-smoke-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-inductor-smoke-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-periodic-dynamo-benchmarks-build:
@ -170,16 +170,16 @@ jobs:
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc9-inductor-build:
name: cuda12.6-py3.10-gcc9-sm86
linux-jammy-cuda12_8-py3_10-gcc9-inductor-build:
name: cuda12.8-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
needs: get-default-label-prefix
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
sync-tag: linux-focal-cuda12_6-py3_10-gcc9-inductor-build
sync-tag: linux-jammy-cuda12_8-py3_10-gcc9-inductor-build
test-matrix: |
{ include: [
{ config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
@ -195,14 +195,14 @@ jobs:
]}
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc9-inductor-test:
name: cuda12.6-py3.10-gcc9-sm86
linux-jammy-cuda12_8-py3_10-gcc9-inductor-test:
name: cuda12.8-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_6-py3_10-gcc9-inductor-build
needs: linux-jammy-cuda12_8-py3_10-gcc9-inductor-build
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-inductor-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-inductor-build:

View File

@ -38,12 +38,12 @@ jobs:
opt_out_experiments: lf
linux-jammy-rocm-py3_10-inductor-build:
name: rocm-py3.10-inductor
name: rocm-py3.10-inductor-mi300
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-rocm-py3.10
build-environment: linux-jammy-rocm-py3.10-mi300
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
test-matrix: |
{ include: [
@ -56,11 +56,11 @@ jobs:
permissions:
id-token: write
contents: read
name: rocm-py3.10-inductor
name: rocm-py3.10-inductor-mi300
uses: ./.github/workflows/_rocm-test.yml
needs: linux-jammy-rocm-py3_10-inductor-build
with:
build-environment: linux-jammy-rocm-py3.10
build-environment: linux-jammy-rocm-py3.10-mi300
docker-image: ${{ needs.linux-jammy-rocm-py3_10-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-inductor-build.outputs.test-matrix }}
secrets: inherit

View File

@ -43,6 +43,7 @@ jobs:
{ config: "inductor", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.2" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.2" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-jammy-rocm-py3_10-inductor-test:

View File

@ -26,13 +26,13 @@ jobs:
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
linux-focal-cuda12_6-py3_10-gcc9-inductor-build:
linux-jammy-cuda12_6-py3_10-gcc9-inductor-build:
name: cuda12.6-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.6-py3.10-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
test-matrix: |
@ -43,25 +43,26 @@ jobs:
{ config: "inductor_cpp_wrapper", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_cpp_wrapper", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc9-inductor-test:
linux-jammy-cuda12_6-py3_10-gcc9-inductor-test:
name: cuda12.6-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_6-py3_10-gcc9-inductor-build
needs: linux-jammy-cuda12_6-py3_10-gcc9-inductor-build
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.6-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-jammy-cuda12_6-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_6-py3_10-gcc9-inductor-build.outputs.test-matrix }}
secrets: inherit
linux-focal-cuda12_6-py3_12-gcc9-inductor-build:
linux-jammy-cuda12_6-py3_12-gcc9-inductor-build:
name: cuda12.6-py3.12-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: linux-focal-cuda12.6-py3.12-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.6-py3.12-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
test-matrix: |
@ -69,16 +70,17 @@ jobs:
{ config: "inductor", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-focal-cuda12_6-py3_12-gcc9-inductor-test:
linux-jammy-cuda12_6-py3_12-gcc9-inductor-test:
name: cuda12.6-py3.12-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_6-py3_12-gcc9-inductor-build
needs: linux-jammy-cuda12_6-py3_12-gcc9-inductor-build
with:
build-environment: linux-focal-cuda12.6-py3.12-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_6-py3_12-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_12-gcc9-inductor-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.6-py3.12-gcc9-sm86
docker-image: ${{ needs.linux-jammy-cuda12_6-py3_12-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_6-py3_12-gcc9-inductor-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cpu-py3_12-inductor-halide-build:
@ -93,6 +95,7 @@ jobs:
{ include: [
{ config: "inductor-halide", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-jammy-cpu-py3_12-inductor-halide-test:
@ -117,6 +120,7 @@ jobs:
{ include: [
{ config: "inductor-triton-cpu", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-jammy-cpu-py3_12-inductor-triton-cpu-test:
@ -144,6 +148,7 @@ jobs:
{ config: "inductor_avx2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.10xlarge.avx2" },
{ config: "inductor_avx2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.10xlarge.avx2" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-inductor-test:
@ -156,27 +161,28 @@ jobs:
test-matrix: ${{ needs.linux-jammy-cpu-py3_9-gcc11-inductor-build.outputs.test-matrix }}
secrets: inherit
linux-focal-cuda12_6-py3_13-gcc9-inductor-build:
linux-jammy-cuda12_6-py3_13-gcc9-inductor-build:
name: cuda12.6-py3.13-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: linux-focal-cuda12.6-py3.13-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.6-py3.13-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-focal-cuda12_6-py3_13-gcc9-inductor-test:
linux-jammy-cuda12_6-py3_13-gcc9-inductor-test:
name: cuda12.6-py3.13-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_6-py3_13-gcc9-inductor-build
needs: linux-jammy-cuda12_6-py3_13-gcc9-inductor-build
with:
build-environment: linux-focal-cuda12.6-py3.13-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_6-py3_13-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_13-gcc9-inductor-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.6-py3.13-gcc9-sm86
docker-image: ${{ needs.linux-jammy-cuda12_6-py3_13-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_6-py3_13-gcc9-inductor-build.outputs.test-matrix }}
secrets: inherit

View File

@ -42,16 +42,16 @@ jobs:
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
linux-focal-cuda12_6-py3_10-gcc9-inductor-build:
name: cuda12.6-py3.10-gcc9-sm86
linux-jammy-cuda12_8-py3_10-gcc9-inductor-build:
name: cuda12.8-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
sync-tag: linux-focal-cuda12_6-py3_10-gcc9-inductor-build
sync-tag: linux-jammy-cuda12_8-py3_10-gcc9-inductor-build
test-matrix: |
{ include: [
{ config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
@ -60,16 +60,17 @@ jobs:
{ config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc9-inductor-test:
name: cuda12.6-py3.10-gcc9-sm86
linux-jammy-cuda12_8-py3_10-gcc9-inductor-test:
name: cuda12.8-py3.10-gcc9-sm86
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_6-py3_10-gcc9-inductor-build
needs: linux-jammy-cuda12_8-py3_10-gcc9-inductor-build
with:
build-environment: linux-focal-cuda12.6-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_10-gcc9-inductor-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm86
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-inductor-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-inductor-build:
@ -92,6 +93,7 @@ jobs:
{ config: "dynamic_cpu_inductor_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.8xlarge.amx" },
{ config: "inductor_torchbench_cpu_smoketest_perf", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.24xl.spr-metal" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-inductor-test:

View File

@ -283,6 +283,15 @@ jobs:
# All we need to see is that it passes
python3 torch/utils/collect_env.py
link-check:
name: Link checks
needs: get-label-type
uses: ./.github/workflows/_link_check.yml
with:
runner: ${{ needs.get-label-type.outputs.label-type }}
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
secrets: inherit
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

View File

@ -23,7 +23,7 @@ jobs:
runner-type: macos-m1-stable
build-generates-artifacts: true
# To match the one pre-installed in the m1 runners
python-version: 3.9.12
python-version: 3.12.7
# The runner macos-m2-14 is not a typo, it's a custom runner that is different
# than our AWS macos-m1-14 runners
test-matrix: |
@ -42,6 +42,7 @@ jobs:
sync-tag: macos-py3-arm64-mps-test
build-environment: macos-py3-arm64
# Same as the build job
python-version: 3.9.12
python-version: 3.12.7
test-matrix: ${{ needs.macos-py3-arm64-build.outputs.test-matrix }}
disable-monitor: false
secrets: inherit

View File

@ -34,7 +34,6 @@ jobs:
with:
runner: ${{ needs.get-label-type.outputs.label-type }}
ref: ${{ github.sha }}
run-url-lint: true
secrets: inherit
docs-build:

View File

@ -50,12 +50,12 @@ jobs:
curr_ref_type: ${{ github.ref_type }}
linux-jammy-rocm-py3_10-build:
name: linux-jammy-rocm-py3.10
name: linux-jammy-rocm-py3.10-mi300
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-rocm-py3.10
build-environment: linux-jammy-rocm-py3.10-mi300
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
test-matrix: |
{ include: [
@ -69,13 +69,13 @@ jobs:
permissions:
id-token: write
contents: read
name: linux-jammy-rocm-py3.10
name: linux-jammy-rocm-py3.10-mi300
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-jammy-rocm-py3_10-build
- target-determination
with:
build-environment: linux-jammy-rocm-py3.10
build-environment: linux-jammy-rocm-py3.10-mi300
docker-image: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.test-matrix }}
secrets: inherit

View File

@ -138,6 +138,7 @@ jobs:
{ config: "default", shard: 6, num_shards: 6, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
]}
sync-tag: asan-build
allow-reuse-old-whl: true
secrets: inherit
@ -202,6 +203,7 @@ jobs:
{ config: "dynamo_wrapped", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-focal-py3_9-clang10-test:
@ -237,6 +239,7 @@ jobs:
{ config: "dynamo_wrapped", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-focal-py3_13-clang10-test:
@ -250,14 +253,14 @@ jobs:
timeout-minutes: 600
secrets: inherit
linux-focal-cuda11_8-py3_10-gcc9-build:
name: linux-focal-cuda11.8-py3.10-gcc9
linux-focal-cuda12_6-py3_10-gcc11-build-distributed:
name: linux-focal-cuda12.6-py3.10-gcc11-build-distributed
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda11.8-py3.10-gcc9
docker-image-name: ci-image:pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
build-environment: linux-focal-cuda12.6-py3.10-gcc11-distributed
docker-image-name: ci-image:pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11
cuda-arch-list: '7.5'
test-matrix: |
{ include: [
@ -267,17 +270,17 @@ jobs:
]}
secrets: inherit
linux-focal-cuda11_8-py3_10-gcc9-test:
name: linux-focal-cuda11.8-py3.10-gcc9
linux-focal-cuda12_6-py3_10-gcc11-test-distributed:
name: linux-focal-cuda12.6-py3.10-gcc11-test
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda11_8-py3_10-gcc9-build
- linux-focal-cuda12_6-py3_10-gcc11-build-distributed
- target-determination
with:
timeout-minutes: 360
build-environment: linux-focal-cuda11.8-py3.10-gcc9
docker-image: ${{ needs.linux-focal-cuda11_8-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda11_8-py3_10-gcc9-build.outputs.test-matrix }}
build-environment: linux-focal-cuda12.6-py3.10-gcc11-distributed
docker-image: ${{ needs.linux-focal-cuda12_6-py3_10-gcc11-build-distributed.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_6-py3_10-gcc11-build-distributed.outputs.test-matrix }}
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc11-build:
@ -296,6 +299,7 @@ jobs:
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc11-test:
@ -364,25 +368,6 @@ jobs:
test-matrix: ${{ needs.linux-focal-py3_9-clang9-xla-build.outputs.test-matrix }}
secrets: inherit
win-vs2022-cpu-py3-build:
# don't run build twice on main
if: github.event_name == 'pull_request'
name: win-vs2022-cpu-py3
uses: ./.github/workflows/_win-build.yml
needs: get-label-type
with:
build-environment: win-vs2022-cpu-py3
cuda-version: cpu
sync-tag: win-cpu-build
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
]}
secrets: inherit
linux-focal-cpu-py3_10-gcc11-bazel-test:
name: linux-focal-cpu-py3.10-gcc11-bazel-test
uses: ./.github/workflows/_bazel-build-test.yml
@ -449,6 +434,7 @@ jobs:
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
]}
allow-reuse-old-whl: true
secrets: inherit
unstable-linux-focal-cuda12_6-py3_10-gcc11-sm89-build-xfail:
@ -469,6 +455,7 @@ jobs:
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-focal-cuda12_6-py3_10-gcc11-sm89-test:
@ -507,29 +494,30 @@ jobs:
test-matrix: ${{ needs.linux-jammy-py3-clang12-executorch-build.outputs.test-matrix }}
secrets: inherit
linux-focal-cuda12_4-py3_10-gcc9-inductor-build:
name: cuda12.4-py3.10-gcc9-sm75
linux-jammy-cuda12_8-py3_10-gcc9-inductor-build:
name: cuda12.8-py3.10-gcc9-sm75
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm75
docker-image-name: ci-image:pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm75
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '7.5'
test-matrix: |
{ include: [
{ config: "pr_time_benchmarks", shard: 1, num_shards: 1, runner: "linux.g4dn.metal.nvidia.gpu" },
]}
allow-reuse-old-whl: true
secrets: inherit
linux-focal-cuda12_4-py3_10-gcc9-inductor-test:
name: cuda12.4-py3.10-gcc9-sm75
linux-jammy-cuda12_8-py3_10-gcc9-inductor-test:
name: cuda12.8-py3.10-gcc9-sm75
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_4-py3_10-gcc9-inductor-build
needs: linux-jammy-cuda12_8-py3_10-gcc9-inductor-build
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm75
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-inductor-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm75
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-inductor-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-xpu-2025_1-py3_9-build:

View File

@ -38,12 +38,12 @@ jobs:
linux-jammy-rocm-py3_10-build:
if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}
name: linux-jammy-rocm-py3.10
name: linux-jammy-rocm-py3.10-mi300
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-rocm-py3.10
build-environment: linux-jammy-rocm-py3.10-mi300
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
sync-tag: rocm-build
test-matrix: |
@ -61,13 +61,13 @@ jobs:
permissions:
id-token: write
contents: read
name: linux-jammy-rocm-py3.10
name: linux-jammy-rocm-py3.10-mi300
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-jammy-rocm-py3_10-build
- target-determination
with:
build-environment: linux-jammy-rocm-py3.10
build-environment: linux-jammy-rocm-py3.10-mi300
docker-image: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.test-matrix }}
secrets: inherit

View File

@ -148,6 +148,7 @@ jobs:
{ config: "slow", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
]}
sync-tag: asan-build
allow-reuse-old-whl: true
secrets: inherit
linux-jammy-py3_10-clang15-asan-test:

View File

@ -5,6 +5,8 @@ on:
paths:
- .github/workflows/test-h100.yml
workflow_dispatch:
schedule:
- cron: 0 4,10,16,22 * * * # every 6 hours
push:
tags:
- ciflow/h100/*

View File

@ -21,15 +21,15 @@ jobs:
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-focal-cuda12_4-py3_10-gcc9-torchbench-build-gcp:
name: cuda12.4-py3.10-gcc9-sm80
build:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs:
- get-default-label-prefix
with:
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
@ -37,12 +37,12 @@ jobs:
]}
secrets: inherit
linux-focal-cuda12_4-py3_10-gcc9-torchbench-test-gcp:
name: cuda12.4-py3.10-gcc9-sm80
test:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_4-py3_10-gcc9-torchbench-build-gcp
needs: build
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-torchbench-build-gcp.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_4-py3_10-gcc9-torchbench-build-gcp.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image: ${{ needs.build.outputs.docker-image }}
test-matrix: ${{ needs.build.outputs.test-matrix }}
secrets: inherit

View File

@ -86,7 +86,7 @@ jobs:
runner-type: macos-m1-stable
build-generates-artifacts: true
# To match the one pre-installed in the m1 runners
python-version: 3.9.12
python-version: 3.12.7
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "macos-m1-stable" },
@ -107,8 +107,9 @@ jobs:
with:
build-environment: macos-py3-arm64
# Same as the build job
python-version: 3.9.12
python-version: 3.12.7
test-matrix: ${{ needs.macos-py3-arm64-build.outputs.test-matrix }}
disable-monitor: false
secrets: inherit
win-vs2022-cpu-py3-build:
@ -118,7 +119,6 @@ jobs:
with:
build-environment: win-vs2022-cpu-py3
cuda-version: cpu
sync-tag: win-cpu-build
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
test-matrix: |
{ include: [
@ -187,13 +187,13 @@ jobs:
secrets: inherit
# NB: Keep this in sync with inductor-perf-test-nightly.yml
linux-focal-cuda12_4-py3_10-gcc9-inductor-build:
name: cuda12.4-py3.10-gcc9-sm80
linux-jammy-cuda12_8-py3_10-gcc9-inductor-build:
name: cuda12.8-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: linux-focal-cuda12.4-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
secrets: inherit

1
.gitignore vendored
View File

@ -47,6 +47,7 @@ docs/source/generated/
docs/source/compile/generated/
log
usage_log.txt
usage_log*
test-reports/
test/*.bak
test/**/*.bak

View File

@ -152,12 +152,12 @@ init_command = [
'numpy==1.26.4 ; python_version >= "3.9" and python_version <= "3.11"',
'numpy==2.1.0 ; python_version >= "3.12"',
'expecttest==0.3.0',
'mypy==1.14.0',
'mypy==1.15.0',
'sympy==1.13.3',
'types-requests==2.27.25',
'types-PyYAML==6.0.7',
'types-tabulate==0.8.8',
'types-protobuf==3.19.18',
'types-protobuf==5.29.1.20250403',
'types-pkg-resources==0.1.3',
'types-Jinja2==2.11.9',
'types-colorama==0.4.6',
@ -1160,12 +1160,6 @@ exclude_patterns = [
'torch/_inductor/autoheuristic/artifacts/**',
# These files are all grandfathered in, feel free to remove from this list
# as necessary
'test/_nvfuser/__init__.py',
'test/_nvfuser/test_dynamo.py',
'test/_nvfuser/test_python_frontend.py',
'test/_nvfuser/test_torchscript.py',
'test/delete.py',
'test/expect/__init__.py',
'test/quantization/__init__.py',
'test/quantization/core/__init__.py',
'test/quantization/core/experimental/apot_fx_graph_mode_ptq.py',
@ -1193,8 +1187,6 @@ exclude_patterns = [
'test/quantization/fx/test_numeric_suite_fx.py',
'test/quantization/fx/test_quantize_fx.py',
'test/quantization/fx/test_subgraph_rewriter.py',
'test/test_fake_tensor.py',
'test/test_flop_counter.py',
'test/test_function_schema.py',
'test/test_functional_autograd_benchmark.py',
'test/test_functional_optim.py',
@ -1324,13 +1316,6 @@ exclude_patterns = [
'torch/_export/passes/const_prop_pass.py',
'torch/_export/passes/functionalize_side_effectful_ops_pass.py',
'torch/_export/passes/replace_sym_size_ops_pass.py',
'torch/_export/passes/replace_view_ops_with_view_copy_ops_pass.py',
'torch/_export/serde/__init__.py',
'torch/_export/serde/schema.py',
'torch/_export/serde/serialize.py',
'torch/_export/serde/upgrade.py',
'torch/_export/trace.py',
'torch/_export/verifier.py',
'torch/testing/_internal/__init__.py',
'torch/testing/_internal/autocast_test_lists.py',
'torch/testing/_internal/autograd_function_db.py',
@ -1447,7 +1432,6 @@ exclude_patterns = [
'torch/utils/throughput_benchmark.py',
'torch/utils/viz/__init__.py',
'torch/utils/viz/_cycles.py',
'torch/utils/weak.py',
]
init_command = [
'python3',
@ -1521,7 +1505,7 @@ code = 'RUFF'
include_patterns = [
'**/*.py',
'**/*.pyi',
'torch/utils/data/*.ipynb',
'**/*.ipynb',
'pyproject.toml',
]
exclude_patterns = [
@ -1552,7 +1536,7 @@ init_command = [
]
is_formatter = true
# This linter prevents merge conlicts in csv files in pytorch by enforcing
# This linter prevents merge conflicts in csv files in pytorch by enforcing
# three lines of whitespace between entries such that unless people are modifying
# the same line, merge conflicts should not arise in git or hg
[[linter]]
@ -1736,3 +1720,15 @@ command = [
include_patterns = [
'test/**/test_*.py',
]
# 'header_only_linter' reports on properly testing header-only APIs.
[[linter]]
code = 'HEADER_ONLY_LINTER'
command = [
'python3',
'tools/linter/adapters/header_only_linter.py',
]
include_patterns = [
'torch/header_only_apis.txt',
]
is_formatter = false

View File

@ -985,12 +985,11 @@ endif()
include(cmake/public/utils.cmake)
if(NOT MSVC)
string(APPEND CMAKE_CXX_FLAGS " -O2 -fPIC")
if(NOT USE_XPU)
# This prevents use of `c10::optional`, `c10::nullopt` etc within the codebase
string(APPEND CMAKE_CXX_FLAGS " -DC10_NODEPRECATED")
string(APPEND CMAKE_CUDA_FLAGS " -DC10_NODEPRECATED")
string(APPEND CMAKE_OBJCXX_FLAGS " -DC10_NODEPRECATED")
endif()
# This prevents use of `c10::optional`, `c10::nullopt` etc within the codebase
string(APPEND CMAKE_CXX_FLAGS " -DC10_NODEPRECATED")
string(APPEND CMAKE_CUDA_FLAGS " -DC10_NODEPRECATED")
string(APPEND CMAKE_OBJCXX_FLAGS " -DC10_NODEPRECATED")
# Eigen fails to build with some versions, so convert this to a warning
# Details at http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1459

View File

@ -14,6 +14,7 @@
/torch/csrc/autograd/ @albanD @soulitzer
/torch/autograd/ @albanD @soulitzer
/tools/autograd/ @albanD @soulitzer
/torch/header_only_apis.txt @janeyx99
/torch/nn/ @albanD @jbschlosser @mikaylagawarecki
/torch/optim/ @albanD @janeyx99
/test/test_public_bindings.py @albanD
@ -21,6 +22,7 @@
/test/forward_backward_compatibility/check_forward_backward_compatibility.py @larryliu0820
/docs/source/conf.py @albanD
/aten/src/ATen/native/tags.yaml @ezyang
/.github/merge_rules.yaml @albanD @malfet
# Architecture Optimization (quantization, sparsity, etc.)
/aten/src/ATen/native/ao_sparse @salilsdesai @kimishpatel @digantdesai @jianyuh
@ -49,12 +51,12 @@ nn/qat/ @jerryzh168
/torch/csrc/distributed/c10d/Ops.* @kwen2501
# ONNX Export
/torch/_dynamo/backends/onnxrt.py @wschin @xadupre
/torch/csrc/jit/passes/onnx.h @titaiwangms @shubhambhokare1 @xadupre
/torch/csrc/jit/passes/onnx.cpp @titaiwangms @shubhambhokare1 @xadupre
/torch/csrc/jit/passes/onnx/ @titaiwangms @shubhambhokare1 @xadupre
/torch/onnx/ @titaiwangms @shubhambhokare1 @justinchuby @wschin @xadupre
/test/onnx/ @titaiwangms @shubhambhokare1 @justinchuby @wschin @xadupre
/torch/_dynamo/backends/onnxrt.py @wschin
/torch/csrc/jit/passes/onnx.h @titaiwangms @shubhambhokare1
/torch/csrc/jit/passes/onnx.cpp @titaiwangms @shubhambhokare1
/torch/csrc/jit/passes/onnx/ @titaiwangms @shubhambhokare1
/torch/onnx/ @titaiwangms @shubhambhokare1 @justinchuby @wschin
/test/onnx/ @titaiwangms @shubhambhokare1 @justinchuby @wschin
# CI
/.ci @pytorch/pytorch-dev-infra

View File

@ -112,8 +112,7 @@ source venv/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
lazy.)
```bash
conda uninstall pytorch -y
yes | pip uninstall torch
pip uninstall torch
```
Next run `python setup.py clean`. After that, you can install in `develop` mode again.
@ -180,14 +179,6 @@ You can use this script to check out a new nightly branch with the following:
source venv/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
```
Or if you would like to re-use an existing conda environment, you can pass in
the prefix argument (`--prefix`):
```bash
./tools/nightly.py checkout -b my-nightly-branch -p my-env
source my-env/bin/activate # or `& .\my-env\Scripts\Activate.ps1` on Windows
```
To install the nightly binaries built with CUDA, you can pass in the flag `--cuda`:
```bash
@ -289,7 +280,7 @@ dependencies as well as the nightly binaries into the repo directory.
### Python Unit Testing
**Prerequisites**:
The following packages should be installed with either `conda` or `pip`:
The following packages should be installed with `pip`:
- `expecttest` and `hypothesis` - required to run tests
- `mypy` - recommended for linting
- `pytest` - recommended to run tests more selectively
@ -497,8 +488,7 @@ pip install -r requirements.txt
# Or if you prefer an uncontaminated global executable environment or do not want to go through the node configuration:
# npm install katex && export PATH="$PATH:$(pwd)/node_modules/.bin"
```
> Note: if you installed `nodejs` with a different package manager (e.g.,
`conda`) then `npm` will probably install a version of `katex` that is not
> Note: if you installed `nodejs` with a different package manager then `npm` will probably install a version of `katex` that is not
compatible with your version of `nodejs` and doc builds will fail.
A combination of versions that is known to work is `node@6.13.1` and
`katex@0.13.18`. To install the latter with `npm` you can run
@ -670,13 +660,13 @@ you run `import torch` anywhere else, the development version will be
used).
If you want to manage multiple builds of PyTorch, you can make use of
[conda environments](https://conda.io/docs/using/envs.html) to maintain
[venv environments](https://docs.python.org/3/library/venv.html) to maintain
separate Python package environments, each of which can be tied to a
specific build of PyTorch. To set one up:
```bash
conda create -n pytorch-myfeature
source activate pytorch-myfeature
python -m venv pytorch-myfeature
source pytorch-myfeature/bin/activate # or `& .\pytorch-myfeature\Scripts\Activate.ps1` on Windows
# if you run python now, torch will NOT be installed
python setup.py develop
```
@ -754,7 +744,6 @@ same. Using ccache in a situation like this is a real time-saver.
Before building pytorch, install ccache from your package manager of choice:
```bash
conda install ccache -c conda-forge
sudo apt install ccache
sudo yum install ccache
brew install ccache
@ -1046,8 +1035,7 @@ than Linux, which are worth keeping in mind when fixing these problems.
3. If you have a Windows box (we have a few on EC2 which you can request access to) and
you want to run the build, the easiest way is to just run `.ci/pytorch/win-build.sh`.
If you need to rebuild, run `REBUILD=1 .ci/pytorch/win-build.sh` (this will avoid
blowing away your Conda environment.)
If you need to rebuild, run `REBUILD=1 .ci/pytorch/win-build.sh`.
Even if you don't know anything about MSVC, you can use cmake to build simple programs on
Windows; this can be helpful if you want to learn more about some peculiar linking behavior
@ -1264,7 +1252,7 @@ in the meantime there will be some separation.
There are a few "unusual" directories which, for historical reasons,
are Caffe2/PyTorch specific. Here they are:
- `CMakeLists.txt`, `Makefile`, `binaries`, `cmake`, `conda`, `modules`,
- `CMakeLists.txt`, `Makefile`, `binaries`, `cmake`, `modules`,
`scripts` are Caffe2-specific. Don't put PyTorch code in them without
extra coordination.

View File

@ -373,8 +373,9 @@ The patch release process takes around 4-5 weeks to complete.
* Should the new patch release be created?
* Timeline execution for the patch release
3. Cherry picking phase starts after the decision is made to create a patch release. At this point, a new release tracker for the patch release is created, and an announcement will be made on official channels [example announcement](https://dev-discuss.pytorch.org/t/pytorch-release-2-0-1-important-information/1176). The authors of the fixes to regressions will be asked to create their own cherry picks. This process normally takes 2 weeks.
4. Building Binaries, Promotion to Stable and testing. After all cherry picks have been merged, Release Managers trigger a new build and produce a new release candidate. An announcement is made on the official channel about the RC availability at this point. This process normally takes 2 weeks.
5. General Availability
4. Updating `version.txt` in the release branch to match expected patch release version, see https://github.com/pytorch/pytorch/commit/f77213d3dae5d103a39cdaf93f21863843571e8d as an example
5. Building Binaries, Promotion to Stable and testing. After all cherry picks have been merged, Release Managers trigger a new build and produce a new release candidate. An announcement is made on the official channel about the RC availability at this point. This process normally takes 2 weeks.
6. General Availability
### Triage

View File

@ -144,8 +144,8 @@ new_local_repository(
new_local_repository(
name = "asmjit",
build_file = "//third_party:fbgemm/third_party/asmjit.BUILD",
path = "third_party/fbgemm/third_party/asmjit",
build_file = "//third_party:fbgemm/external/asmjit.BUILD",
path = "third_party/fbgemm/external/asmjit",
)
new_local_repository(
@ -184,6 +184,12 @@ new_local_repository(
path = "third_party/nlohmann",
)
new_local_repository(
name = "moodycamel",
build_file = "//third_party:moodycamel.BUILD",
path = "third_party/concurrentqueue",
)
new_local_repository(
name = "tensorpipe",
build_file = "//third_party:tensorpipe.BUILD",

View File

@ -2,7 +2,9 @@
## Demo applications and tutorials
Demo applications with code walk-through can be find in [this github repo](https://github.com/pytorch/android-demo-app).
Please refer to [pytorch-labs/executorch-examples](https://github.com/pytorch-labs/executorch-examples/tree/main/dl3/android/DeepLabV3Demo) for the Android demo app based on [ExecuTorch](https://github.com/pytorch/executorch).
Please join our [Discord](https://discord.com/channels/1334270993966825602/1349854760299270284) for any questions.
## Publishing
@ -119,8 +121,6 @@ We also have to add all transitive dependencies of our aars.
As `pytorch_android` [depends](https://github.com/pytorch/pytorch/blob/master/android/pytorch_android/build.gradle#L76-L77) on `'com.facebook.soloader:nativeloader:0.10.5'` and `'com.facebook.fbjni:fbjni-java-only:0.2.2'`, we need to add them.
(In case of using maven dependencies they are added automatically from `pom.xml`).
You can check out [test app example](https://github.com/pytorch/pytorch/blob/master/android/test_app/app/build.gradle) that uses aars directly.
## Linking to prebuilt libtorch library from gradle dependency
In some cases, you may want to use libtorch from your android native build.
@ -202,7 +202,7 @@ find_library(FBJNI_LIBRARY fbjni
NO_CMAKE_FIND_ROOT_PATH)
target_link_libraries(${PROJECT_NAME}
${PYTORCH_LIBRARY})
${PYTORCH_LIBRARY}
${FBJNI_LIBRARY})
```
@ -233,8 +233,6 @@ void loadAndForwardModel(const std::string& modelPath) {
To load torchscript model for mobile we need some special setup which is placed in `struct JITCallGuard` in this example. It may change in future, you can track the latest changes keeping an eye in our [pytorch android jni code]([https://github.com/pytorch/pytorch/blob/master/android/pytorch_android/src/main/cpp/pytorch_jni_jit.cpp#L28)
[Example of linking to libtorch from aar](https://github.com/pytorch/pytorch/tree/master/android/test_app)
## PyTorch Android API Javadoc
You can find more details about the PyTorch Android API in the [Javadoc](https://pytorch.org/javadoc/).

View File

@ -1,30 +0,0 @@
#!/bin/bash
set -eux
PYTORCH_DIR="$(cd $(dirname $0)/..; pwd -P)"
PYTORCH_ANDROID_DIR=$PYTORCH_DIR/android
echo "PYTORCH_DIR:$PYTORCH_DIR"
source "$PYTORCH_ANDROID_DIR/common.sh"
check_android_sdk
check_gradle
parse_abis_list "$@"
build_android
# To set proxy for gradle add following lines to ./gradle/gradle.properties:
# systemProp.http.proxyHost=...
# systemProp.http.proxyPort=8080
# systemProp.https.proxyHost=...
# systemProp.https.proxyPort=8080
if [ "$CUSTOM_ABIS_LIST" = true ]; then
NDK_DEBUG=1 $GRADLE_PATH -PnativeLibsDoNotStrip=true -PABI_FILTERS=$ABIS_LIST -p $PYTORCH_ANDROID_DIR clean test_app:assembleDebug
else
NDK_DEBUG=1 $GRADLE_PATH -PnativeLibsDoNotStrip=true -p $PYTORCH_ANDROID_DIR clean test_app:assembleDebug
fi
find $PYTORCH_ANDROID_DIR -type f -name *apk
find $PYTORCH_ANDROID_DIR -type f -name *apk | xargs echo "To install apk run: $ANDROID_HOME/platform-tools/adb install -r "

View File

@ -1,32 +0,0 @@
#!/bin/bash
###############################################################################
# This script tests the custom selective build flow for PyTorch Android, which
# optimizes library size by only including ops used by a specific model.
###############################################################################
set -eux
PYTORCH_DIR="$(cd $(dirname $0)/..; pwd -P)"
PYTORCH_ANDROID_DIR="${PYTORCH_DIR}/android"
BUILD_ROOT="${PYTORCH_DIR}/build_pytorch_android_custom"
source "${PYTORCH_ANDROID_DIR}/common.sh"
prepare_model_and_dump_root_ops() {
cd "${BUILD_ROOT}"
MODEL="${BUILD_ROOT}/MobileNetV2.pt"
ROOT_OPS="${BUILD_ROOT}/MobileNetV2.yaml"
python "${PYTORCH_ANDROID_DIR}/test_app/make_assets_custom.py"
cp "${MODEL}" "${PYTORCH_ANDROID_DIR}/test_app/app/src/main/assets/mobilenet2.pt"
}
# Start building
mkdir -p "${BUILD_ROOT}"
check_android_sdk
check_gradle
parse_abis_list "$@"
prepare_model_and_dump_root_ops
SELECTED_OP_LIST="${ROOT_OPS}" build_android
# TODO: change this to build test_app instead
$GRADLE_PATH -PABI_FILTERS=$ABIS_LIST -p $PYTORCH_ANDROID_DIR clean assembleRelease

View File

@ -3,4 +3,3 @@ include ':app', ':pytorch_android', ':pytorch_android_torchvision', ':pytorch_ho
project(':pytorch_android_torchvision').projectDir = file('pytorch_android_torchvision')
project(':pytorch_host').projectDir = file('pytorch_android/host')
project(':test_app').projectDir = file('test_app/app')

View File

@ -1,9 +0,0 @@
local.properties
**/*.iml
.gradle
gradlew*
gradle/wrapper
.idea/*
.DS_Store
build
.externalNativeBuild

View File

@ -1,38 +0,0 @@
cmake_minimum_required(VERSION 3.5)
set(PROJECT_NAME pytorch_testapp_jni)
project(${PROJECT_NAME} CXX)
set(CMAKE_CXX_STANDARD 17 CACHE STRING "The C++ standard whose features are requested to build this target.")
set(CMAKE_VERBOSE_MAKEFILE ON)
set(build_DIR ${CMAKE_SOURCE_DIR}/build)
set(pytorch_testapp_cpp_DIR ${CMAKE_CURRENT_LIST_DIR}/src/main/cpp)
message(STATUS "ANDROID_STL:${ANDROID_STL}")
file(GLOB pytorch_testapp_SOURCES
${pytorch_testapp_cpp_DIR}/pytorch_testapp_jni.cpp
)
add_library(${PROJECT_NAME} SHARED
${pytorch_testapp_SOURCES}
)
file(GLOB PYTORCH_INCLUDE_DIRS "${build_DIR}/pytorch_android*.aar/headers")
file(GLOB PYTORCH_LINK_DIRS "${build_DIR}/pytorch_android*.aar/jni/${ANDROID_ABI}")
target_compile_options(${PROJECT_NAME} PRIVATE
-fexceptions
)
set(BUILD_SUBDIR ${ANDROID_ABI})
target_include_directories(${PROJECT_NAME} PRIVATE
${PYTORCH_INCLUDE_DIRS}
)
find_library(PYTORCH_LIBRARY pytorch_jni
PATHS ${PYTORCH_LINK_DIRS}
NO_CMAKE_FIND_ROOT_PATH)
target_link_libraries(${PROJECT_NAME}
${PYTORCH_LIBRARY}
log)

View File

@ -1,190 +0,0 @@
apply plugin: 'com.android.application'
repositories {
jcenter()
maven {
url "https://oss.sonatype.org/content/repositories/snapshots"
}
flatDir {
dirs 'aars'
}
}
android {
configurations {
extractForNativeBuild
}
compileOptions {
sourceCompatibility 1.8
targetCompatibility 1.8
}
compileSdkVersion rootProject.compileSdkVersion
buildToolsVersion rootProject.buildToolsVersion
defaultConfig {
applicationId "org.pytorch.testapp"
minSdkVersion rootProject.minSdkVersion
targetSdkVersion rootProject.targetSdkVersion
versionCode 1
versionName "1.0"
ndk {
abiFilters ABI_FILTERS.split(",")
}
// Commented due to dependency on local copy of pytorch_android aar to aars folder
//externalNativeBuild {
// cmake {
// abiFilters ABI_FILTERS.split(",")
// arguments "-DANDROID_STL=c++_shared"
// }
//}
buildConfigField("String", "MODULE_ASSET_NAME", "\"mobilenet2q.pt\"")
buildConfigField("String", "LOGCAT_TAG", "@string/app_name")
buildConfigField("long[]", "INPUT_TENSOR_SHAPE", "new long[]{1, 3, 224, 224}")
buildConfigField("boolean", "NATIVE_BUILD", 'false')
buildConfigField("boolean", "USE_VULKAN_DEVICE", 'false')
buildConfigField(
"int",
"BUILD_LITE_INTERPRETER",
System.env.BUILD_LITE_INTERPRETER != null ? System.env.BUILD_LITE_INTERPRETER : "1"
)
addManifestPlaceholders([APP_NAME: "@string/app_name", MAIN_ACTIVITY: "org.pytorch.testapp.MainActivity"])
}
buildTypes {
debug {
minifyEnabled false
debuggable true
}
release {
minifyEnabled false
}
}
// Commented due to dependency on local copy of pytorch_android aar to aars folder
//externalNativeBuild {
// cmake {
// path "CMakeLists.txt"
// }
//}
flavorDimensions "model", "build", "activity"
productFlavors {
mnet {
dimension "model"
applicationIdSuffix ".mnet"
buildConfigField("String", "MODULE_ASSET_NAME", "\"mobilenet_v2.ptl\"")
addManifestPlaceholders([APP_NAME: "MNET"])
buildConfigField("String", "LOGCAT_TAG", "\"pytorch-mnet\"")
}
// NB: This is not working atm https://github.com/pytorch/pytorch/issues/102966
mnetVulkan {
dimension "model"
applicationIdSuffix ".mnet_vulkan"
buildConfigField("String", "MODULE_ASSET_NAME", "\"mobilenet_v2_vulkan.ptl\"")
buildConfigField("boolean", "USE_VULKAN_DEVICE", 'true')
addManifestPlaceholders([APP_NAME: "MNET_VULKAN"])
buildConfigField("String", "LOGCAT_TAG", "\"pytorch-mnet-vulkan\"")
}
resnet18 {
dimension "model"
applicationIdSuffix ".resnet18"
buildConfigField("String", "MODULE_ASSET_NAME", "\"resnet18.ptl\"")
addManifestPlaceholders([APP_NAME: "RN18"])
buildConfigField("String", "LOGCAT_TAG", "\"pytorch-resnet18\"")
}
local {
dimension "build"
}
nightly {
dimension "build"
}
aar {
dimension "build"
}
// Commented due to dependency on local copy of pytorch_android aar to aars folder
//nativeBuild {
// dimension "build"
// buildConfigField("boolean", "NATIVE_BUILD", "true")
//}
camera {
dimension "activity"
addManifestPlaceholders([MAIN_ACTIVITY: "org.pytorch.testapp.CameraActivity"])
}
base {
dimension "activity"
sourceSets {
main {
java {
exclude 'org/pytorch/testapp/CameraActivity.java'
}
}
}
}
}
packagingOptions {
doNotStrip '**.so'
}
// Filtering for CI
if (!testAppAllVariantsEnabled.toBoolean()) {
variantFilter { variant ->
def names = variant.flavors*.name
if (names.contains("nightly")
|| names.contains("camera")
|| names.contains("aar")
|| names.contains("nativeBuild")) {
setIgnore(true)
}
}
}
}
tasks.all { task ->
// Disable externalNativeBuild for all but nativeBuild variant
if (task.name.startsWith('externalNativeBuild')
&& !task.name.contains('NativeBuild')) {
task.enabled = false
}
}
dependencies {
implementation 'com.android.support:appcompat-v7:28.0.0'
implementation 'com.facebook.soloader:nativeloader:0.10.5'
localImplementation project(':pytorch_android')
localImplementation project(':pytorch_android_torchvision')
// Commented due to dependency on local copy of pytorch_android aar to aars folder
//nativeBuildImplementation(name: 'pytorch_android-release', ext: 'aar')
//nativeBuildImplementation(name: 'pytorch_android_torchvision-release', ext: 'aar')
//extractForNativeBuild(name: 'pytorch_android-release', ext: 'aar')
nightlyImplementation 'org.pytorch:pytorch_android:2.2.0-SNAPSHOT'
nightlyImplementation 'org.pytorch:pytorch_android_torchvision:2.2.0-SNAPSHOT'
aarImplementation(name:'pytorch_android', ext:'aar')
aarImplementation(name:'pytorch_android_torchvision', ext:'aar')
aarImplementation 'com.facebook.soloader:nativeloader:0.10.5'
aarImplementation 'com.facebook.fbjni:fbjni-java-only:0.2.2'
def camerax_version = "1.0.0-alpha05"
cameraImplementation "androidx.camera:camera-core:$camerax_version"
cameraImplementation "androidx.camera:camera-camera2:$camerax_version"
cameraImplementation 'com.google.android.material:material:1.0.0-beta01'
}
task extractAARForNativeBuild {
doLast {
configurations.extractForNativeBuild.files.each {
def file = it.absoluteFile
copy {
from zipTree(file)
into "$buildDir/$file.name"
include "headers/**"
include "jni/**"
}
}
}
}
tasks.whenTaskAdded { task ->
if (task.name.contains('externalNativeBuild')) {
task.dependsOn(extractAARForNativeBuild)
}
}

View File

@ -1,27 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
package="org.pytorch.testapp">
<application
android:allowBackup="true"
android:label="${APP_NAME}"
android:supportsRtl="true"
android:theme="@style/AppTheme">
<activity android:name="${MAIN_ACTIVITY}">
<intent-filter>
<action android:name="android.intent.action.MAIN" />
<category android:name="android.intent.category.LAUNCHER" />
</intent-filter>
</activity>
</application>
<uses-permission android:name="android.permission.CAMERA" />
<!--
Permissions required by the Snapdragon Profiler to collect GPU metrics.
-->
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE" />
</manifest>

View File

@ -1,3 +0,0 @@
*
*/
!.gitignore

Some files were not shown because too many files have changed in this diff Show More