Compare commits

168 Commits

Author SHA1 Message Date
3b9b4065af Leave ROCm alone for now
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-09-11 21:20:56 -07:00
e1f586a43e Install the correct torchao version
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-09-11 19:45:43 -07:00
18dc2e03ac Merge branch 'main' into install-torchao-0.13.0
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-09-10 23:01:07 -07:00
23170dfebc Revert "Move inductor jobs 3.9->3.10 (#162323)"
This reverts commit 0663bdb12383b9717af49d58aed9d88de0dd0ecc.

Reverted https://github.com/pytorch/pytorch/pull/162323 on behalf of https://github.com/huydhn due to Not sure what had happened, but some inductor unit tests start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/162323#issuecomment-3278125192))
2025-09-11 05:57:13 +00:00
12e993f533 compile_kernel large shared memory fix (#162647)
Alternate solution to https://github.com/pytorch/pytorch/pull/162328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162647
Approved by: https://github.com/eqy
2025-09-11 05:52:46 +00:00
07d2531672 [vllm hash update] update the pinned vllm hash (#162551)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162551
Approved by: https://github.com/pytorchbot
2025-09-11 04:56:04 +00:00
6944d4b639 [ROCm] rocblas Aten GEMM overload for FP32 output from FP16/BF16 inputs (#162600)
Fix ROCm GEMM helper to set output type (C/D) based on C_Dtype template parameter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162600
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-09-11 03:34:07 +00:00
f654cff566 [inductor] Add shape to load_input in matmul templates (#162513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162513
Approved by: https://github.com/eellison
ghstack dependencies: #162426
2025-09-11 01:51:15 +00:00
f17c5e0789 [inductor] Add shape for store_output in matmul templates (#162426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162426
Approved by: https://github.com/eellison
2025-09-11 01:51:15 +00:00
435c18fb4a [DTensor] add op support for aten.unbind.int (#162560)
As titled.

It seems unbind returns views of the original tensor. E.g. see https://stackoverflow.com/questions/78910951/does-unbind-return-the-views-of-tensors-in-pytorch

So we error out when `shard_dim == unbind_dim`. This is similar to why we error out in view ops.
https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_view_ops.py#L544-L546

This PR also refactors some other tensor ops code by creating two utility functions, `shift_shard_dims_after_insert` and `shift_shard_dims_after_remove`.
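
As a rough illustration of the idea behind a helper like `shift_shard_dims_after_remove`, here is a minimal sketch. It assumes shard placements are modeled as plain integers (`None` meaning replicated); the function name, signature, and error type below are hypothetical and are not taken from the DTensor code:

```python
from typing import List, Optional, Sequence


def shift_shard_dims_after_remove_sketch(
    shard_dims: Sequence[Optional[int]], removed_dim: int
) -> List[Optional[int]]:
    """Hypothetical stand-in: renumber shard dims once `removed_dim` is gone."""
    shifted: List[Optional[int]] = []
    for dim in shard_dims:
        if dim is None:
            # Replicated placement: nothing to renumber.
            shifted.append(None)
        elif dim == removed_dim:
            # Mirrors the rule above: sharding on the unbound dim is an error.
            raise ValueError(f"cannot remove dim {removed_dim}: tensor is sharded on it")
        elif dim > removed_dim:
            # Dims to the right of the removed one move left by one.
            shifted.append(dim - 1)
        else:
            shifted.append(dim)
    return shifted
```

For example, removing dim 1 from a tensor sharded on dims 0 and 2 leaves it sharded on dims 0 and 1.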

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162560
Approved by: https://github.com/zpcore
2025-09-11 00:58:23 +00:00
612cdc8f48 -ldl for nativert tests (#162643)
Fixes #162640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162643
Approved by: https://github.com/yiming0416, https://github.com/robert-hardwick
2025-09-11 00:35:57 +00:00
da5069f289 Don't include cuh header when USE_NVSHMEM is off (#162635)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162635
Approved by: https://github.com/kwen2501
2025-09-11 00:24:50 +00:00
4fd2a2b273 Add cuda headers automatically for compile_kernel (#162634)
Issue was pointed out before by @ngimel and more recently by https://gau-nernst.github.io/nvrtc-matmul/#missing-cuda-and-c-headers- by @gau-nernst

The benefit is that we can now add `#include <cuda_fp16.h>` without the compilation failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162634
Approved by: https://github.com/ngimel
2025-09-11 00:20:33 +00:00
bb1d53bc47 [CD] CUDA 13 specific followup changes (#162455)
Follow up for CUDA 13 bring up https://github.com/pytorch/pytorch/issues/159779
sm50-70 should not be added to the sbsa build arch list, as these older archs had no Arm support.
Remove platform_machine from PYTORCH_EXTRA_INSTALL_REQUIREMENTS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162455
Approved by: https://github.com/atalman
2025-09-11 00:03:47 +00:00
36338fc7f2 Relax fences for intrusive ptr's refcnt (#162072)
Summary: Relax fences for intrusive ptr's refcnt dec op for performance testing.

lock needs acquire ordering when the op succeeds and relaxed ordering when it does not. In addition, the expire call and the following refcnt reads were merged to remove one extra read.

incref does not need any fences because the caller should already have a valid reference. use_count follows the same reasoning.

decref only needs a release fence to make sure every write op prior to it has finished. When the refcnt goes to zero, there should be an acquire fence to make sure no read op reads stale data before the object is destructed. However, a microbenchmark showed that the optimal fence for decref does not perform noticeably better than the current decref with acq-rel, so we keep decref as-is.

This change should have no material impact on x86, but for Arm64 (and other CPUs with weak memory models), it should boost performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162072
Approved by: https://github.com/swolchok, https://github.com/yfeldblum
2025-09-10 23:17:01 +00:00
e0c910149c Build fbgemm_gpu for TORCH_CUDA_ARCH_LIST=10.0 and CUDA 12.8 and 12.9 (#162544)
## Summary
- pytorch is not built for *a variants of SM architectures, due to non-portability. However, we need fbgemm_gpu kernels built for sm100a (see #162209)

## Changes
- **Setting USE_FBGEMM_GENAI for CUDA builds**: fbgemm_gpu builds for sm100a if using CUDA 12.8 or 12.9 ([source](2033a0a08f/.github/scripts/nova_dir.bash (L29-L32))), so I follow the same rule here.
- **Extra nvcc flags**: if USE_FBGEMM_GENAI and USE_CUDA are set, we add extra nvcc flags for sm100a

## Test plan

Test build:
```
echo $CUDA_HOME
/usr/local/cuda-12.9

export TORCH_CUDA_ARCH_LIST=10.0
python -m pip install --no-build-isolation -v -e .
```

Check build logs:
```
  CMake Warning at CMakeLists.txt:901 (message):
    Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a
```

Run unit tests:
- `pytest test/test_matmul_cuda.py  -k test_mxfp8_scaled_grouped_mm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162544
Approved by: https://github.com/drisspg
2025-09-10 22:59:41 +00:00
f4aeceaa9d Use upper bound for persistent rblock (#162441)
Previously, we started at 128 and increased to the upper bound. Instead, we should start at the upper bound and round up to the next power of 2.
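
A minimal sketch of the rounding step, assuming the upper bound is a positive integer. The function names are hypothetical and this is not the inductor implementation:

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    if n <= 1:
        return 1
    return 1 << (n - 1).bit_length()


def persistent_rblock_sketch(upper_bound: int) -> int:
    # Hypothetical: start from the upper bound itself rather than a fixed 128,
    # then round up to the next power of two.
    return next_power_of_2(upper_bound)
```

With this sketch an upper bound of 300 gives 512, while an upper bound that is already a power of two is returned unchanged.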

Differential Revision: [D81984103](https://our.internmc.facebook.com/intern/diff/D81984103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162441
Approved by: https://github.com/PaulZhang12
2025-09-10 22:29:02 +00:00
d7c3d8a551 Merge branch 'main' into install-torchao-0.13.0
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-09-10 15:14:32 -07:00
d8e6b2fddc [Cutlass] Add exp and sigmoid activations (#162536)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162536
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #162535
2025-09-10 21:44:26 +00:00
31c25c7d01 [Cutlass] Add tanh activation and test case for activations (#162535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162535
Approved by: https://github.com/henrylhtsang
2025-09-10 21:44:26 +00:00
eqy
5dbee5691c [cuDNN][Convolution][TF32][64bit] Add tf32_on_and_off decorator to conv3d 64bit test (#161004)
cuDNN has new generated kernels that can use TF32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161004
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2025-09-10 21:39:35 +00:00
864ffe12d7 Fix some edge cases (#162295)
Summary:
```
🔝 Top 5 Performance Differences (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬─────────────────────────────┬───────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)       ┆ TFlops BWD (base) ┆ TFlops BWD (no_peel) ┆ no_peel_speedup_over_base ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                         ┆ ---               ┆ ---                  ┆ ---                       ┆ ---       │
│ str            ┆ str            ┆ str                         ┆ f64               ┆ f64                  ┆ f64                       ┆ f64       │
╞════════════════╪════════════════╪═════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════╪═══════════╡
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 56.937931         ┆ 58.960459            ┆ 1.035522                  ┆ 3.552163  │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 128) ┆ 89.221306         ┆ 86.295642            ┆ 0.967209                  ┆ -3.27911  │
│ causal         ┆ torch.bfloat16 ┆ (2, 16, 4096, 4, 4096, 128) ┆ 111.552594        ┆ 114.380841           ┆ 1.025353                  ┆ 2.535349  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, 1024, 64) ┆ 74.830149         ┆ 76.685445            ┆ 1.024793                  ┆ 2.479344  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 55.279932         ┆ 56.369312            ┆ 1.019707                  ┆ 1.97066   │
└────────────────┴────────────────┴─────────────────────────────┴───────────────────┴──────────────────────┴───────────────────────────┴───────────┘

🔺 Top 5 Cases Where no_peel (change) is Faster than base (baseline):
shape: (5, 7)
┌────────────────┬────────────────┬─────────────────────────────┬───────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)       ┆ TFlops BWD (base) ┆ TFlops BWD (no_peel) ┆ no_peel_speedup_over_base ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                         ┆ ---               ┆ ---                  ┆ ---                       ┆ ---       │
│ str            ┆ str            ┆ str                         ┆ f64               ┆ f64                  ┆ f64                       ┆ f64       │
╞════════════════╪════════════════╪═════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════╪═══════════╡
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 56.937931         ┆ 58.960459            ┆ 1.035522                  ┆ 3.552163  │
│ causal         ┆ torch.bfloat16 ┆ (2, 16, 4096, 4, 4096, 128) ┆ 111.552594        ┆ 114.380841           ┆ 1.025353                  ┆ 2.535349  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, 1024, 64) ┆ 74.830149         ┆ 76.685445            ┆ 1.024793                  ┆ 2.479344  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 55.279932         ┆ 56.369312            ┆ 1.019707                  ┆ 1.97066   │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 4096, 4, 4096, 64)  ┆ 111.08814         ┆ 112.447047           ┆ 1.012233                  ┆ 1.22327   │
└────────────────┴────────────────┴─────────────────────────────┴───────────────────┴──────────────────────┴───────────────────────────┴───────────┘

🔻 Top 5 Cases Where no_peel (change) is Slower than base (baseline):
shape: (5, 7)
┌────────────────┬────────────────┬─────────────────────────────┬───────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)       ┆ TFlops BWD (base) ┆ TFlops BWD (no_peel) ┆ no_peel_speedup_over_base ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                         ┆ ---               ┆ ---                  ┆ ---                       ┆ ---       │
│ str            ┆ str            ┆ str                         ┆ f64               ┆ f64                  ┆ f64                       ┆ f64       │
╞════════════════╪════════════════╪═════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 128) ┆ 89.221306         ┆ 86.295642            ┆ 0.967209                  ┆ -3.27911  │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 4, 1024, 64)  ┆ 78.23082          ┆ 76.693169            ┆ 0.980345                  ┆ -1.965531 │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 4, 2048, 128) ┆ 96.95663          ┆ 95.573333            ┆ 0.985733                  ┆ -1.426717 │
│ alibi          ┆ torch.bfloat16 ┆ (4, 16, 2048, 4, 2048, 64)  ┆ 93.373473         ┆ 92.294147            ┆ 0.988441                  ┆ -1.155924 │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 2048, 4, 2048, 128) ┆ 96.95147          ┆ 96.105389            ┆ 0.991273                  ┆ -0.872685 │
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162295
Approved by: https://github.com/mlazos, https://github.com/v0i0
2025-09-10 21:33:45 +00:00
4e35594674 [Lowering] Fix the edge case of empty subgraph split due to dataclass node (#161716)
Summary: Fix the edge case by allowing `call_function` nodes with no deps as graph entry (starter_nodes) in the splitter.

Test Plan:
The test shall pass in the current diff (after fix), and fail in the parent diff (before fix)

```
buck test mode/opt //glow/fb/fx/lowering:split_tests -- test_dataclass_as_graph_entry
```

Rollback Plan:

Differential Revision: D81232435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161716
Approved by: https://github.com/ezyang
2025-09-10 21:23:42 +00:00
35d7b32159 Improve device info with new flops and bandwidth formula based on hardware libraries (#162245)
Previously, DeviceInfo provided theoretical hardware information based on a hardcoded list manually created from various datasheets.

This update:
- Attempts to gather the information from a hardware library like `pynvml`, improving accuracy and expanding support to devices that don't have entries in the datasheet list.
- Adjusts flops and bw calculation based on these hardware values. For example, if the memory or SMs are underclocked, it adjusts the theoretical max flops/bw accordingly.
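
One way to picture the adjustment is a simple clock-ratio scaling. The sketch below assumes peak throughput scales linearly with clock frequency, which is a simplification; the function name and parameters are hypothetical and are not the DeviceInfo API:

```python
def adjusted_peak_sketch(
    datasheet_peak: float, current_clock_mhz: float, max_clock_mhz: float
) -> float:
    """Hypothetical: scale a datasheet peak (flops or bandwidth) by the clock ratio."""
    if max_clock_mhz <= 0:
        raise ValueError("max_clock_mhz must be positive")
    # Assumption: linear scaling with clock; never report more than the datasheet value.
    ratio = min(current_clock_mhz / max_clock_mhz, 1.0)
    return datasheet_peak * ratio
```

Under this assumption, a device running at half its maximum clock would report half the datasheet peak.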

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162245
Approved by: https://github.com/v0i0, https://github.com/shunting314
2025-09-10 21:19:13 +00:00
0663bdb123 Move inductor jobs 3.9->3.10 (#162323)
Related to: https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162323
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-09-10 20:58:41 +00:00
40ea6e418a Revert "Fix decorators skipping NCCL tests (#158846)"
This reverts commit c2388201fc85b0748173212de5a17514c7a71f21.

Reverted https://github.com/pytorch/pytorch/pull/158846 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some inductor tests ([comment](https://github.com/pytorch/pytorch/pull/158846#issuecomment-3276471387))
2025-09-10 20:51:31 +00:00
348303ebd2 [ez] add docstring/typing for codegen_kernel_benchmark (#162609)
```
lintrunner init && lintrunner -m origin/main
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162609
Approved by: https://github.com/coconutruben
ghstack dependencies: #162442
2025-09-10 20:49:38 +00:00
94755e81c4 [inductor] Enable combo kernels with unbacked inputs (#162442)
An internal user tried enabling combo kernels, but ran into "Cannot convert symbols to int". This PR enables combo kernels on inputs with data-dependent shapes.

### Example exception

```
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 4997, in benchmark_combo_kernel
    kernel_code_list = self.generate_combo_kernel_code(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/simd.py", line 1849, in generate_combo_kernel_code
    src_code = kernel.codegen_kernel()
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 802, in codegen_kernel
    code.splice(self.codegen_kernel_benchmark(num_gb=0))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 852, in codegen_kernel_benchmark
    var_names.extend(self.kernel_benchmark_extra_args())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 733, in kernel_benchmark_extra_args
    extra_args.append(str(V.graph.sizevars.size_hint(tree.numel)))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 584, in size_hint
    return int(out)
           ^^^^^^^^
  File "/home/colinpeppler/.conda/envs/pytorch/lib/python3.12/site-packages/sympy/core/expr.py", line 307, in __int__
    raise TypeError("Cannot convert symbols to int")
torch._inductor.exc.InductorError: TypeError: Cannot convert symbols to int
```

Differential Revision: [D82042230](https://our.internmc.facebook.com/intern/diff/D82042230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162442
Approved by: https://github.com/jansel
2025-09-10 20:49:38 +00:00
6d65737aee testing infra and some fixes (#162183)
This PR is quite large in that it covers most of the rough edges in the new strict export flow:

1. Handle nn_module_stack correctly now that we are tracing the wrapper module.
2. module_call_spec needs to get queried from source directly because we are not running the bytecode anymore.
3. Correct input and output handling.

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183
Approved by: https://github.com/zhxchen17
2025-09-10 20:48:12 +00:00
053251b98d Revert "Make functorch notebook symlinks PEP 517 valid (#157813)"
This reverts commit b494547f0bd6cb1ce5d8d104cb419802434c9c08.

Reverted https://github.com/pytorch/pytorch/pull/157813 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but this surfaces a weird discrepancy between GitHub and Mercurial used internally ([comment](https://github.com/pytorch/pytorch/pull/157813#issuecomment-3276442242))
2025-09-10 20:45:48 +00:00
7e2e83cdbe [ONNX] Update export docstring (#162622)
Update export docstring to reflect the latest configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162622
Approved by: https://github.com/titaiwangms
2025-09-10 20:29:46 +00:00
d033d11d26 Revert "[torch][c10d] fix split_group in mixed backend case (#162424)"
This reverts commit 2dc26131801a430e030a773c4fbfe874e263259d.

Reverted https://github.com/pytorch/pytorch/pull/162424 on behalf of https://github.com/clee2000 due to failure seems related, maybe a hang/timeout distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_model_diff_shape_across_ranks log classifier is pointing at the wrong line ([comment](https://github.com/pytorch/pytorch/pull/162424#issuecomment-3276360494))
2025-09-10 20:13:44 +00:00
80d4da893c Revert "Put torchao (0.13.0) back to benchmark workflow (#162227)"
This reverts commit 00985970e312c3c5e674e8e14d39fe77c226600e.

Reverted https://github.com/pytorch/pytorch/pull/162227 on behalf of https://github.com/huydhn due to Crashing some inductor jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/162227#issuecomment-3276355034))
2025-09-10 20:11:37 +00:00
bf7f481144 Update misleading torch.sparse_coo_tensor error check (#161900)
Fixes #160622

### Summary
Updated the misleading torch.sparse_coo_tensor error check to provide clear context.
Earlier:
`RuntimeError: number of dimensions must be sparse_dim (3) + dense_dim (0), but got 1`

Updated:
`RuntimeError: 'len(size) == sparse_dim + dense_dim' is not satisfied: len(size) = 1, sparse_dim = 3, dense_dim = 0`

**Impacts:**

- Comprehensive error message that will improve developer experience.
- module: sparse
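
The updated message format can be sketched in plain Python. This is only an illustration of the check's shape; the function name is hypothetical and the real check lives in the sparse tensor construction code, not in this form:

```python
from typing import Sequence


def check_sparse_size_sketch(size: Sequence[int], sparse_dim: int, dense_dim: int) -> None:
    """Hypothetical re-creation of the check, using the updated message format."""
    if len(size) != sparse_dim + dense_dim:
        raise RuntimeError(
            "'len(size) == sparse_dim + dense_dim' is not satisfied: "
            f"len(size) = {len(size)}, sparse_dim = {sparse_dim}, dense_dim = {dense_dim}"
        )
```

The message names the violated condition and every value involved, so the caller can see which argument is inconsistent.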

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161900
Approved by: https://github.com/nikitaved, https://github.com/pearu
2025-09-10 19:57:11 +00:00
ab0694f1c6 [ROCm][Inductor][CK backend] Install rocm-composable-kernel python package on ROCm Linux CI docker images (#162288)
Reopened from #158747 which got reverted since without setuptools-scm in pytorch index URL the wheel cannot be built

We reconsider the original PR idea of introducing CK as a pytorch dependency on ROCm Linux and install the CK python package in CI only -- since (1) rocm-composable-kernel depends on setuptools-scm which depends on tomli and the existing index URLs need to be modified to host the new packages and (2) there also is a packaging [bug](https://github.com/pypa/setuptools/issues/3269#issuecomment-1254507377) in Ubuntu 22.04 which prevents correct dynamic version calculation with default system pip.

Extras:

 ->   this PR reconsiders how TORCHINDUCTOR_CK_DIR env variable is used; previously, this var was used to point to rocm-composable-kernel package installation path on the filesystem; now, the path is inferred by trying to import ck4inductor
 ->   the tests are updated to reflect this change
 ->   since in CI clang points to a bash script which invokes sccache, we cannot patch PATH to not contain sccache, this logic is removed from the testing code
 ->   scaled_mm test crashes during the benchmarking when the benchmarking happens in the main process, and times out benchmarking when it happens in a subprocess, on gfx942, so it is disabled
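
Inferring a package's install path from an import, rather than from an environment variable, could look roughly like the sketch below. It uses only the standard library; the helper name is hypothetical and this is not the inductor code:

```python
import importlib.util
import os
from typing import Optional


def infer_package_dir_sketch(module_name: str = "ck4inductor") -> Optional[str]:
    """Hypothetical: return the directory of an importable top-level module, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        # Module is not installed, or has no concrete file location.
        return None
    return os.path.dirname(spec.origin)
```

Callers can treat a `None` result as "package not installed" and skip the backend.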

TBD: roll back rocm-mi300 workflow before merging

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162288
Approved by: https://github.com/jeffdaily
2025-09-10 19:33:40 +00:00
5f630d28d7 [dynamo][guards] Do not construct entire framelocals dict for LAMBDA_GUARD (#162525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162525
Approved by: https://github.com/williamwen42
ghstack dependencies: #162509
2025-09-10 18:52:15 +00:00
a67e798cb7 [dynamo][guards] Prevent framelocals to dict conversion for not required LAMBDA_GUARD (#162509)
This is a smaller PR to reduce framelocals to dict conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162509
Approved by: https://github.com/williamwen42
2025-09-10 18:52:15 +00:00
30191fcf03 [inductor][choices] rename get_mm_configs to get_template_configs (#162293)
# why

- eventually we want all templates to go through this
- we're exposing this through diode as a sort of interface/API
- avoid later renaming

# what

- rename get_mm_configs to get_template_configs
- rename _finalize_mm_configs to _finalize_template_configs

# testing

- lintrunner
- ci

Differential Revision: [D81820641](https://our.internmc.facebook.com/intern/diff/D81820641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162293
Approved by: https://github.com/eellison
ghstack dependencies: #161351, #161350
2025-09-10 18:47:44 +00:00
623e623c82 [inductor] leverage template stacking in V.choices.get_mm_configs (#161350)
# why

- now everything is in place to just gather templates and run
  the V.choices.get_mm_configs once per op
- enables any overrides inside V.choices.get_mm_configs to
  have a full view of the options for an op, not just for
  one template

# what

- replace multiple calls to V.choices.get_mm_configs with
  calls to gather the active templates, and then using those
  in a single call

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520571](https://our.internmc.facebook.com/intern/diff/D81520571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161350
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161351
2025-09-10 18:47:44 +00:00
f08487aa86 [inductor] FlexibleLayout for ExternKernelChoice for mms (#161351)
# why

- if we only use ExternKernelChoice we're not doing any codegen
- if we're not doing any codegen, we can use a FlexibleLayout
  here, and provide deeper passes more chances to change it

# what

- if all the kernel template choices (KTC) use an ExternKernelChoice
  template, we switch to a FlexibleLayout before generating the choice
- add a test to make sure that works as intended (FlexibleLayout for
  only extern, and FixedLayout if Triton is involved)

- caveats:
    - because CPP, CUTLASS, and CK are not using
       V.choices.get_mm_configs yet, we turn off the optimization
       if any of those backends is in use. This will be relaxed
       once they support this too
    - because Triton templates are still using their own calls
       (not a single call) to get_mm_configs, it's also turned
       off there. The next diff unifies Triton + ATEN to a single
       call to get_mm_configs and that in turn allows the optimization
       there too
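
The decision rule described above can be sketched as follows. Template kinds are modeled as plain strings purely for illustration; the function name and the string labels are hypothetical, and the backend caveats are not modeled:

```python
from typing import Sequence


def pick_layout_kind_sketch(template_kinds: Sequence[str]) -> str:
    """Hypothetical: flexible layout only if every choice is an extern kernel."""
    only_extern = len(template_kinds) > 0 and all(
        kind == "extern" for kind in template_kinds
    )
    # Any codegen-based template (e.g. Triton) forces a fixed layout.
    return "FlexibleLayout" if only_extern else "FixedLayout"
```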

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520584](https://our.internmc.facebook.com/intern/diff/D81520584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161351
Approved by: https://github.com/eellison, https://github.com/jansel
2025-09-10 18:47:34 +00:00
1051c7dbc2 Don't unconditionally import torch._dynamo, it's slow (#162595)
A trivial test on OS X.

Before:

```
real	0m6.550s
user	0m2.532s
sys	0m3.359s
```

After:

```
real	0m2.607s
user	0m1.898s
sys	0m3.344s
```
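
The general pattern is to defer an expensive import until the code path that needs it actually runs. A minimal sketch using a standard-library module as a stand-in for the slow import (the function below is hypothetical and unrelated to the PyTorch change itself):

```python
def render_fraction_sketch(numerator: int, denominator: int) -> str:
    # Deferred import: the module is loaded on first call, not when this file
    # is imported, so callers that never use the function never pay for it.
    from fractions import Fraction

    return str(Fraction(numerator, denominator))
```

After the first call, later calls find the module in `sys.modules`, so the import cost is paid at most once.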

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162595
Approved by: https://github.com/albanD
2025-09-10 17:21:03 +00:00
suo
2dc2613180 [torch][c10d] fix split_group in mixed backend case (#162424)
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options.

However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.

This leads to an assert on some user code, where ProcessGroupGloo::split is expecting gloo options but receives nccl options instead.

Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.

As a quick fix, since user-provided options really only exist for NCCL, just warn and fall-back to defaulted options for Gloo if non-gloo options are detected.
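
The warn-and-fall-back behavior can be sketched in Python. The option classes and the function below are hypothetical stand-ins, not the c10d types, and only illustrate the control flow:

```python
import warnings
from dataclasses import dataclass


@dataclass
class GlooOptionsSketch:  # hypothetical stand-in for Gloo backend options
    timeout_s: float = 30.0


@dataclass
class NcclOptionsSketch:  # hypothetical stand-in for NCCL backend options
    is_high_priority_stream: bool = False


def options_for_gloo_split_sketch(parent_options: object) -> GlooOptionsSketch:
    """Hypothetical: reuse Gloo options, otherwise warn and use defaults."""
    if isinstance(parent_options, GlooOptionsSketch):
        return parent_options
    warnings.warn("non-gloo options passed to a gloo split; using default options")
    return GlooOptionsSketch()
```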

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang
2025-09-10 16:59:18 +00:00
582d278983 Build and Install Arm Compute Library in manylinux docker image (#159737)
----

This PR is part of a series of PRs that aim to remove the `.ci/aarch64_linux` folder entirely, such that the Aarch64 manylinux build happens as part of `.ci/manywheel/build.sh`, the same as other platforms.

In this PR:

- We prebuild + install Arm Compute Library in the manylinux docker image (at /acl), instead of building it during every pytorch build. Also updated the jammy install path to be /acl too.
- We can therefore remove build_ArmComputeLibrary functions from the ci build scripts.
- There is also some refactoring of install_openblas.sh and install_acl.sh to align them (similar formatting, similar variable names, same place for version number update).
- We had 2 places to define the openblas version; this has been reduced to 1 now (install_openblas.sh).
- ACL_VERSION and OPENBLAS_VERSION can now be overridden at the build.sh level for developers, but there is only 1 version of each hardcoded for ci.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159737
Approved by: https://github.com/seemethere
ghstack dependencies: #160078
2025-09-10 15:39:38 +00:00
b5e6e58050 [nn] Assert parsed iterable arguments are an appropriate length (#162340)
Fixes #162327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162340
Approved by: https://github.com/Skylion007
2025-09-10 15:15:49 +00:00
fefc406a3d fix typo: summit -> submit (#162587)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162587
Approved by: https://github.com/justinchuby
2025-09-10 14:43:53 +00:00
3d32bb114b [CD] Aarch64 Fix packaging `libarm_compute.so` and other libraries to the aarch64 CUDA wheels (#162566)
Fixes aarch64 Linux packaging, which failed with the following error:
https://github.com/pytorch/vision/actions/runs/17612462583/job/50037380487#step:15:62
```
Traceback (most recent call last):
  File "/__w/vision/vision/pytorch/vision/setup.py", line 13, in <module>
    import torch
  File "/__w/_temp/conda_environment_17612462583/lib/python3.11/site-packages/torch/__init__.py", line 415, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libarm_compute.so: cannot open shared object file: No such file or directory
```
The import fails due to missing dependencies.

Current Error:
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl renamed as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
Hence the repackaging has no effect.

This PR does the following:
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
File torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is deleted
File is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl

It looks like renaming the wheel is no longer necessary after migrating from zipping the wheel to `wheel pack`. Hence this PR removes the renaming and deletes the old file.
```
2025-09-10T10:10:05.9652454Z Using nvidia libs from pypi - skipping CUDA library bundling
2025-09-10T10:10:05.9656595Z Copying to /pytorch/dist/tmp/torch/lib/libgomp.so.1
2025-09-10T10:10:05.9873843Z Copying to /pytorch/dist/tmp/torch/lib/libgfortran.so.5
2025-09-10T10:10:06.0410041Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute.so
2025-09-10T10:10:06.2869242Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute_graph.so
2025-09-10T10:10:06.4385740Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_lp64_gomp.so.0
2025-09-10T10:10:06.5461372Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_lp64_gomp.so.0
2025-09-10T10:10:06.5728970Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_core.so.0
2025-09-10T10:10:06.6231872Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_core.so.0
2025-09-10T10:10:14.1503110Z Updated tag from Tag: cp310-cp310-linux_aarch64
2025-09-10T10:10:14.1503482Z  to Tag: cp310-cp310-manylinux_2_28_aarch64
2025-09-10T10:10:14.1503682Z
2025-09-10T10:10:41.6498892Z Repacking wheel as /pytorch/dist/torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
2025-09-10T10:10:41.9394460Z Renaming torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl wheel to torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
```

Test Plan, Executed on local file:
```
  inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/WHEEL
  inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/entry_points.txt
  inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/top_level.txt
  inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/RECORD
Bundling CUDA libraries with wheel
Updated tag from Tag: cp310-cp310-manylinux_2_28_aarch64
 to Tag: cp310-cp310-manylinux_2_28_aarch64

Repacking wheel as ubuntu/dist/torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
Copying torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl to artifacts
Build Complete. Created torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl..
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162566
Approved by: https://github.com/jeanschmidt, https://github.com/NicolasHug
2025-09-10 14:22:41 +00:00
de05dbc39c Replace export_for_training with export (#162396)
Summary: replace export_for_training with export

Test Plan:
CI

Rollback Plan:

Differential Revision: D81935792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162396
Approved by: https://github.com/angelayi, https://github.com/jerryzh168
2025-09-10 14:19:34 +00:00
fc1b09a52a Revert "Fix DCE eliminating in-place operations by improving Node.is_impure() (#162267)"
This reverts commit b9a7d0e13b4a34be83c778734dbad437c7c5117b.

Reverted https://github.com/pytorch/pytorch/pull/162267 on behalf of https://github.com/malfet due to Not sure how it happened, but looks like it broke everything, see c2388201fc/1 ([comment](https://github.com/pytorch/pytorch/pull/162267#issuecomment-3275164109))
2025-09-10 14:12:22 +00:00
c2388201fc Fix decorators skipping NCCL tests (#158846)
Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip`

In particular, the test is no longer started (which forks subprocesses) only to be stopped again (which kills those subprocesses) during test setup.

Using `unittest.skip` decorators avoids the starting of the test in the first place.
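
A minimal sketch of the difference, using only the standard `unittest` module. The `backend_available` helper and the test class are hypothetical stand-ins, not the actual NCCL test decorators; the point is that a decorated test is reported as skipped before `setUp` ever runs.

```python
import unittest


def backend_available() -> bool:
    # Hypothetical capability check; a real suite would probe for NCCL here.
    return False


class ExampleTest(unittest.TestCase):
    setup_calls = []

    def setUp(self):
        # In a multi-process test this is where subprocesses would be forked.
        ExampleTest.setup_calls.append(self.id())

    @unittest.skipIf(not backend_available(), "backend not available")
    def test_needs_backend(self):
        self.fail("should have been skipped")


def run_example() -> unittest.TestResult:
    # Run the test case programmatically and return the collected result.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Because the skip is decided by the decorator, `setUp` is never called for the skipped test, so nothing is started only to be torn down again.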

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846
Approved by: https://github.com/Skylion007
2025-09-10 12:25:42 +00:00
a6f9e0e62a [c10d][nvshmem] fix override function modifier (#162515)
Summary: Fix a compilation error in fbsource caused by a missing override modifier

Differential Revision: D82038876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162515
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-09-10 11:35:49 +00:00
337fe1079d [nativert] AOTI delegate with flat inputs and outputs (#162538)
Summary: `executorch_call_delegate` should have flattened inputs and outputs, so that it can be correctly serialized and the input/output specs are consistent with the runtime.

Test Plan:
CI

Rollback Plan:

Differential Revision: D82064354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162538
Approved by: https://github.com/dolpm
2025-09-10 11:35:44 +00:00
b494547f0b Make functorch notebook symlinks PEP 517 valid (#157813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157813
Approved by: https://github.com/zou3519, https://github.com/atalman
2025-09-10 10:13:24 +00:00
d9832d8425 [triton][export] serialization in internal path + unit tests (#162200)
Summary: will package triton artifacts to be runnable in nativert if wrappers exist.

Test Plan:
unit tests

Rollback Plan:

Differential Revision: D81368559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162200
Approved by: https://github.com/angelayi
2025-09-10 09:49:10 +00:00
f0ae3a57f6 [Optimus] Add batch dropout pattern (#162443)
Summary: We observe a dropout pattern in AFOC, so we add a new pattern to Optimus

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion -- test_batch_dropout_pre_grad_fusion
```

Buck UI: https://www.internalfb.com/buck2/2c899fb5-6e8b-43eb-8fb3-b53abfbfa6d9
Test UI: https://www.internalfb.com/intern/testinfra/testrun/15762598805248688
Network: Up: 0B  Down: 0B  (reSessionID-bfbb9e6a-7e2a-425a-a027-b44282cef419)
Executing actions. Remaining     0/3                                                                                                     1.3s exec time total
Command: test.     Finished 2 local
Time elapsed: 1:22.3s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

### E2E

baseline
f791163796

proposal
f793225207

Rollback Plan:

Differential Revision: D81981264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162443
Approved by: https://github.com/Yuzhen11, https://github.com/mlazos
2025-09-10 09:49:01 +00:00
26b3ae5890 Move prioritized text linker optimization code from setup.py to cmake (#160078)
Note. This is a replica PR of #155901 which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.

### Summary

🚀 This PR moves the prioritized text linker optimization from setup.py to cmake (and enables it by default on Linux aarch64 systems).

This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.

### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.

Note:

Due to ninja/cmake graph generation issues, we cannot apply the linker file globally to all targets, so the targets must be defined manually. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, and torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.

Co-authored-by: Usamah Zaheer <usamah.zaheer@arm.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160078
Approved by: https://github.com/seemethere
2025-09-10 09:21:53 +00:00
be8095b07f [DeviceMesh] Clarifying flatten use case (#161311)
We are in the middle of a big refactoring that simplifies the bookkeeping for device mesh, and we found an interesting bug in the DeviceMesh flatten implementation. Here are the findings:
1. In the unit tests, we assume users can call `dp_cp_mesh._flatten()` many times without a new backend being created (i.e. it is cached).
2. In the slicing implementation, we actually throw an exception when `_flatten` is called more than once. This bug was partially fixed in https://github.com/pytorch/pytorch/pull/160709, but that PR did not fix the check for the case where `_flatten` is called twice.

The more important question is: what behavior do we want for `_flatten`? Do we allow calling `_flatten` multiple times (with the same mesh_name)? I think we should. Why?
1. We allow slicing for the same mesh_name or name_list multiple times, and we cache the PG behind it. Although we return a new device mesh object every time, they all compare equal (according to __eq__).
2. We already cache the flattened mesh today inside `root_to_flatten_mapping` and do an early return, but that line is never reached if we error out before it.

We should also allow flattening a 1D mesh into its own mesh_dim_name as a no-op; I added a unit test for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161311
Approved by: https://github.com/fegin
2025-09-10 07:46:51 +00:00
b2d8f6a6af [OpenReg] Update the docs about Accelerator Integration (#162046)
Fix the issue described by this [comment](https://github.com/pytorch/pytorch/pull/161845#discussion_r2317299390)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162046
Approved by: https://github.com/albanD
2025-09-10 07:45:07 +00:00
98e22c8a69 Skip test_ind_worker_queue on Windows and macOS (flaky) (#162555)
Fixes https://github.com/pytorch/pytorch/issues/68643

It was closed by the bot yesterday and the issue was still there https://github.com/pytorch/pytorch/actions/runs/17595694816/job/49989589647.  It's better to just skip it directly in the code as this test has been disabled on Windows and MacOS since 2021 O_o
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162555
Approved by: https://github.com/clee2000
2025-09-10 07:05:14 +00:00
e1f0a69943 Revert "test fixing benchmarks (#162503)"
This reverts commit 484c4093a87a3e6767e55ed553f95db8fc137442.

Reverted https://github.com/pytorch/pytorch/pull/162503 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it regresses CPU perf smoke test ([comment](https://github.com/pytorch/pytorch/pull/162503#issuecomment-3273554680))
2025-09-10 06:55:35 +00:00
833997a6fd [Inductor][UT] Fix flex attention related inductor cases (#162450)
## Motivation
Fixes #162435, Fixes #162436

UT failures:
* https://github.com/pytorch/pytorch/actions/runs/17523991468/job/49772651636
* https://github.com/pytorch/pytorch/actions/runs/17523991468/job/49772651637

To fix flex attention related cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162450
Approved by: https://github.com/drisspg
2025-09-10 06:48:00 +00:00
b9a7d0e13b Fix DCE eliminating in-place operations by improving Node.is_impure() (#162267)
Change is_impure to check in-place operations on Node to prevent eliminate_dead_code from eliminating in-place operations.
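
As a rough illustration of why purity matters for dead code elimination, here is a toy sketch. The `ToyNode` class and this `eliminate_dead_code` function are hypothetical and are not the `torch.fx` implementation; the sketch assumes, for illustration only, that an op name ending in `_` denotes an in-place operation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyNode:
    name: str
    op: str
    inputs: List[str] = field(default_factory=list)

    def is_impure(self) -> bool:
        # Assumption for this sketch: a trailing underscore marks an in-place op.
        return self.op.endswith("_")


def eliminate_dead_code(nodes: List[ToyNode], outputs: List[str]) -> List[ToyNode]:
    # Walk the graph backwards, keeping nodes that are used or have side effects.
    live = set(outputs)
    kept = []
    for node in reversed(nodes):
        if node.name in live or node.is_impure():
            kept.append(node)
            live.update(node.inputs)
    kept.reverse()
    return kept


graph = [
    ToyNode("x", "placeholder"),
    ToyNode("y", "add", ["x"]),
    ToyNode("unused", "mul", ["x"]),
    ToyNode("mutated", "add_", ["x"]),
]
kept_names = [n.name for n in eliminate_dead_code(graph, outputs=["y"])]
```

The unused pure `mul` node is dropped, while the in-place `add_` node survives even though nothing reads its result, because removing it would change what `x` holds.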

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162267
Approved by: https://github.com/ezyang
2025-09-10 06:02:15 +00:00
1c16c18a53 [nativert][triton] improve hardware registration (#162499)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D82031814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162499
Approved by: https://github.com/angelayi
2025-09-10 04:52:57 +00:00
96ef26f71a Revert "[ROCm] Integrate AITER Fav3 fwd kernels (#160105)"
This reverts commit d2393c2d7da03a1523a12e6f80edb6bd7b464ec5.

Reverted https://github.com/pytorch/pytorch/pull/160105 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing internal ROCm build ([comment](https://github.com/pytorch/pytorch/pull/160105#issuecomment-3273297183))
2025-09-10 04:42:28 +00:00
5ac112b569 [dynamo] Graph break on on user-defined class in compiled region (#161670)
Currently, user-defined classes inside of a compiled frame will cause the whole
frame to be skipped by dynamo.  This change defers the Unsupported exception
until the __build_class__ builtin is actually called, which allows a graph break
to be inserted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161670
Approved by: https://github.com/williamwen42, https://github.com/guilhermeleobas
2025-09-10 04:39:20 +00:00
dda071587f Revert "Make distributed modules importable even when backend not built (#159889)" (#162568)
This reverts commit a0d026688cd69583d5a4e0c6f3e5fda141a7f4a9.

Revert "Always build USE_DISTRIBUTED. (#160449)"

This reverts commit d80297a6846f1f2c36fd4f19e22919f2abe8fcea.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162568
Approved by: https://github.com/huydhn
2025-09-10 04:29:42 +00:00
11acfed3ce [audio hash update] update the pinned audio hash (#162552)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162552
Approved by: https://github.com/pytorchbot
2025-09-10 04:24:39 +00:00
5f40a8a9a3 [BE] Fix '_WIN32' is not defined warning (#162516)
Summary: As it is indeed defined on neither Linux nor macOS platforms

Test Plan:
CI

Rollback Plan:

Differential Revision: D82044853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162516
Approved by: https://github.com/Skylion007
2025-09-10 04:21:38 +00:00
e64965300a Repackage vLLM nightlies (#162371)
I suspected that I would need to repack vLLM wheels from https://github.com/pytorch/pytorch/pull/162000 because I renamed the wheel, and it turns out to be true.  The error is as follows:

```
$ uv pip install --pre xformers --index-url https://download.pytorch.org/whl/nightly/cu129
Using Python 3.12.11+meta environment at: venv/py3.12
Resolved 28 packages in 759ms
error: Failed to install: xformers-0.0.33.dev20250901+cu129-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (xformers==0.0.33.dev20250901+cu129)
  Caused by: Wheel version does not match filename: 0.0.33+5d4b92a5.d20250907 != 0.0.33.dev20250901+cu129
```
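
The error above can be reproduced in spirit with a small check like the following. This is a simplified sketch, not how `uv` or `pip` validate wheels: it assumes the standard wheel filename layout (`{distribution}-{version}-...-{platform}.whl`) and compares raw strings, whereas real installers normalize versions first. The function names are hypothetical.

```python
def wheel_filename_version(filename: str) -> str:
    # The version is the second dash-separated field of a wheel filename.
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    stem = filename[: -len(".whl")]
    return stem.split("-")[1]


def versions_match(filename: str, metadata_version: str) -> bool:
    # Compare the filename's version with the one recorded in the wheel metadata.
    return wheel_filename_version(filename) == metadata_version


renamed_wheel = (
    "xformers-0.0.33.dev20250901+cu129-cp39-abi3-"
    "manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl"
)
mismatch = not versions_match(renamed_wheel, "0.0.33+5d4b92a5.d20250907")
```

Renaming the file changes only the first of the two versions, which is why the wheel has to be repacked so that its metadata agrees with the new name.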

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162371
Approved by: https://github.com/atalman
2025-09-10 04:02:34 +00:00
00985970e3 Put torchao (0.13.0) back to benchmark workflow (#162227)
0.13.0 was released on Sep 3rd https://pypi.org/project/torchao/#history, which should have fixed the crashing issue on transformers now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162227
Approved by: https://github.com/malfet
2025-09-10 03:56:25 +00:00
484c4093a8 test fixing benchmarks (#162503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162503
Approved by: https://github.com/huydhn
ghstack dependencies: #160741
2025-09-10 03:15:49 +00:00
760c478a14 [FlexAttn][Minor] Update FlexConfig doc (#162533)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162533
Approved by: https://github.com/drisspg
2025-09-10 02:03:48 +00:00
dc4f97e9c1 [triton] enable int64 indexing in convolution and mm template (#162506)
Summary: hitting illegal memory access issue when compiling conv and addmm kernels with the change in https://github.com/pytorch/pytorch/pull/157767

Differential Revision: D81995664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162506
Approved by: https://github.com/iseeyuan
2025-09-10 01:53:26 +00:00
c66e58b7d0 [ONNX] Expose the testing module (#162495)
* Created a new module `torch/onnx/testing.py` that exposes the `assert_onnx_program` function for testing exported ONNX models.
* Updated the ONNX documentation (`docs/source/onnx.md`) to include `onnx_testing` in the list of relevant modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162495
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2025-09-10 01:40:24 +00:00
878f59ef75 DeviceMesh: support _rank for use with non-global PGs (#162439)
Summary: This adds a `_rank` field to DeviceMesh init that allows for instantiating a DeviceMesh without depending on `dist.get_rank()` which requires a global PG to be instantiated.

Test Plan:
```
buck2 test mode/opt -c fbcode.enable_gpu_sections=true  //caffe2/test/distributed:device_mesh -- init_backend
```

Rollback Plan:

Differential Revision: D81981777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162439
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2025-09-10 01:18:28 +00:00
e60ad4f628 [DTensor] fix copy_ strategy to support linearity (#162460)
Fixing issue introduced in https://github.com/pytorch/pytorch/pull/158538
where `aten.copy_.default` is registered as a pointwise op, but without linearity.

In particular, when both `src` and `dst` tensors have same `Partial` placements, direct copy should happen without redistribute, instead of redistributing both to `Replicate` before making the copy.

This was discovered from silent incorrect results e.g. on `torch.einsum` backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162460
Approved by: https://github.com/zpcore
2025-09-10 00:47:14 +00:00
2281d009e5 Revert "[ROCm] Add specific compile options for CK SDPA (#161759)"
This reverts commit d22d916719eb7daff8455a01d216d65f81899a9e.

Reverted https://github.com/pytorch/pytorch/pull/161759 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to break internal ROCm jobs ([comment](https://github.com/pytorch/pytorch/pull/161759#issuecomment-3272807726))
2025-09-10 00:44:30 +00:00
33589374b6 [DCP] Avoid multiple storage writer resets in async save (#159448)
Summary: Avoid multiple storage writer resets in async save. Currently the reset gets called by the async_save method and then again in the save method. In the async path, async_save should only do the staging and the reset should only happen in the synchronous save path.

Test Plan:
```
buck test 'fbcode//mode/opt' //aiplatform/modelstore/experimental/DCP/tests:checkpoint_dist_client_test
```
https://www.internalfb.com/intern/testinfra/testrun/15199648841705052

Rollback Plan:

Differential Revision: D79230339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159448
Approved by: https://github.com/meetv18
2025-09-10 00:43:03 +00:00
5539916fe1 [dynamo][refactor] Move get_framelocals_idx to a helper (#162519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162519
Approved by: https://github.com/williamwen42
2025-09-10 00:35:09 +00:00
e4174b1fd7 remove gso from collapse_view_helper (#162212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162212
Approved by: https://github.com/aorenste

Co-authored-by: Aaron Orenstein <aorenste@fb.com>
2025-09-10 00:17:15 +00:00
0e7ccc09db [easy] Don't force copy result of getAllOperatorsFor in init.cpp (#162218)
It returns a const reference to a vector.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162218
Approved by: https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219, #162220
2025-09-10 00:08:15 +00:00
87cc126457 [associative_scan] partial gradient support (#162388)
This PR tests the partial gradient support of the `associative_scan` operation. It replaces https://github.com/bohnstingl/pytorch/pull/6

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162388
Approved by: https://github.com/ydwu4
2025-09-09 23:52:29 +00:00
a3e26d1727 Revert "[dynamo] Graph break on on user-defined class in compiled region (#161670)"
This reverts commit e2545487de3dbbe663e3f0adb699547a14da0f6a.

Reverted https://github.com/pytorch/pytorch/pull/161670 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing a trunk test ([comment](https://github.com/pytorch/pytorch/pull/161670#issuecomment-3272626391))
2025-09-09 23:40:26 +00:00
d2393c2d7d [ROCm] Integrate AITER Fav3 fwd kernels (#160105)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160105
Approved by: https://github.com/jeffdaily
2025-09-09 22:30:12 +00:00
b498299953 154849 Add support to handle SIGUSR1 and SIGUSR2 in multiprocessing (#160690)
Fixes #154849

This change addresses the request to add support for SIGUSR1 and SIGUSR2 signals in torchrun for SLURM environments. It supports these signals through the configurable `TORCHELASTIC_SIGNALS_TO_HANDLE` environment variable and the signals_to_handle parameter of the launcher API.
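
A minimal sketch of how a configurable signal list could be turned into handlers, using only the standard `signal` module. The comma-separated format and both helper names are assumptions made for illustration; they are not taken from the torchelastic source. The sketch is POSIX-only, since SIGUSR1 and SIGUSR2 do not exist on Windows.

```python
import signal
from typing import Callable, List


def parse_signal_names(value: str) -> List[signal.Signals]:
    # Assumed format: comma-separated names such as "SIGTERM,SIGUSR1".
    names = [name.strip() for name in value.split(",") if name.strip()]
    # Unknown names raise KeyError, which surfaces configuration typos early.
    return [signal.Signals[name] for name in names]


def install_handlers(signals: List[signal.Signals], handler: Callable) -> None:
    # Must be called from the main thread, as signal.signal requires.
    for sig in signals:
        signal.signal(sig, handler)
```

A launcher could then read the environment variable, parse it with `parse_signal_names`, and pass the result to `install_handlers` together with its shutdown callback.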

Tests:
For validation purposes:
test_signal_handling.py,
simple_test_api_signal_handling.py,

Unit Tests:
for launcher changes:launcher/test_api.py
for api changes:  multiprocessing/test_api.py
E2E: test_run.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160690
Approved by: https://github.com/fduwjj
2025-09-09 22:23:06 +00:00
4d66a3b894 fix Dtensor doc link (#162494)
Small fix for https://docs.pytorch.org/docs/main/distributed.tensor.parallel.html
<img width="890" height="274" alt="image" src="https://github.com/user-attachments/assets/6ee7fc7c-e0fe-4f5e-ab7e-a895bb3fa79f" />

now it is:

<img width="909" height="320" alt="image" src="https://github.com/user-attachments/assets/8b2c41ef-1684-4597-8dae-144b49723796" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162494
Approved by: https://github.com/XilunWu
2025-09-09 22:10:37 +00:00
e2545487de [dynamo] Graph break on on user-defined class in compiled region (#161670)
Currently, user-defined classes inside of a compiled frame will cause the whole
frame to be skipped by dynamo.  This change defers the Unsupported exception
until the __build_class__ builtin is actually called, which allows a graph break
to be inserted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161670
Approved by: https://github.com/williamwen42, https://github.com/guilhermeleobas
2025-09-09 21:07:49 +00:00
8922bbcaab Use same NVSHMEM version across CUDA builds (#162206)
#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20.
This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162206
Approved by: https://github.com/tinglvv, https://github.com/Skylion007
2025-09-09 20:59:50 +00:00
14744e1ab2 [Release 2.9] Add compatibility matrix, Version Bump (#162526)
Release 2.9
1. Add release compatibility matrix
2. Add version bump for 2.10
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162526
Approved by: https://github.com/malfet
2025-09-09 20:38:15 +00:00
b477fb106f [ROCm] enable grouped gemm fallback (#162419)
Enables bf16 group gemm alternative path as described in #161366
Fast path will be enabled in future through CK integration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162419
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-09 20:04:56 +00:00
d22d916719 [ROCm] Add specific compile options for CK SDPA (#161759)
Updates CK version and adds CK specific compilation options

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161759
Approved by: https://github.com/jeffdaily
2025-09-09 20:04:19 +00:00
86d34a43f5 NamedTuple: Allow side effects for dynamic attributes (#161645)
I confirmed that the tracing was correct i.e. NamedTupleVariable had the correct dynamic attribute added to it.

The problem was that NamedTupleVariable was always marked as immutable. This does not reflect the behavior of namedtuple.

Subclasses of namedtuple may be mutable, so when a NamedTupleVariable is derived from a subclass that is mutable, I made NamedTupleVariable mutable as well. Then side_effects correctly updates the returned object.
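
The plain-Python behavior being matched can be seen in a short example. The class names here are illustrative only: a class produced by `namedtuple` defines empty `__slots__` and rejects new attributes, while a subclass without `__slots__` gets an instance `__dict__` and accepts them.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])


class MutablePoint(Point):
    # No __slots__ declared, so instances carry a __dict__ for extra attributes.
    pass


def set_extra(point, value):
    # Attach a dynamic attribute; this only works when the instance has a __dict__.
    point.extra = value
    return point


mutable = set_extra(MutablePoint(1, 2), "tag")

try:
    set_extra(Point(1, 2), "tag")
    plain_accepts_attribute = True
except AttributeError:
    plain_accepts_attribute = False
```

The tuple fields themselves stay read-only in both cases; only the dynamic attribute on the subclass instance is a side effect that a tracer has to replay.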

Fixes #161610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161645
Approved by: https://github.com/anijain2305, https://github.com/StrongerXi
2025-09-09 19:42:02 +00:00
8508651477 Fix flaky AOTFxirTestCase (#162472)
Fixes https://github.com/pytorch/pytorch/issues/162357
Fixes https://github.com/pytorch/pytorch/issues/160970
Fixes https://github.com/pytorch/pytorch/issues/161038
Fixes https://github.com/pytorch/pytorch/issues/160951
Fixes https://github.com/pytorch/pytorch/issues/161698

These tests were introduced in https://github.com/pytorch/pytorch/pull/160765 and they are all flaky when `torch._inductor.aot_compile` uses multiple threads (the default option).  The issue could be reproduced by running them locally multiple times.  For example,

```
pytest --flake-runs 10 --flake-finder -v inductor/test_fxir_backend.py -k test_aoti_fx_add
(output logs at P1938386961)
...
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)]
graph_break []
================================================================================================================================================= short test summary info ==================================================================================================================================================
FAILED [0.4834s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__'
FAILED [0.4576s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__'
FAILED [0.4613s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__'
=============================================================================================================================================== 3 failed, 7 passed in 12.89s ===============================================================================================================================================
```

Setting `compile_threads` to 1 will get rid of the test flakiness, but there might be underlying issues from https://github.com/pytorch/pytorch/pull/160765.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162472
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-09-09 19:39:24 +00:00
723c27ed78 [standalone_compile] binary format write should be atomic (#162432)
We update it to call write_atomic instead of file.write
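
For context, an atomic write is commonly built from a temporary file plus a rename. The sketch below shows that general pattern with the standard library; `write_atomic_sketch` is a hypothetical name and this is not the body of PyTorch's `write_atomic` helper.

```python
import os
import tempfile


def write_atomic_sketch(path: str, data: bytes) -> None:
    # Create the temp file in the target directory so the rename stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the file in as a single step, so readers never see partial data.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With a plain `file.write`, a crash or a concurrent reader can observe a half-written artifact; with the rename-based approach the target path holds either the old content or the complete new content.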

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162432
Approved by: https://github.com/oulgen
2025-09-09 18:43:13 +00:00
78b4d254aa Ready to land
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-09-09 11:42:27 -07:00
bdbe931d58 [build] Add LeakSanitizer option to CMake (#158686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158686
Approved by: https://github.com/eellison
2025-09-09 18:41:20 +00:00
af60398c3a Update the operator benchmarking, to benchmark using torch.compile (#161394)
This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to the existing Eager and JIT modes. It also adds peak memory measurement (fwd/bwd pass), improves the JSON output format used by the dashboard for reporting, and introduces some more CLI options. The new CLI flags introduced are:

- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name
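
The flag behavior listed above could be expressed with `argparse` roughly as follows. This is a sketch assuming the flags are plain `argparse` options; it is not the benchmark suite's actual parser, and the `build_parser` name and help strings are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="operator benchmark flags (sketch)")
    # --use-jit and --use-compile cannot be combined.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--use-jit", action="store_true", help="benchmark with JIT")
    mode.add_argument("--use-compile", action="store_true", help="benchmark with torch.compile")
    parser.add_argument("--benchmark-name", default=None, help="name reported in the output")
    parser.add_argument(
        "--output-json-for-dashboard",
        default="benchmark-results.json",
        help="path of the JSON file consumed by the dashboard",
    )
    return parser


args = build_parser().parse_args(["--use-compile"])
```

Passing both `--use-jit` and `--use-compile` makes `argparse` print a usage error and exit, which is one straightforward way to enforce the mutual exclusivity.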

Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser
2025-09-09 18:17:37 +00:00
82f1eb9b03 Revert "[MPS] mps sparse mul op implementation (#162349)"
This reverts commit 3ea686804925f1291de57ffdb3394da0b46deb54.

Reverted https://github.com/pytorch/pytorch/pull/162349 on behalf of https://github.com/malfet due to Fails trunk tests, with uint8 sum ([comment](https://github.com/pytorch/pytorch/pull/162349#issuecomment-3271783442))
2025-09-09 18:14:16 +00:00
4b2d297eec python fastpath for DTensor detach(), confirm that aliasing DTensorSpec is ok (#160580)
My goal right now is to try to make the "vanilla" AccumulateGrad path for DTensor (that just calls detach) fast. I'm doing this in two steps:

(1) [this PR]: hardcode aten.detach in DTensor to re-use the input tensor's DTensorSpec, instead of running "real" sharding prop.

(2) [assuming success of 1]: move the detach() call into C++, try adding a DTensor dispatch key, and avoid dispatching back to python entirely (except for some code that probably needs to allocate a pyobject for the output DTensor, from C++)

I'm pushing this PR first to confirm that I don't break anything with my detach fastpath. I did some manual local testing to confirm that for normal usages of detach, the input and output DTensor have equal DTensorSpec objects. Technically, we previously would allocate a fresh DTensorSpec, and with this change we are just re-using the input tensor's DTensorSpec. So I'm mostly hoping that DTensorSpecs don't generally get mutated.

This by itself does seem to speed up `alias` by quite a bit (roughly 2.5x speedup, from ~336us -> 133us):

**aten.detach(plain_tensor)**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f8da2921790>
_ = x.detach()
  4.80 us
  1 measurement, 100000 runs , 1 thread
```

**aten.detach(DTensor) [before this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f47cd68e750>
_ = x_dt.detach()
  336.40 us
  1 measurement, 1000 runs , 1 thread
```

**aten.detach(DTensor) [after this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f0a34c05520>
_ = x_dt.detach()
  Median: 133.45 us
  2 measurements, 1000 runs per measurement, 1 thread
```

benchmark script:
```
import torch
import torch.distributed as dist
from torch.distributed.tensor import DeviceMesh, DTensor, Partial, Replicate, Shard
from torch.testing._internal.distributed.fake_pg import FakeStore
import torch.utils.benchmark as benchmark

# Use the fake process group so the script runs in a single process.
fake_store = FakeStore()
dist.init_process_group("fake", store=fake_store, rank=0, world_size=2)

# Wrap a local tensor as a DTensor sharded along dim 0 of a 1D mesh of size 2.
mesh = torch.distributed.device_mesh.init_device_mesh('cuda', (2,))
x = torch.randn(4, 4, requires_grad=True)
x_dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)

# Time only the detach() call on the DTensor.
t0 = benchmark.Timer(
    stmt='_ = x_dt.detach()',
    globals={'x_dt': x_dt},
)
print(t0.blocked_autorange())

dist.destroy_process_group()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160580
Approved by: https://github.com/ezyang
2025-09-09 18:04:56 +00:00
0ec723acd0 Update docs for quantile to be clearer for nearest (#162423)
Correct the rounding scheme for nearest in quantile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162423
Approved by: https://github.com/soulitzer
2025-09-09 18:04:12 +00:00
e1be887870 [PP] Add spacing to visualizer (#160474)
When visualizing the schedules using `_PipelineScheduleExecution`, we don't provide any spacing between dependencies, so when visualizing `DualPipeV` it looks like this:

<img width="3168" height="486" alt="image" src="https://github.com/user-attachments/assets/d2c881ad-4ee0-46b6-ac03-13e5600b5a55" />

While it has the correct order of operations, it does not show the dependencies correctly. As shown in the original implementation, it should look something like this:

<img width="3542" height="384" alt="image" src="https://github.com/user-attachments/assets/c930fa98-848e-4951-a58b-c81f41092d14" />

This allows an option to add spacing to the visualizer, so it is easier to see dependencies. After change:

<img width="3633" height="486" alt="image" src="https://github.com/user-attachments/assets/7708367e-bdb4-46e8-a7c4-f19e18047f59" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160474
Approved by: https://github.com/fegin
2025-09-09 17:52:52 +00:00
d91eecc9a5 [inductor][template heuristics] don't take layout to generate choices (#162238)
# why

- unnecessary as we only ever need to know the dtype and maybe the
  device
- we already take in the kernel inputs which have the device
- enable us to specify the layout after finding all the configs
  but before generating the ChoiceCallers

# what

- change all calls in template_heuristics that used to take Layout
  so that they take just out_dtype

# testing

ci

Differential Revision: [D81820115](https://our.internmc.facebook.com/intern/diff/D81820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162238
Approved by: https://github.com/eellison
ghstack dependencies: #161347, #161348, #161349
2025-09-09 17:17:04 +00:00
24a4dae85b [inductor] V.choices.get_mm_configs override point (#161349)
# why

- enable us to override the default configs, or fall back to them
  through subclassing InductorChoices

# what

- override (private) function
- default implementation takes the kernel template choice (ktc)
  generator for every template and just executes the generator
- future overrides can decide to replace those generators, or filter
  out choices

- the 2nd expensive step (maybe_append_choices, choice_or_none) is
  handled outside this function, in the main V.choices.get_mm_configs.
  This means that any override benefits from not generating expensive
  templates that aren't going to be used
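
A rough sketch of the override idea described above, assuming a simplified model: `ToyKTC`, `ToyChoices`, `_select`, and `SkipTritonChoices` are hypothetical names invented here and do not mirror the real `InductorChoices` interface. The point is that an override can drop a template's lazy generator before it is ever executed.

```python
from typing import Dict, Iterator, List

# Records which generators were actually executed (for illustration only).
consumed: List[str] = []

class ToyKTC:
    """Hypothetical kernel template choice: holds what is needed to build a choice later."""
    def __init__(self, template: str, config: dict):
        self.template = template
        self.config = config

    def build(self) -> str:
        # Stand-in for the second, expensive step that produces a ChoiceCaller.
        return f"{self.template}:{self.config}"

def make_generator(name: str, n: int) -> Iterator[ToyKTC]:
    """Lazily yield n toy choices for one template."""
    for i in range(n):
        consumed.append(name)
        yield ToyKTC(name, {"BLOCK": 16 * (i + 1)})

class ToyChoices:
    def _select(self, generators: Dict[str, Iterator[ToyKTC]]) -> List[ToyKTC]:
        """Default behavior: execute every template's generator."""
        return [ktc for gen in generators.values() for ktc in gen]

    def get_mm_configs(self, generators: Dict[str, Iterator[ToyKTC]]) -> List[str]:
        # The expensive build step runs only on what _select kept.
        return [ktc.build() for ktc in self._select(generators)]

class SkipTritonChoices(ToyChoices):
    def _select(self, generators):
        # Drop a whole template before its generator runs.
        kept = {name: gen for name, gen in generators.items() if name != "triton_mm"}
        return super()._select(kept)
```

Because the filtered generator is never iterated, none of its configs are enumerated, which is the saving the description refers to.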

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520570](https://our.internmc.facebook.com/intern/diff/D81520570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161349
Approved by: https://github.com/eellison
ghstack dependencies: #161347, #161348
2025-09-09 17:17:04 +00:00
d3c4cf838e [inductor][ez] V.choices.get_mm_configs returns list of ChoiceCallers (#161348)
# why

- every callsite just executes the generator on the spot
- previous pr adds the ability to add an override before expensive
  generators are executed, so we don't need this generator anymore

# what

- rather than yielding the ChoiceCaller, just return the list of all
  valid ChoiceCallers

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520574](https://our.internmc.facebook.com/intern/diff/D81520574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161348
Approved by: https://github.com/eellison
ghstack dependencies: #161347
2025-09-09 17:16:57 +00:00
b1e99c8c7a [inductor] add kernel template choice (ktc) (#161347)
# why

- gather everything up to make choices, without running
  potentially expensive generators
- enables overrides where we toss the entire list of configs
  from inductor, without having to enumerate it (expensive)

# what

- add a holding class that just gets all the components necessary
  to generate a ChoiceCaller
- use that class to generate ChoiceCallers
- this does not (yet) add the override function, but just prepares
  the scene

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161347
Approved by: https://github.com/eellison
2025-09-09 17:16:50 +00:00
5eb35d2ab8 [CUDA][float8][TF32] Disable tf32 for vs. emulated rowwise comparison (#162387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162387
Approved by: https://github.com/Skylion007
2025-09-09 17:04:06 +00:00
f03d635dc6 [ROCm][CI] skip test_max_autotune until resolved (#162496)
many tests taking >30 min and causing timeouts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162496
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-09 16:34:01 +00:00
1f0b01d4b6 [ROCm] OffsetCalc Unroll Optimization (#161700)
Our compiler is generating inefficient code for the offsetCalc in certain situations.
The root-cause for this needs to be identified. For now specialized unrolling based on 'dims' notably helps perf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161700
Approved by: https://github.com/jeffdaily
2025-09-09 16:11:48 +00:00
c0142f5c06 [ROCm] Enabling several UTs (#161715)
All these UTs are working as is, just removing the skip
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@fb.com>
2025-09-09 15:49:21 +00:00
3ea6868049 [MPS] mps sparse mul op implementation (#162349)
Implements mps sparse mul operation as well as enables other operations such as:
1. copy_
2. div
3. sum
4. floor
5. power
6. sub
7. floor_divide

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162349
Approved by: https://github.com/pearu, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-09 15:45:37 +00:00
be3b8d2ec9 [ROCm][CI] update fbgemm nightly benchmark hash (#162385)
fbgemm_gpu was failing to clone due to missing submodule commit.
```
+ pushd fbgemm/fbgemm_gpu
~/pytorch/fbgemm/fbgemm_gpu ~/pytorch
+ git checkout 7f1de94a4c2d14f59ad4ca84538c36084ea6b2c8 --recurse-submodules
fatal: failed to unpack tree object b1281b8b08d973a7064f864f47eeb30f3e2596e9
error: Submodule 'external/composable_kernel' could not be updated.
error: Cannot update submodule:
	external/composable_kernel
```
Log File
[inductor-periodic · pytorch/pytorch@5babb4d](https://github.com/pytorch/pytorch/actions/runs/17536630806/job/49802458834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162385
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-09 15:44:39 +00:00
5ccf3ca3ec Revert "Use same NVSHMEM version across CUDA builds (#162206)"
This reverts commit 0d9c95cd7ee299e2e8c09df26d395be8775b506b.

Reverted https://github.com/pytorch/pytorch/pull/162206 on behalf of https://github.com/malfet due to Broke lint, see 4dd73e659a/1 ([comment](https://github.com/pytorch/pytorch/pull/162206#issuecomment-3271040521))
2025-09-09 14:40:45 +00:00
e38e953432 CUDA 13.0 Windows Nvidia Driver Update to 580.88 (#162425)
Related to https://github.com/pytorch/pytorch/issues/162333
https://github.com/pytorch/pytorch/issues/159779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162425
Approved by: https://github.com/tinglvv, https://github.com/malfet
2025-09-09 14:40:34 +00:00
4dd73e659a Revert "fix torch.sparse.log_softmax on CPU (#161959)"
This reverts commit 002e59440afe8711019e68df500f5e18b9a43f3c.

Reverted https://github.com/pytorch/pytorch/pull/161959 on behalf of https://github.com/davidberard98 due to test failure: test_sparse.py::TestSparseMPS::test_log_softmax_float_mps_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/17573794461/job/49915138287) [HUD commit link](002e59440a) ([comment](https://github.com/pytorch/pytorch/pull/161959#issuecomment-3270509418))
2025-09-09 12:33:25 +00:00
0d9c95cd7e Use same NVSHMEM version across CUDA builds (#162206)
#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20.
This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162206
Approved by: https://github.com/tinglvv, https://github.com/Skylion007
2025-09-09 08:52:27 +00:00
dcc42e95f4 Fix missing moves in initJITBindings (#162428)
Per @Skylion007 on #162219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162428
Approved by: https://github.com/Skylion007
2025-09-09 08:47:33 +00:00
002e59440a fix torch.sparse.log_softmax on CPU (#161959)
Fix https://github.com/pytorch/pytorch/issues/152293.

**Example:**
```
import torch
from torch.sparse import log_softmax as sparse_log_softmax

def test_bug():
    a = torch.rand(4, 3)
    b = a - 10000000.0
    b_sparse = b.to_sparse()

    cpu_out_sparse = sparse_log_softmax(b_sparse, dim=1).to_dense()
    print('cpu_out_sparse =', cpu_out_sparse)

    b_sparse_double = b.double().to_sparse()
    cpu_out_sparse_double = sparse_log_softmax(b_sparse_double, dim=1).to_dense()
    print('cpu_out_sparse_double =', cpu_out_sparse_double)

if __name__ == '__main__':
    test_bug()
```

**Output:**

- before
```
cpu_out_sparse = tensor([[-2., -1., -2.],
        [-1., -1., -1.],
        [-1., -2., -2.],
        [-1., -1., -2.]])
cpu_out_sparse_double = tensor([[-1.5514, -0.5514, -1.5514],
        [-1.0986, -1.0986, -1.0986],
        [-0.5514, -1.5514, -1.5514],
        [-0.8620, -0.8620, -1.8620]], dtype=torch.float64)
```

- after
```
cpu_out_sparse = tensor([[-0.8620, -1.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986],
        [-1.8620, -0.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986]])
cpu_out_sparse_double = tensor([[-0.8620, -1.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986],
        [-1.8620, -0.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986]], dtype=torch.float64)
```
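
For reference, a max-shifted log-softmax over one row can be sketched in plain Python. This is a generic illustration of the numerically stable formulation, not the code changed by this PR; `log_softmax` here is a local helper, not the `torch.sparse` function.

```python
import math

def log_softmax(row):
    """Max-shifted log-softmax for one row of floats."""
    m = max(row)
    shifted = [v - m for v in row]  # largest entry becomes 0
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_sum for s in shifted]

# Inputs far from zero, similar in spirit to `a - 10000000.0` above.
row = [-1e7, -1e7 + 1.0, -1e7]
out = log_softmax(row)
```

A quick sanity check for any implementation is that `exp` of each output row sums to 1.
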
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161959
Approved by: https://github.com/Skylion007
2025-09-09 06:25:16 +00:00
4840a1a591 Run vLLM tests on all trunk commits before 2.9 branch cut (#161797)
This makes it easier to bisect issues now, given that we don't have much time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161797
Approved by: https://github.com/yangw-dev
2025-09-09 05:56:41 +00:00
d49205fe1f Add more tests for vllm and clean out the old vllm test (#162292)
Test failure coverage from pytorch 2.8 release issues
[internal access only](https://docs.google.com/document/d/1zvK1eUAHubHGGHg9jKxd-QlP89fzgfqOBvE2m9mUs90/edit?tab=t.0)

See coverage mapping
| Given test / pattern | Suite ID (from config) |
|---|---|
| pytest -v -s basic_correctness/test_cumem.py | vllm_basic_correctness_test |
| pytest -v -s entrypoints/openai/test_sleep.py | vllm_entrypoints_test |
| pytest -v -s entrypoints/openai/test_translation_validation.py::test_long_audio_request | vllm_entrypoints_test |
| pytest -v -s lora/test_quant_model.py | vllm_lora_28_failure_test |
| pytest -v -s -x tests/lora/test_llama_tp.py | vllm_lora_tp_test_distributed |
| pytest -v -s distributed/test_sequence_parallel.py -k test_tp_sp_generation | vllm_distributed_test_28_failure_test |
| pytest -v -s distributed/test_sequence_parallel.py::test_tp_sp_generation[...] | vllm_distributed_test_28_failure_test |
| pytest models/language/generation/test_mistral.py::test_models[...] | vllm_languagde_model_test_extended_generation_28_failure_test |
| pytest models/multimodal/pooling/test_jinavl_reranker.py::test_model_text_image[...] | vllm_multi_model_test_28_failure_test |
| tests/lora/test_qwen2vl.py::test_qwen2vl_lora | vllm_lora_test |
| tests/lora/test_qwen2vl.py::test_qwen25vl_lora | vllm_lora_test |
| tests/lora/test_qwen2vl.py::test_qwen2vl_lora_beam_search | vllm_lora_test |
| tests/lora/test_phi.py::test_phi2_lora | DIDN'T FIND IT IN VLLM |
| models/multimodal/generation/test_voxtral.py::test_models_with_multiple_audios[5-128-half] | vllm_multi_model_test_28_failure_test |
| models/test_initialization.py::test_can_initialize[VoxtralForConditionalGeneration] | vllm_basic_models_test |
| pytest -v -s -x lora/test_chatglm3_tp.py -k test_chatglm3_lora_tp4_fully_sharded_loras | vllm_lora_tp_test_distributed |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162292
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-09-09 05:53:46 +00:00
d85392a88e Add BundledAOTAutogradSerializableCallable (#162170)
This PR hooks up the python wrapper inductor backend to aot_compile. This is *not* the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now.

In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162170
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162169
2025-09-09 05:42:19 +00:00
7feb8fc589 [SymmMEM] Allow to import _SymmetricMemory when NVSHMEM is not available (#162142)
Summary:
As we have multiple backends, _SymmetricMemory should not be imported together with NVSHMEM related modules

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162142
Approved by: https://github.com/dcci, https://github.com/kwen2501
2025-09-09 05:37:43 +00:00
60d009267e Revert "testing infra and some fixes (#162183)"
This reverts commit d8b6622bb6a3879d3832ab6cdc26ff4188ea4a2d.

Reverted https://github.com/pytorch/pytorch/pull/162183 on behalf of https://github.com/huydhn due to Failing a test on macos ([comment](https://github.com/pytorch/pytorch/pull/162183#issuecomment-3268922096))
2025-09-09 05:26:32 +00:00
4590438329 [fx] fix qualified name for methods of torch.Tensor (#162407)
This fixes an error in the previous PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162407
Approved by: https://github.com/ezyang, https://github.com/XuehaiPan
2025-09-09 05:14:43 +00:00
8494afb837 Add missing fstream include to fix std::ofstream compilation error (#162421)
## Summary
This PR adds a missing `#include <fstream>` to fix a compilation error that occurred with the clang compiler on the standard *Google internal compile setup* (built with bazel).

## Details
The `std::ofstream` type was implicitly instantiated, which can cause compilation to fail with certain compilers. In this case, the clang compiler within the Google internal compile setup failed with an implicit instantiation error of `std::basic_ofstream<char>`. By explicitly including the `<fstream>` header, this PR resolves the error and ensures proper compilation in a wider range of setups and compilers.

## Error message:
```
torch/csrc/distributed/c10d/FlightRecorder.cpp:8:17: error: implicit instantiation of undefined template 'std::basic_ofstream<char>'
8 | std::ofstream file(filename_, std::ios::binary);
| ^
libcxx/include/__fwd/fstream.h:26:7: note: template is declared here
26 | class basic_ofstream;
| ^
1 error generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162421
Approved by: https://github.com/ezyang
2025-09-09 05:14:32 +00:00
7ad40de60e [audio hash update] update the pinned audio hash (#162437)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162437
Approved by: https://github.com/pytorchbot
2025-09-09 04:41:34 +00:00
607327beae [vllm hash update] update the pinned vllm hash (#162356)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162356
Approved by: https://github.com/pytorchbot
2025-09-09 04:40:25 +00:00
f216d64bfe [SymmMem] Better tuning of A2AV based on accurate node boundary (#162003)
Use `world_within_direct_access()` to distinguish intra- vs inter- node.
Previously we assumed a fixed node size of 8, which is not true for NVL72.

Also added env var `TORCH_SYMMMEM_NBLOCKS` for control.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162003
Approved by: https://github.com/ngimel, https://github.com/fduwjj
2025-09-09 04:18:17 +00:00
847d7f21af [CUDA-13] Implement workaround for cudaErrorNotSupported (#162412)
See https://github.com/pytorch/pytorch/issues/162333#issuecomment-3267929585
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162412
Approved by: https://github.com/eqy, https://github.com/atalman
2025-09-09 04:12:10 +00:00
065c446193 [SymmMem] Use global pe for put and get (#162394)
NVSHMEM put/get APIs take global PE instead of local counterpart. So we'd need to do a translation within the kernel.
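
As a minimal sketch of the translation step, assuming the sub-group is described by an ordered list of global PE indices (the helper `to_global_pe` is hypothetical and is not the kernel code from this PR):

```python
def to_global_pe(group_ranks, local_pe):
    """Map a PE index within a sub-group to its global PE index.

    group_ranks: global PE indices of the sub-group members, in group order.
    local_pe:    index of the target PE within that sub-group.
    """
    if not 0 <= local_pe < len(group_ranks):
        raise IndexError(f"local PE {local_pe} is outside a group of size {len(group_ranks)}")
    return group_ranks[local_pe]
```

For a sub-group made of global PEs 4 through 7, local PE 1 maps to global PE 5, which is the value a put or get would need.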

Also added a sub-group test for dispatch and combine, mimicking the Expert Parallel cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162394
Approved by: https://github.com/ngimel, https://github.com/fegin
ghstack dependencies: #162320
2025-09-09 03:58:48 +00:00
98ecc0f374 [SymmMem] Add team pool to hold duplicated teams for the same rank group (#162320)
When multiple threadblocks call device-side collectives concurrently, NVSHMEM requires each call to be made on a separate team struct, see [Collective operations scopes and active sets](https://docs.nvidia.com/nvshmem/api/gen/api/collectives.html?highlight=nvshmem_barrier_all#collective-operations-scopes-and-active-sets).

This PR adds a util `get_n_teams` for creating duplicated NVSHMEM teams for the same rank group (i.e. a team pool), so that we can use them on the device side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162320
Approved by: https://github.com/ngimel
2025-09-09 03:58:48 +00:00
4c45090cf7 [DTensor] Check if tracing for sharding propagation to handle unhashable keys (#160798)
Fixes #159590

This is similar to the reverted commit #156868, except it resolves an issue with two caches becoming misaligned, leading to incorrect objects for stateful placements (i.e. `_MaskPartial`) as in issue #159601. This adds little to no overhead in eager ([see past benchmarks](https://github.com/pytorch/pytorch/pull/156868#issuecomment-3047831149)).

This also handles cases such as #159590 where dynamo is disabled during tracing by entering the Python Dispatcher ahead of the sharding propagation during compile. Tests are added/modified to handle these, and the list/tuple inputs with the cat op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160798
Approved by: https://github.com/bdhirsh
2025-09-09 03:52:05 +00:00
1641606aa4 Revert "Add BundledAOTAutogradSerializableCallable (#162170)"
This reverts commit 5babb4d5c04b1ff7ed5f96f7aea1898cd4faef5a.

Reverted https://github.com/pytorch/pytorch/pull/162170 on behalf of https://github.com/huydhn due to This PR has a merge conflict with D81793200 on aot_compile.py where PRs and diffs are landed in reverted order ([comment](https://github.com/pytorch/pytorch/pull/162170#issuecomment-3268735428))
2025-09-09 03:33:36 +00:00
7b8a64557d [inductor] fix 3d tiled online softmax (#162341)
The online_softmax_reduce runtime helper previously assumed that the input tl.Tensor's are 2d tensors. But with tiled reduction, they can be 3d (y, x, r).
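
For context, the combine step of online softmax can be sketched on plain Python scalars. This illustrates the general technique only; the function names are made up here, and the real helper operates on Triton blocks, where the same merge would be applied along the reduction axis of a (y, x, r) tile.

```python
import math

def online_softmax_combine(m_a, s_a, m_b, s_b):
    """Merge two partial (running max, sum of exp) states into one."""
    m = max(m_a, m_b)
    # Rescale each partial sum to the new shared maximum before adding.
    s = s_a * math.exp(m_a - m) + s_b * math.exp(m_b - m)
    return m, s

def online_softmax_stats(values, tile=2):
    """Accumulate (max, sum of exp(v - max)) over tiles of a non-empty list."""
    m, s = -math.inf, 0.0
    for start in range(0, len(values), tile):
        chunk = values[start:start + tile]
        m_c = max(chunk)
        s_c = sum(math.exp(v - m_c) for v in chunk)
        m, s = online_softmax_combine(m, s, m_c, s_c)
    return m, s
```

The result matches a single pass that first finds the global max and then sums `exp(v - max)`, without ever materializing all values at once.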

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162341
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162311
2025-09-09 02:59:52 +00:00
d8b6622bb6 testing infra and some fixes (#162183)
This PR is quite large in that it covers most of the rough edges in the new strict export flow:

1. Handle nn_module_stack correctly now that we are tracing a wrapper module.
2. module_call_spec needs to be queried from the source directly because we are not running the bytecode anymore.
3. Correct input and output handling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162167
2025-09-09 02:42:11 +00:00
8d5240d846 Fix lint
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-09-08 19:29:29 -07:00
135db45c9c Use more memory to build 0.13.0 torchao
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-09-08 19:12:37 -07:00
a965f09793 [export] Update PT2 archive docs (#162308)
Summary: Minor updates based on the recent refactoring for weight saving and loading

Test Plan:
doc change only

Rollback Plan:

Differential Revision: D81821994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162308
Approved by: https://github.com/angelayi
2025-09-09 02:08:13 +00:00
583bbf7761 [MPS] Add native_dropout and native_dropout_backward (#162108)
Fixes #162002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162108
Approved by: https://github.com/malfet
2025-09-09 01:44:06 +00:00
e025c0f459 Dynamo: set_eval_frame microoptimization (#162220)
Optimize for common case and remove a pair of refcount operations (see new comments.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162220
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219
2025-09-09 01:10:06 +00:00
a8a187b2cf Overload _get_operation_for_overload_or_packet & friends to accept ArrayRef (#162219)
Avoids requiring vector allocation to call this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162219
Approved by: https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634, #161692
2025-09-09 01:10:06 +00:00
12db2a7889 Call checkLong in is_int_or_symint, completing TODO (#161692)
Calling this first minimizes overhead for plain old ints, making cheap things cheap.

Differential Revision: [D81530098](https://our.internmc.facebook.com/intern/diff/D81530098)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161692
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634
2025-09-09 01:10:06 +00:00
eab2afeff7 fastpath type Tensor in THPVariable_NewWithVar (#161634)
It is cheap to do an exact check against Tensor and much faster when it works (PyType_IsSubtype does not have this fastpath, I checked [source](9ee0214b5d/Objects/typeobject.c (L2889))). Spot-checked in perf on detach-DTensor-in-a-loop benchmark; small win but clear.

Differential Revision: [D81530101](https://our.internmc.facebook.com/intern/diff/D81530101)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161634
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #161591, #161595, #161633
2025-09-09 01:10:06 +00:00
a951f435fd Avoid redundant PyTuple_GetSize call in _maybe_handle_torch_function (#161633)
py::args::size() calls PyTuple_GetSize. Compiler can't know the two calls will always return the same result, so we have to consolidate them ourselves.

Differential Revision: [D81530096](https://our.internmc.facebook.com/intern/diff/D81530096)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161633
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #161591, #161595
2025-09-09 01:10:06 +00:00
6eb14ac60f [Inductor] Fix cross-device scalar lowering - cpu scalar with cuda tensor fails in torch.compile (#161447)
This PR fixes a bug in TorchInductor where cross-device scalar indexing fails during compilation, causing discrepancies from eager mode behavior.

Fixes: #140457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161447
Approved by: https://github.com/mlazos
2025-09-09 01:07:02 +00:00
ed77e23b68 Revert "[dynamo] Constant fold torch.autograd._profiler_enabled (#158482)"
This reverts commit d7e1b8b11d7430c7633dcad6f6596b5df8fa02f7.

Reverted https://github.com/pytorch/pytorch/pull/158482 on behalf of https://github.com/borgstrom due to NCCL hangs in S560336 ([comment](https://github.com/pytorch/pytorch/pull/158482#issuecomment-3268426781))
2025-09-09 00:21:05 +00:00
897c4e70a7 Move to small wheel approach for CUDA SBSA wheel (#160720)
https://github.com/pytorch/pytorch/issues/160673

Use download.pytorch.org's dependencies like x86 build instead of bundling libs into the wheel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160720
Approved by: https://github.com/atalman
2025-09-09 00:18:43 +00:00
8485aac873 [precompile] Fix inlined source tracking with generators. (#162389)
Summary:
When compiled code has a generator, code.co_firstlineno will be inconsistent with the result from inspect.getsource, which returns the toplevel enclosing code source rather than the inner code location.

In this case, it seems simpler to just use the toplevel enclosing code location rather than the co_firstlineno field.
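
The mismatch can be seen with a small standalone example: a generator expression gets its own code object whose `co_firstlineno` points at the inner expression, not at the enclosing function. The function `outer` below is purely illustrative and unrelated to the precompile code.

```python
def outer():
    offset = 10
    return (i + offset for i in range(3))

gen = outer()

# Line where the enclosing function starts.
outer_line = outer.__code__.co_firstlineno
# Line recorded on the nested generator's own code object.
inner_line = gen.gi_code.co_firstlineno
inner_name = gen.gi_code.co_name  # "<genexpr>"
```

Here `inner_line` is larger than `outer_line`, so tracking source by the inner code object's first line would disagree with a lookup based on the enclosing function.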

Test Plan:
test_package.py -k test_code_with_generator

Rollback Plan:

Differential Revision: D81929751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162389
Approved by: https://github.com/dolpm, https://github.com/hrithick-codes
2025-09-09 00:13:54 +00:00
c0fc86b511 Fix aarch64 wheel pack (#159481)
PR that introduced the change: https://github.com/pytorch/builder/pull/1775
Use wheel pack instead of zip to repack the wheel.
It should regenerate the RECORD file and update all the hashes correctly.

TODO:
Apply wheel pack instead of zip to the rest of the builds
Add a validation test to make sure the wheel contents match the RECORD file
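
A validation check along the lines of the TODO could look roughly like the following. This is a sketch, not the planned test: `record_entry` and `mismatches` are hypothetical helpers, the wheel is modeled as an in-memory dict of path to bytes, and CSV quoting of unusual paths is ignored. It relies on the RECORD convention of a urlsafe base64 sha256 digest with padding stripped.

```python
import base64
import hashlib

def record_entry(path: str, data: bytes) -> str:
    """Build one RECORD line: path, sha256 digest (urlsafe base64, no padding), size."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"

def mismatches(files: dict, record_lines: list) -> list:
    """Return paths whose RECORD line is missing or differs from a fresh computation."""
    recorded = {line.split(",", 1)[0]: line for line in record_lines}
    return [p for p, data in files.items() if recorded.get(p) != record_entry(p, data)]
```

Repacking with plain zip after modifying a file would leave a stale line that `mismatches` reports, whereas a regenerated RECORD yields an empty list.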

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159481
Approved by: https://github.com/malfet
2025-09-08 23:36:50 +00:00
07f07309c6 [associative_scan] Autograd separated (#139939)
This PR implements the Autograd feature of the associative_scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139939
Approved by: https://github.com/huydhn
2025-09-08 23:30:11 +00:00
189a054cfb Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. [attempt2] (#160869)
[relanding again after fixing internal build]
Summary:
This might cause some new DDEs at call sites that do not use is_contiguous_or_false() or sym_is_contiguous(),
but we want to find those call sites and handle them properly by explicitly calling is_contiguous_or_false() instead of is_contiguous() when appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context:

In https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous to be the function that computes contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE.

when people call is_contiguous we do sym_is_contiguous().guard_bool()
when people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false()
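
The difference between the two guards can be sketched with a toy model. Everything here (`ToySymBool`, `ToyTensor`, `DataDependentError`) is a hypothetical stand-in written for illustration; it only mimics the behavior described above, where an undecidable contiguity check either raises or falls back to False.

```python
class DataDependentError(RuntimeError):
    """Toy stand-in for a DDE raised when a symbolic value cannot be decided."""

class ToySymBool:
    def __init__(self, value=None):
        # None models "unknown at trace time".
        self.value = value

    def guard_bool(self):
        if self.value is None:
            raise DataDependentError("contiguity is not statically known")
        return self.value

    def guard_or_false(self):
        # Never raises: an unknown answer is treated as False.
        return False if self.value is None else self.value

class ToyTensor:
    def __init__(self, contiguity=None):
        self._contiguity = ToySymBool(contiguity)

    def sym_is_contiguous(self):
        return self._contiguity

    def is_contiguous(self):
        return self.sym_is_contiguous().guard_bool()

    def is_contiguous_or_false(self):
        return self.sym_is_contiguous().guard_or_false()
```

A call site that only uses contiguity to pick a fast path can safely use the `_or_false` variant, since a False answer just selects the general path.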

one issue not handled well was this path
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }

  return sym_is_contiguous_default(memory_format);
}
```
Namely, if we call sym_is_contiguous_custom and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, then we used to call is_contiguous(this, memory_format);

This used to go through load_pyobj_interpreter and end up calling the Python is_contiguous, which used implicit size-oblivious reasoning.
Once we removed that implicit size-oblivious reasoning, the right thing to do is to call
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get a DDE even if the caller is using sym_is_contiguous.

So I had to define it for the pyinterpreter, and then I had to override it for nested tensors.

Approved by: https://github.com/ezyang

Test Plan:
contbuild & OSS CI, see e444cd24d4

Rollback Plan:

Differential Revision: D80435179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869
Approved by: https://github.com/ezyang
2025-09-08 22:59:13 +00:00
5fd6b6a2db [refactor] add helper sizevars function, is_size_one, for size==1 checks (#162189)
## Summary
- document guard behavior in `SizeVarAllocator.is_size_one`
- use `is_size_one` for broadcast/expand checks.
- This diff is a no-op since we'd use `shape_env.evaluate_expr(... fallback_value=False)`

a4f9132a17/torch/_inductor/sizevars.py (L450-L453)

------
https://chatgpt.com/codex/tasks/task_e_68b8d0d1f2c48328b2d38c00e738bc8c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162189
Approved by: https://github.com/laithsakka
2025-09-08 22:48:16 +00:00
ac9ccd0dc2 Add return-max-scores to flex-attention (#161667)
# Summary

### Update

API

```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False

class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None

out_only = flex_attention(query, key, value, score_mod)
out_max, aux_max = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(max_scores=True),
)
out_both, aux_both = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(lse=True, max_scores=True),
)
```

Returns the max post mod scores from flex attention.

Not being able to break BC is kind of annoying here, since we end up with a combinatorial problem: if we need to add any more return values, we need new kwargs that gate whether they get returned by the function, and we need to support the 2**N possible return groups.

Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. For now I made the new argument keyword-only.

Maybe we make an `ExtraReturns` type kwarg that can grow, so we don't need to keep adding new top-level args.

We could also return a struct that holds all the extra tensors and start a deprecation cycle for logsumexp, eventually returning just one `ExtraReturns`-like struct with the tensors.

### Req Grad
I currently don't return a max_scores that supports backpropagating grads. I think this might be feasible, but since max is essentially one-hot on the inputs plus a reduction, we would either need to save another `max_location` from the forward, or find the max score again and apply the gradient only to the first occurrence if there are multiple equal scores (need to check whether that is how the vanilla max op in torch is defined).

For now no grad, we can re-visit if needed.
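
A minimal sketch of the "save `max_location`" option mentioned above, in plain Python. The helper names are hypothetical, and the first-occurrence tie-breaking is only the convention floated in this message; it has not been checked against what torch's max op does.

```python
from typing import List, Tuple


def max_forward(scores: List[float]) -> Tuple[float, int]:
    """Return the max score and the index of its first occurrence."""
    max_location = 0
    for i, s in enumerate(scores):
        if s > scores[max_location]:  # strict '>' keeps the first of equal scores
            max_location = i
    return scores[max_location], max_location


def max_backward(grad_output: float, num_scores: int, max_location: int) -> List[float]:
    """Route the incoming gradient to the saved location; all other inputs get zero."""
    return [grad_output if i == max_location else 0.0 for i in range(num_scores)]
```

The backward is a one-hot scatter, which is why the forward would have to save one extra index per row.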

## Perf
I am going to disable this for flex_decode, since at least initially the motivation is training. It is also harder than it should be to have ops return None or optional tensors. If return max is false, we should probably just create a tensor of size zero so that we don't slow down the hot path.

```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,     ┆ 249.514658    ┆ 243.078974   ┆ 6.435684  ┆ 2.647569  │
│                ┆                ┆ 2048, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 57.971274     ┆ 56.633641    ┆ 1.337633  ┆ 2.361905  │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 280.71254     ┆ 275.686991   ┆ 5.025549  ┆ 1.822918  │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,    ┆ 152.970031    ┆ 150.489109   ┆ 2.480923  ┆ 1.648573  │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘

🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)  ┆ TFlops (base) ┆ TFlops (max) ┆ delta    ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                    ┆ ---           ┆ ---          ┆ ---      ┆ ---       │
│ str            ┆ str            ┆ str                    ┆ f64           ┆ f64          ┆ f64      ┆ f64       │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,      ┆ 249.514658    ┆ 243.078974   ┆ 6.435684 ┆ 2.647569  │
│                ┆                ┆ 2048, 64)              ┆               ┆              ┆          ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 57.971274     ┆ 56.633641    ┆ 1.337633 ┆ 2.361905  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 280.71254     ┆ 275.686991   ┆ 5.025549 ┆ 1.822918  │
│                ┆                ┆ 1024, 128)             ┆               ┆              ┆          ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,     ┆ 152.970031    ┆ 150.489109   ┆ 2.480923 ┆ 1.648573  │
│                ┆                ┆ 16384, 64)             ┆               ┆              ┆          ┆           │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,      ┆ 161.031318    ┆ 158.597808   ┆ 2.43351  ┆ 1.534391  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘

🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4,      ┆ 175.546923    ┆ 177.81205    ┆ -2.265127 ┆ -1.273888 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4,     ┆ 156.282597    ┆ 158.209134   ┆ -1.926537 ┆ -1.217715 │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16,     ┆ 232.542929    ┆ 235.140136   ┆ -2.597207 ┆ -1.104536 │
│                ┆                ┆ 2048, 128)            ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 169.652791    ┆ 171.475986   ┆ -1.823195 ┆ -1.063236 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```
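
The `delta` and `pct_delta` columns are not defined in the message. From the numbers they appear to be `base - max` and `100 * delta / max`, so a positive value would mean the max-scores variant ran slower. The sketch below encodes that inferred relationship; the function name is made up for illustration.

```python
from typing import Tuple


def tflops_delta(base: float, with_max: float) -> Tuple[float, float]:
    """Return (delta, pct_delta) as the tables above appear to compute them."""
    delta = base - with_max
    pct_delta = 100.0 * delta / with_max
    return delta, pct_delta
```

For the first causal row, `tflops_delta(249.514658, 243.078974)` reproduces the table's delta and percentage to the printed precision.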

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2025-09-08 22:44:48 +00:00
711c8c821e shape guards (#161178)
Summary: This PR introduces shape guards to export. Previously, only value ranges, equalities, and specializations were tracked for symbolic expressions, and a forward hook checked them. Now we instead create a function that checks shape guards and call it in the exported program.
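
A rough sketch of the "guard function called from the program" idea, using plain lists as inputs. `make_guard_fn` and `ToyExportedProgram` are hypothetical names for this illustration and do not reflect how torch.export builds or wires its guard function.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Shapes = Dict[str, Tuple[int, ...]]
Guard = Tuple[str, Callable[[Shapes], bool]]


def make_guard_fn(guards: Sequence[Guard]) -> Callable[[Shapes], None]:
    """Build one function that checks every shape guard and raises on the first failure."""

    def check(shapes: Shapes) -> None:
        for description, predicate in guards:
            if not predicate(shapes):
                raise AssertionError(f"shape guard failed: {description}")

    return check


class ToyExportedProgram:
    """Calls the guard function on the input shapes before running the wrapped callable."""

    def __init__(self, fn: Callable[..., List[float]], guards: Sequence[Guard]) -> None:
        self._fn = fn
        self._check = make_guard_fn(guards)

    def __call__(self, **inputs: List[float]) -> List[float]:
        # Inputs are 1-D lists here, so each shape is just (len,).
        self._check({name: (len(value),) for name, value in inputs.items()})
        return self._fn(**inputs)


# Example guards: an equality between two dims and a lower bound on one of them.
example_guards: List[Guard] = [
    ("x.shape[0] == y.shape[0]", lambda s: s["x"][0] == s["y"][0]),
    ("x.shape[0] >= 2", lambda s: s["x"][0] >= 2),
]
add = ToyExportedProgram(lambda x, y: [a + b for a, b in zip(x, y)], example_guards)
```

Because the check runs inside the call, arbitrary relations between dimensions can be asserted, not only ranges and equalities.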

Test Plan:
updated several tests

Rollback Plan:

Differential Revision: D80713603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161178
Approved by: https://github.com/tugsbayasgalan
2025-09-08 22:44:09 +00:00
2c538c9acf rewrite __maybe_broadcast should_expand check for unbacked (#162109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162109
Approved by: https://github.com/aorenste
ghstack dependencies: #162084, #162099
2025-09-08 22:41:18 +00:00
85fe94e933 make should_swap more dde friendly (#162099)
Unblock customers for common cases with DDE until @pianpwk lands the change to should_swap: https://github.com/pytorch/pytorch/pull/160473.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162099
Approved by: https://github.com/aorenste
ghstack dependencies: #162084
2025-09-08 22:41:18 +00:00
fecd9686f5 Graph split event tracker (#159795)
Summary:
A tool to track events in graph split, specifically how nodes end up in acc or cpu subgraphs.

Usage: use env var to specify a mode and necessary arguments.

FX_NET_ACC_SPLITTER_TRACKER_MODE: Tracker mode.
```
Different modes of the event tracker:
"0": Tracker not enabled (by default)
"1": Tracker enabled but no dumps. Information is available by setting breakpoints and inspecting it in pdb.
"2": Tracker enabled and dumps all events to DUMP_PREFIX_all.txt
"3": In addition to events dump, track nodes specified by ENV_FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES recursively and dump to DUMP_PREFIX_nodex.txt
"4": In addition to events dump, track all nodes with more than 1 event recursively and dump to DUMP_PREFIX_nodex.txt
```
FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH: overriding dump path. Leave empty for `~`.
FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES: Nodes to track for mode "3".
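
A small sketch of reading these variables, assuming only what is listed above. `TrackerConfig` and `read_tracker_config` are hypothetical helpers, not the splitter's code; the tracked-nodes value is kept as a raw string because its delimiter is not specified here, and `~` is returned unexpanded.

```python
import os
from typing import Mapping, NamedTuple

VALID_MODES = ("0", "1", "2", "3", "4")


class TrackerConfig(NamedTuple):
    mode: str
    dump_path: str
    tracked_nodes: str


def read_tracker_config(env: Mapping[str, str] = os.environ) -> TrackerConfig:
    """Collect the three tracker variables, applying the documented defaults."""
    mode = env.get("FX_NET_ACC_SPLITTER_TRACKER_MODE", "0")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown tracker mode: {mode!r}")
    # An empty or missing dump path means "~", per the description above.
    dump_path = env.get("FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH", "") or "~"
    tracked_nodes = env.get("FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES", "")
    return TrackerConfig(mode, dump_path, tracked_nodes)
```

In practice the variables would be exported in the shell (for example `FX_NET_ACC_SPLITTER_TRACKER_MODE=2`) before launching the process that runs the splitter.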

Test Plan: New unit test

Reviewed By: georgiaphillips

Differential Revision: D79203595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159795
Approved by: https://github.com/ezyang
2025-09-08 21:30:17 +00:00
dd44faa9d9 Revert "Modify ROCm MI2xx-based workflows to run on cron schedule (#162103)"
This reverts commit 0af70e2353e1dcda83175fd4834ecb7b63e009e0.

Reverted https://github.com/pytorch/pytorch/pull/162103 on behalf of https://github.com/jithunnair-amd due to Cirrascale network outage resolved. Reverting back to running per commit to aid in triage and CI health ([comment](https://github.com/pytorch/pytorch/pull/162103#issuecomment-3267977825))
2025-09-08 20:53:05 +00:00
5d819f3faf Revert "[associative_scan] Autograd separated (#139939)"
This reverts commit 103f725afa8dbf0204a1be6a042ab93aa16d85d8.

Reverted https://github.com/pytorch/pytorch/pull/139939 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing a weird failure after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/139939#issuecomment-3267945657))
2025-09-08 20:42:47 +00:00
015423bef8 Add fp16-overflow regression test (#162401)
Discovered while debugging https://github.com/pytorch/pytorch/issues/160841, where sdpa returned NaNs because intermediate values were cast back to fp16 before normalization during the computation (fixed by https://github.com/pytorch/pytorch/pull/161999).
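
A toy illustration of this failure class, not the sdpa kernel and not necessarily the exact path hit in the linked issue. It simulates an fp16 cast with the standard library's half-precision `struct` format and compares casting before versus after normalization; all function names are made up for this sketch.

```python
import math
import struct
from typing import List


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def softmax_cast_early(scores: List[float]) -> List[float]:
    """Cast the unnormalized exponentials to fp16 first, then normalize."""
    weights = [to_fp16(math.exp(s)) for s in scores]
    denom = to_fp16(sum(weights))
    return [w / denom for w in weights]  # inf / inf yields NaN


def softmax_cast_late(scores: List[float]) -> List[float]:
    """Normalize in higher precision, then cast the result to fp16."""
    row_max = max(scores)
    exps = [math.exp(s - row_max) for s in scores]
    denom = sum(exps)
    return [to_fp16(e / denom) for e in exps]
```

With scores `[12.0, 11.0]`, `exp(12)` exceeds the largest finite fp16 value (65504), so the early-cast version overflows while the late-cast version stays finite.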
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162401
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-09-08 20:33:23 +00:00
26a1b9cce2 [dynamo] fix resume_execution.py KeyError in Python 3.11+ (#162318)
Fixes https://github.com/pytorch/pytorch/issues/162313

Differential Revision: [D81938289](https://our.internmc.facebook.com/intern/diff/D81938289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162318
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/anijain2305
2025-09-08 20:26:24 +00:00
8f114650eb Add std::any_of to ConvParams struct (#162334)
Removes some for-loops that didn't short-circuit in favor of std::any_of.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162334
Approved by: https://github.com/Skylion007
2025-09-08 20:12:20 +00:00
ec2c1371af [BE]: Update cudnn frontend submodule to 1.14.1 (#162347)
Fixes a few bugs introduced in CUDNN 1.11 that affect all our CUDA13 builds. Also adds support for new CUDNN features whenever we choose to update. @eqy pretty sure this addresses the concern you had over the previous upgrade, since that bugfix is now merged. This is a simple header-only update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162347
Approved by: https://github.com/eqy, https://github.com/atalman
2025-09-08 20:03:23 +00:00
8ec01f34e9 [bucketing] custom_ops mode to hide inductor copies overhead (#161499)
Adds a "_custom_ops" bucketing mode that temporarily falls back to eager execution of for_each, to work around too many generated kernels on the inductor side.

This PR also reverts parts of bucketing changes for cycles detection that resulted in accuracy problems.

Differential Revision: [D81152293](https://our.internmc.facebook.com/intern/diff/D81152293)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161499
Approved by: https://github.com/eellison
2025-09-08 20:03:08 +00:00
9c991b63ff [CD] [aarch64] Add CUDA 12.6 and 12.8 to build matrix, remove 12.9 build (#162364)
https://github.com/pytorch/pytorch/issues/159779

Add the full CUDA support matrix to sbsa build (12.6, 12.8)
Same arch support as x86 build
Remove 12.9 sbsa build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162364
Approved by: https://github.com/atalman
2025-09-08 20:00:25 +00:00
8139b6b1b1 Test torchao build
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-09-08 02:43:16 -07:00
24c95d83e6 Bump torchao pinned commit
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-09-07 22:11:18 -07:00
21a34fa017 Merge branch 'main' into install-torchao-0.13.0 2025-09-07 22:06:33 -07:00
636d3aa00f Tiny comment update
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-09-06 23:13:43 -07:00
174f2faa8c Put torchao (0.13.0) back to benchmark workflow
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-09-04 17:26:03 -07:00
715 changed files with 9670 additions and 3129 deletions


@ -3,12 +3,13 @@ set -eux -o pipefail
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
if [[ "$GPU_ARCH_VERSION" == *"12.9"* ]]; then
# Set CUDA architecture lists to match x86 build_cuda.sh
if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;12.0"
fi
if [[ "$GPU_ARCH_VERSION" == *"13.0"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;11.0;12.0"
elif [[ "$GPU_ARCH_VERSION" == *"13.0"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;11.0;12.0+PTX"
fi
# Compress the fatbin with -compress-mode=size for CUDA 13
@ -27,14 +28,22 @@ cd /
# on the mounted pytorch repo
git config --global --add safe.directory /pytorch
pip install -r /pytorch/requirements.txt
pip install auditwheel==6.2.0
pip install auditwheel==6.2.0 wheel
if [ "$DESIRED_CUDA" = "cpu" ]; then
echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
else
echo "BASE_CUDA_VERSION is set to: $DESIRED_CUDA"
export USE_SYSTEM_NCCL=1
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
# Check if we should use NVIDIA libs from PyPI (similar to x86 build_cuda.sh logic)
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling CUDA libraries with wheel for aarch64."
else
echo "Using nvidia libs from pypi for aarch64."
echo "Updated PYTORCH_EXTRA_INSTALL_REQUIREMENTS for aarch64: $PYTORCH_EXTRA_INSTALL_REQUIREMENTS"
export USE_NVIDIA_PYPI_LIBS=1
fi
python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
fi
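
The flattened hunk above interleaves removed and added lines, so the resulting arch selection is easy to misread. One plausible reading of the post-change logic, consistent with the "Add CUDA 12.6 and 12.8, remove 12.9" commit earlier in this log, is sketched below in Python. The function name is hypothetical and the mapping is an interpretation of the hunk, not a copy of the script.

```python
from typing import Optional

# (substring of GPU_ARCH_VERSION, TORCH_CUDA_ARCH_LIST), checked in order.
ARCH_LISTS = [
    ("12.6", "8.0;9.0"),
    ("12.8", "8.0;9.0;10.0;12.0"),
    ("13.0", "8.0;9.0;10.0;11.0;12.0+PTX"),
]


def torch_cuda_arch_list(gpu_arch_version: str) -> Optional[str]:
    """Return the arch list for the first matching CUDA version, else None."""
    for needle, arch_list in ARCH_LISTS:
        if needle in gpu_arch_version:
            return arch_list
    return None  # the shell script leaves the variable unset in this case
```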


@ -13,49 +13,6 @@ def list_dir(path: str) -> list[str]:
return check_output(["ls", "-1", path]).decode().split("\n")
def build_ArmComputeLibrary() -> None:
"""
Using ArmComputeLibrary for aarch64 PyTorch
"""
print("Building Arm Compute Library")
acl_build_flags = [
"debug=0",
"neon=1",
"opencl=0",
"os=linux",
"openmp=1",
"cppthreads=0",
"arch=armv8a",
"multi_isa=1",
"fixed_format_kernels=1",
"build=native",
]
acl_install_dir = "/acl"
acl_checkout_dir = os.getenv("ACL_SOURCE_DIR", "ComputeLibrary")
if os.path.isdir(acl_install_dir):
shutil.rmtree(acl_install_dir)
if not os.path.isdir(acl_checkout_dir) or not len(os.listdir(acl_checkout_dir)):
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
]
)
check_call(
["scons", "Werror=1", f"-j{os.cpu_count()}"] + acl_build_flags,
cwd=acl_checkout_dir,
)
for d in ["arm_compute", "include", "utils", "support", "src", "build"]:
shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")
def replace_tag(filename) -> None:
with open(filename) as f:
lines = f.readlines()
@ -69,83 +26,186 @@ def replace_tag(filename) -> None:
f.writelines(lines)
def patch_library_rpath(
folder: str,
lib_name: str,
use_nvidia_pypi_libs: bool = False,
desired_cuda: str = "",
) -> None:
"""Apply patchelf to set RPATH for a library in torch/lib"""
lib_path = f"{folder}/tmp/torch/lib/{lib_name}"
if use_nvidia_pypi_libs:
# For PyPI NVIDIA libraries, construct CUDA RPATH
cuda_rpaths = [
"$ORIGIN/../../nvidia/cudnn/lib",
"$ORIGIN/../../nvidia/nvshmem/lib",
"$ORIGIN/../../nvidia/nccl/lib",
"$ORIGIN/../../nvidia/cusparselt/lib",
]
if "130" in desired_cuda:
cuda_rpaths.append("$ORIGIN/../../nvidia/cu13/lib")
else:
cuda_rpaths.extend(
[
"$ORIGIN/../../nvidia/cublas/lib",
"$ORIGIN/../../nvidia/cuda_cupti/lib",
"$ORIGIN/../../nvidia/cuda_nvrtc/lib",
"$ORIGIN/../../nvidia/cuda_runtime/lib",
"$ORIGIN/../../nvidia/cufft/lib",
"$ORIGIN/../../nvidia/curand/lib",
"$ORIGIN/../../nvidia/cusolver/lib",
"$ORIGIN/../../nvidia/cusparse/lib",
"$ORIGIN/../../nvidia/nvtx/lib",
"$ORIGIN/../../nvidia/cufile/lib",
]
)
# Add $ORIGIN for local torch libs
rpath = ":".join(cuda_rpaths) + ":$ORIGIN"
else:
# For bundled libraries, just use $ORIGIN
rpath = "$ORIGIN"
if os.path.exists(lib_path):
os.system(
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '{rpath}' --force-rpath {lib_name}"
)
def copy_and_patch_library(
src_path: str,
folder: str,
use_nvidia_pypi_libs: bool = False,
desired_cuda: str = "",
) -> None:
"""Copy a library to torch/lib and patch its RPATH"""
if os.path.exists(src_path):
lib_name = os.path.basename(src_path)
shutil.copy2(src_path, f"{folder}/tmp/torch/lib/{lib_name}")
patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)
def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"""
Package the cuda wheel libraries
"""
folder = os.path.dirname(wheel_path)
wheelname = os.path.basename(wheel_path)
os.mkdir(f"{folder}/tmp")
os.system(f"unzip {wheel_path} -d {folder}/tmp")
# Common libraries for all CUDA versions
common_libs = [
# Non-NVIDIA system libraries
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
# Common CUDA libraries (same for all versions)
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnccl.so.2",
"/usr/local/cuda/lib64/libnvshmem_host.so.3",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
"/usr/local/cuda/lib64/libcudnn_ops.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
"/usr/local/cuda/lib64/libcusparse.so.12",
]
# Delete original wheel since it will be repackaged
os.system(f"rm {wheel_path}")
# CUDA version-specific libraries
if "130" in desired_cuda:
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.13",
"/usr/local/cuda/lib64/libcublas.so.13",
"/usr/local/cuda/lib64/libcublasLt.so.13",
"/usr/local/cuda/lib64/libcudart.so.13",
"/usr/local/cuda/lib64/libcufft.so.12",
"/usr/local/cuda/lib64/libcusolver.so.12",
"/usr/local/cuda/lib64/libnvJitLink.so.13",
"/usr/local/cuda/lib64/libnvrtc.so.13",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.13.0",
]
elif "12" in desired_cuda:
# Get the last character for libnvrtc-builtins version (e.g., "129" -> "9")
minor_version = desired_cuda[-1]
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
"/usr/local/cuda/lib64/libcudart.so.12",
"/usr/local/cuda/lib64/libcufft.so.11",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
f"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.{minor_version}",
# Check if we should use PyPI NVIDIA libraries or bundle system libraries
use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"
if use_nvidia_pypi_libs:
print("Using nvidia libs from pypi - skipping CUDA library bundling")
# For PyPI approach, we don't bundle CUDA libraries - they come from PyPI packages
# We only need to bundle non-NVIDIA libraries
minimal_libs_to_copy = [
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
# Combine all libraries
libs_to_copy = common_libs + version_specific_libs
# Copy minimal libraries to unzipped_folder/torch/lib
for lib_path in minimal_libs_to_copy:
copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)
# Copy libraries to unzipped_folder/a/lib
for lib_path in libs_to_copy:
lib_name = os.path.basename(lib_path)
shutil.copy2(lib_path, f"{folder}/tmp/torch/lib/{lib_name}")
os.system(
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '$ORIGIN' --force-rpath {folder}/tmp/torch/lib/{lib_name}"
)
# Patch torch libraries used for searching libraries
torch_libs_to_patch = [
"libtorch.so",
"libtorch_cpu.so",
"libtorch_cuda.so",
"libtorch_cuda_linalg.so",
"libtorch_global_deps.so",
"libtorch_python.so",
"libtorch_nvshmem.so",
"libc10.so",
"libc10_cuda.so",
"libcaffe2_nvrtc.so",
"libshm.so",
]
for lib_name in torch_libs_to_patch:
patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)
else:
print("Bundling CUDA libraries with wheel")
# Original logic for bundling system CUDA libraries
# Common libraries for all CUDA versions
common_libs = [
# Non-NVIDIA system libraries
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
# Common CUDA libraries (same for all versions)
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnccl.so.2",
"/usr/local/cuda/lib64/libnvshmem_host.so.3",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
"/usr/local/cuda/lib64/libcudnn_ops.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
"/usr/local/cuda/lib64/libcusparse.so.12",
]
# CUDA version-specific libraries
if "13" in desired_cuda:
minor_version = desired_cuda[-1]
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.13",
"/usr/local/cuda/lib64/libcublas.so.13",
"/usr/local/cuda/lib64/libcublasLt.so.13",
"/usr/local/cuda/lib64/libcudart.so.13",
"/usr/local/cuda/lib64/libcufft.so.12",
"/usr/local/cuda/lib64/libcusolver.so.12",
"/usr/local/cuda/lib64/libnvJitLink.so.13",
"/usr/local/cuda/lib64/libnvrtc.so.13",
f"/usr/local/cuda/lib64/libnvrtc-builtins.so.13.{minor_version}",
]
elif "12" in desired_cuda:
# Get the last character for libnvrtc-builtins version (e.g., "129" -> "9")
minor_version = desired_cuda[-1]
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
"/usr/local/cuda/lib64/libcudart.so.12",
"/usr/local/cuda/lib64/libcufft.so.11",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
f"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.{minor_version}",
]
else:
raise ValueError(f"Unsupported CUDA version: {desired_cuda}.")
# Combine all libraries
libs_to_copy = common_libs + version_specific_libs
# Copy libraries to unzipped_folder/torch/lib
for lib_path in libs_to_copy:
copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)
# Make sure the wheel is tagged with manylinux_2_28
for f in os.scandir(f"{folder}/tmp/"):
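
To see what `patch_library_rpath` passes to `patchelf`, the RPATH construction can be pulled out into a pure function. `build_rpath` is a hypothetical refactoring for illustration that mirrors the directory names in the hunk above; it does not call `patchelf`.

```python
def build_rpath(use_nvidia_pypi_libs: bool = False, desired_cuda: str = "") -> str:
    """Return the RPATH string for a library placed in torch/lib."""
    if not use_nvidia_pypi_libs:
        # Bundled libraries sit next to the torch libraries.
        return "$ORIGIN"
    packages = ["cudnn", "nvshmem", "nccl", "cusparselt"]
    if "130" in desired_cuda:
        # CUDA 13 wheels use a single consolidated directory.
        packages.append("cu13")
    else:
        packages.extend(
            [
                "cublas", "cuda_cupti", "cuda_nvrtc", "cuda_runtime", "cufft",
                "curand", "cusolver", "cusparse", "nvtx", "cufile",
            ]
        )
    cuda_rpaths = [f"$ORIGIN/../../nvidia/{name}/lib" for name in packages]
    # $ORIGIN comes last so local torch libs are still found.
    return ":".join(cuda_rpaths) + ":$ORIGIN"
```

Keeping the string construction separate from the `patchelf` call would also make it straightforward to unit test.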
@ -153,14 +213,8 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:
replace_tag(f"{f.path}/WHEEL")
break
os.mkdir(f"{folder}/cuda_wheel")
os.system(f"cd {folder}/tmp/; zip -r {folder}/cuda_wheel/{wheelname} *")
shutil.move(
f"{folder}/cuda_wheel/{wheelname}",
f"{folder}/{wheelname}",
copy_function=shutil.copy2,
)
os.system(f"rm -rf {folder}/tmp/ {folder}/cuda_wheel/")
os.system(f"wheel pack {folder}/tmp/ -d {folder}")
os.system(f"rm -rf {folder}/tmp/")
def complete_wheel(folder: str) -> str:
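
The hunk above swaps a plain `zip` for `wheel pack`. The commit does not state the motivation, but one relevant difference is that a wheel's `RECORD` file lists a hash and size for every file, and `wheel pack` regenerates it while plain zipping leaves it stale after libraries are added or patched. The sketch below shows the entry format; `record_entry` is a hypothetical helper, not part of the build script.

```python
import base64
import hashlib


def record_entry(archive_path: str, data: bytes) -> str:
    """Format one RECORD line: path, urlsafe-base64 sha256 without padding, size."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{archive_path},sha256={encoded},{len(data)}"
```

Any byte change made by `patchelf` alters the hash, so entries for patched libraries have to be rewritten when the wheel is repacked.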
@ -183,14 +237,7 @@ def complete_wheel(folder: str) -> str:
f"/{folder}/dist/{repaired_wheel_name}",
)
else:
repaired_wheel_name = wheel_name.replace(
"linux_aarch64", "manylinux_2_28_aarch64"
)
print(f"Renaming {wheel_name} wheel to {repaired_wheel_name}")
os.rename(
f"/{folder}/dist/{wheel_name}",
f"/{folder}/dist/{repaired_wheel_name}",
)
repaired_wheel_name = list_dir(f"/{folder}/dist")[0]
print(f"Copying {repaired_wheel_name} to artifacts")
shutil.copy2(
@ -227,11 +274,21 @@ if __name__ == "__main__":
).decode()
print("Building PyTorch wheel")
build_vars = "CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
build_vars = ""
# MAX_JOB=5 is not required for CPU backend (see commit 465d98b)
if enable_cuda:
build_vars += "MAX_JOBS=5 "
# Handle PyPI NVIDIA libraries vs bundled libraries
use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"
if use_nvidia_pypi_libs:
print("Configuring build for PyPI NVIDIA libraries")
# Configure for dynamic linking (matching x86 logic)
build_vars += "ATEN_STATIC_CUDA=0 USE_CUDA_STATIC_LINK=0 USE_CUPTI_SO=1 "
else:
print("Configuring build for bundled NVIDIA libraries")
# Keep existing static linking approach - already configured above
override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
desired_cuda = os.getenv("DESIRED_CUDA")
if override_package_version is not None:
@ -256,19 +313,13 @@ if __name__ == "__main__":
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "
if enable_mkldnn:
build_ArmComputeLibrary()
print("build pytorch with mkldnn+acl backend")
build_vars += (
"USE_MKLDNN=ON USE_MKLDNN_ACL=ON "
"ACL_ROOT_DIR=/acl "
"LD_LIBRARY_PATH=/pytorch/build/lib:/acl/build:$LD_LIBRARY_PATH "
"ACL_INCLUDE_DIR=/acl/build "
"ACL_LIBRARY=/acl/build "
)
build_vars += "USE_MKLDNN=ON USE_MKLDNN_ACL=ON "
build_vars += "ACL_ROOT_DIR=/acl "
if enable_cuda:
build_vars += "BLAS=NVPL "
else:
build_vars += "BLAS=OpenBLAS OpenBLAS_HOME=/OpenBLAS "
build_vars += "BLAS=OpenBLAS OpenBLAS_HOME=/opt/OpenBLAS "
else:
print("build pytorch without mkldnn backend")


@ -299,40 +299,6 @@ def install_condaforge_python(host: RemoteHost, python_version="3.8") -> None:
)
def build_OpenBLAS(host: RemoteHost, git_clone_flags: str = "") -> None:
print("Building OpenBLAS")
host.run_cmd(
f"git clone https://github.com/xianyi/OpenBLAS -b v0.3.28 {git_clone_flags}"
)
make_flags = "NUM_THREADS=64 USE_OPENMP=1 NO_SHARED=1 DYNAMIC_ARCH=1 TARGET=ARMV8"
host.run_cmd(
f"pushd OpenBLAS && make {make_flags} -j8 && sudo make {make_flags} install && popd && rm -rf OpenBLAS"
)
def build_ArmComputeLibrary(host: RemoteHost, git_clone_flags: str = "") -> None:
print("Building Arm Compute Library")
acl_build_flags = " ".join(
[
"debug=0",
"neon=1",
"opencl=0",
"os=linux",
"openmp=1",
"cppthreads=0",
"arch=armv8a",
"multi_isa=1",
"fixed_format_kernels=1",
"build=native",
]
)
host.run_cmd(
f"git clone https://github.com/ARM-software/ComputeLibrary.git -b v25.02 {git_clone_flags}"
)
host.run_cmd(f"cd ComputeLibrary && scons Werror=1 -j8 {acl_build_flags}")
def embed_libgomp(host: RemoteHost, use_conda, wheel_name) -> None:
host.run_cmd("pip3 install auditwheel")
host.run_cmd(
@ -700,7 +666,6 @@ def start_build(
configure_system(
host, compiler=compiler, use_conda=use_conda, python_version=python_version
)
build_OpenBLAS(host, git_clone_flags)
if host.using_docker():
print("Move libgfortant.a into a standard location")
@ -723,6 +688,8 @@ def start_build(
f"git clone --recurse-submodules -b {branch} https://github.com/pytorch/pytorch {git_clone_flags}"
)
host.run_cmd("pytorch/.ci/docker/common/install_openblas.sh")
print("Building PyTorch wheel")
build_opts = ""
if pytorch_build_number is not None:
@ -743,15 +710,18 @@ def start_build(
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
if enable_mkldnn:
build_ArmComputeLibrary(host, git_clone_flags)
host.run_cmd("pytorch/.ci/docker/common/install_acl.sh")
print("build pytorch with mkldnn+acl backend")
build_vars += " USE_MKLDNN=ON USE_MKLDNN_ACL=ON"
build_vars += " BLAS=OpenBLAS"
build_vars += " OpenBLAS_HOME=/opt/OpenBLAS"
build_vars += " ACL_ROOT_DIR=/acl"
host.run_cmd(
f"cd $HOME/pytorch && export ACL_ROOT_DIR=$HOME/ComputeLibrary && {build_vars} python3 setup.py bdist_wheel{build_opts}"
f"cd $HOME/pytorch && {build_vars} python3 setup.py bdist_wheel{build_opts}"
)
print("Repair the wheel")
pytorch_wheel_name = host.list_dir("pytorch/dist")[0]
ld_library_path = "$HOME/acl/build:$HOME/pytorch/build/lib"
ld_library_path = "/acl/build:$HOME/pytorch/build/lib"
host.run_cmd(
f"export LD_LIBRARY_PATH={ld_library_path} && auditwheel repair $HOME/pytorch/dist/{pytorch_wheel_name}"
)
@ -907,7 +877,7 @@ def terminate_instances(instance_type: str) -> None:
def parse_arguments():
from argparse import ArgumentParser
parser = ArgumentParser("Builid and test AARCH64 wheels using EC2")
parser = ArgumentParser("Build and test AARCH64 wheels using EC2")
parser.add_argument("--key-name", type=str)
parser.add_argument("--debug", action="store_true")
parser.add_argument("--build-only", action="store_true")


@ -56,9 +56,13 @@ ENV INSTALLED_VISION ${VISION}
# Install rocm
ARG ROCM_VERSION
RUN mkdir ci_commit_pins
COPY ./common/common_utils.sh common_utils.sh
COPY ./ci_commit_pins/rocm-composable-kernel.txt ci_commit_pins/rocm-composable-kernel.txt
COPY ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh
RUN rm install_rocm.sh
RUN rm install_rocm.sh common_utils.sh
RUN rm -r ci_commit_pins
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION}
RUN rm install_rocm_magma.sh


@ -0,0 +1 @@
7fe50dc3da2069d6645d9deb8c017a876472a977

.ci/docker/common/install_acl.sh Normal file → Executable file

@ -1,16 +1,27 @@
set -euo pipefail
#!/bin/bash
# Script used only in CD pipeline
readonly version=v25.02
readonly src_host=https://github.com/ARM-software
readonly src_repo=ComputeLibrary
set -eux
ACL_VERSION=${ACL_VERSION:-"v25.02"}
ACL_INSTALL_DIR="/acl"
# Clone ACL
[[ ! -d ${src_repo} ]] && git clone ${src_host}/${src_repo}.git
cd ${src_repo}
git checkout $version
git clone https://github.com/ARM-software/ComputeLibrary.git -b "${ACL_VERSION}" --depth 1 --shallow-submodules
ACL_CHECKOUT_DIR="ComputeLibrary"
# Build with scons
pushd $ACL_CHECKOUT_DIR
scons -j8 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 \
os=linux arch=armv8a build=native multi_isa=1 \
fixed_format_kernels=1 openmp=1 cppthreads=0
popd
# Install ACL
sudo mkdir -p ${ACL_INSTALL_DIR}
for d in arm_compute include utils support src build
do
sudo cp -r ${ACL_CHECKOUT_DIR}/${d} ${ACL_INSTALL_DIR}/${d}
done
rm -rf $ACL_CHECKOUT_DIR
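
The `${ACL_VERSION:-"v25.02"}` expansion in the script above falls back to the default both when the variable is unset and when it is set to an empty string. That distinction matters if a caller forwards an empty `ACL_VERSION` build argument. A minimal Python sketch of the same rule follows; `resolve_version` is a hypothetical helper written for illustration and is not part of the repository, and `v24.09` is only a placeholder value.

```python
def resolve_version(env: dict, name: str, default: str) -> str:
    """Mimic the shell expansion ${NAME:-default}."""
    value = env.get(name)
    # ":-" treats an unset variable and an empty string the same way.
    return value if value else default


if __name__ == "__main__":
    # Unset: the default is used.
    print(resolve_version({}, "ACL_VERSION", "v25.02"))
    # Set but empty: the default is still used.
    print(resolve_version({"ACL_VERSION": ""}, "ACL_VERSION", "v25.02"))
    # Set to a placeholder value: the override wins.
    print(resolve_version({"ACL_VERSION": "v24.09"}, "ACL_VERSION", "v25.02"))
```

With plain `${ACL_VERSION-"v25.02"}` (no colon) the empty string would be kept instead, which is why the colon form is the safer choice for optional build arguments.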

12
.ci/docker/common/install_openblas.sh Normal file → Executable file

@ -3,8 +3,10 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b "${OPENBLAS_VERSION:-v0.3.30}" --depth 1 --shallow-submodules
OPENBLAS_VERSION=${OPENBLAS_VERSION:-"v0.3.30"}
# Clone OpenBLAS
git clone https://github.com/OpenMathLib/OpenBLAS.git -b "${OPENBLAS_VERSION}" --depth 1 --shallow-submodules
OPENBLAS_CHECKOUT_DIR="OpenBLAS"
OPENBLAS_BUILD_FLAGS="
@ -17,5 +19,7 @@ CFLAGS=-O3
BUILD_BFLOAT16=1
"
make -j8 ${OPENBLAS_BUILD_FLAGS} -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} install -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} -C $OPENBLAS_CHECKOUT_DIR
sudo make install -C $OPENBLAS_CHECKOUT_DIR
rm -rf $OPENBLAS_CHECKOUT_DIR


@ -2,6 +2,11 @@
set -ex
# for pip_install function
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
ROCM_COMPOSABLE_KERNEL_VERSION="$(cat $(dirname $0)/../ci_commit_pins/rocm-composable-kernel.txt)"
ver() {
printf "%3d%03d%03d%03d" $(echo "$1" | tr '.' ' ');
}
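
The `ver` helper above packs a dotted version string into a single zero-padded number so that versions can be compared numerically rather than lexically. A rough Python equivalent, assuming at most four purely numeric components (missing components count as zero, as with `printf` in bash); `ver_key` is a hypothetical name used only for this sketch.

```python
def ver_key(version: str) -> int:
    """Pack up to four dotted components into one comparable integer."""
    parts = [int(p) for p in version.split(".")]
    if len(parts) > 4:
        raise ValueError("this sketch assumes at most four components")
    # Missing trailing components are treated as zero.
    parts += [0] * (4 - len(parts))
    key = 0
    for part in parts:
        # Each component gets three decimal digits, like %03d.
        key = key * 1000 + part
    return key


if __name__ == "__main__":
    # A plain string comparison would order "6.10" before "6.9".
    print(ver_key("6.10") > ver_key("6.9"))
    print(ver_key("6.4.1") > ver_key("6.4"))
```

The scheme only holds while every component stays below 1000.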
@ -113,6 +118,8 @@ EOF
rm -rf HIP clr
fi
pip_install "git+https://github.com/rocm/composable_kernel@$ROCM_COMPOSABLE_KERNEL_VERSION"
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
@ -176,6 +183,8 @@ install_centos() {
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
pip_install "git+https://github.com/rocm/composable_kernel@$ROCM_COMPOSABLE_KERNEL_VERSION"
# Cleanup
yum clean all
rm -rf /var/cache/yum


@ -62,6 +62,13 @@ ARG OPENBLAS_VERSION
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
# Install Arm Compute Library
FROM base as arm_compute
# use python3.9 to install scons
RUN python3.9 -m pip install scons==4.7.0
RUN ln -sf /opt/python/cp39-cp39/bin/scons /usr/local/bin
COPY ./common/install_acl.sh install_acl.sh
RUN bash ./install_acl.sh && rm install_acl.sh
FROM base as final
# remove unnecessary python versions
@ -70,4 +77,5 @@ RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH
COPY --from=arm_compute /acl /acl
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:/acl/build/:$LD_LIBRARY_PATH


@ -28,6 +28,7 @@ fi
MANY_LINUX_VERSION=${MANY_LINUX_VERSION:-}
DOCKERFILE_SUFFIX=${DOCKERFILE_SUFFIX:-}
OPENBLAS_VERSION=${OPENBLAS_VERSION:-}
ACL_VERSION=${ACL_VERSION:-}
case ${image} in
manylinux2_28-builder:cpu)
@ -41,7 +42,6 @@ case ${image} in
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13 --build-arg NINJA_VERSION=1.12.1"
MANY_LINUX_VERSION="2_28_aarch64"
OPENBLAS_VERSION="v0.3.30"
;;
manylinuxcxx11-abi-builder:cpu-cxx11-abi)
TARGET=final
@ -121,7 +121,8 @@ tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "OPENBLAS_VERSION=${OPENBLAS_VERSION}" \
--build-arg "OPENBLAS_VERSION=${OPENBLAS_VERSION:-}" \
--build-arg "ACL_VERSION=${ACL_VERSION:-}" \
--target "${TARGET}" \
-t "${tmp_tag}" \
$@ \


@ -52,9 +52,13 @@ ENV INSTALLED_VISION ${VISION}
# Install rocm
ARG ROCM_VERSION
RUN mkdir ci_commit_pins
COPY ./common/common_utils.sh common_utils.sh
COPY ./ci_commit_pins/rocm-composable-kernel.txt ci_commit_pins/rocm-composable-kernel.txt
COPY ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh
RUN rm install_rocm.sh
RUN rm install_rocm.sh common_utils.sh
RUN rm -r ci_commit_pins
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION}
RUN rm install_rocm_magma.sh


@ -96,14 +96,24 @@ def sample_vllm_test_library():
"num_gpus": 4,
"steps": [
"pytest -v -s -x lora/test_chatglm3_tp.py",
"echo $VLLM_WORKER_MULTIPROC_METHOD",
"pytest -v -s -x lora/test_llama_tp.py",
"pytest -v -s -x lora/test_multi_loras_with_tp.py",
"pytest -v -s -x lora/test_llm_with_multi_loras.py",
],
},
"vllm_lora_280_failure_test": {
"title": "LoRA 280 failure test",
"id": "vllm_lora_280_failure_test",
"vllm_distributed_test_28_failure_test": {
"title": "Distributed Tests (2 GPUs) pytorch 2.8 release failure",
"id": "vllm_distributed_test_28_failure_test",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"num_gpus": 4,
"steps": [
"pytest -v -s distributed/test_sequence_parallel.py",
],
},
"vllm_lora_28_failure_test": {
"title": "LoRA pytorch 2.8 failure test",
"id": "vllm_lora_28_failure_test",
"steps": ["pytest -v lora/test_quant_model.py"],
},
"vllm_multi_model_processor_test": {
@ -114,6 +124,15 @@ def sample_vllm_test_library():
"pytest -v -s models/multimodal/processing --ignore models/multimodal/processing/test_tensor_schema.py",
],
},
"vllm_multi_model_test_28_failure_test": {
"title": "Multi-Model Test (Failed 2.8 release)",
"id": "vllm_multi_model_test_28_failure_test",
"package_install": ["git+https://github.com/TIGER-AI-Lab/Mantis.git"],
"steps": [
"pytest -v -s models/multimodal/generation/test_voxtral.py",
"pytest -v -s models/multimodal/pooling",
],
},
"vllm_pytorch_compilation_unit_tests": {
"title": "PyTorch Compilation Unit Tests",
"id": "vllm_pytorch_compilation_unit_tests",
@ -128,6 +147,28 @@ def sample_vllm_test_library():
"pytest -v -s compile/test_decorator.py",
],
},
"vllm_languagde_model_test_extended_generation_28_failure_test": {
"title": "Language Models Test (Extended Generation) 2.8 release failure",
"id": "vllm_languagde_model_test_extended_generation_28_failure_test",
"package_install": [
"--no-build-isolation",
"git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8",
],
"steps": [
"pytest -v -s models/language/generation/test_mistral.py",
],
},
"vllm_distributed_test_2_gpu_28_failure_test": {
"title": "Distributed Tests (2 GPUs) pytorch 2.8 release failure",
"id": "vllm_distributed_test_2_gpu_28_failure_test",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"num_gpus": 4,
"steps": [
"pytest -v -s distributed/test_sequence_parallel.py",
],
},
# TODO(elainewy):need to add g6 with 4 gpus to run this test
"vllm_lora_test": {
"title": "LoRA Test %N",


@ -104,20 +104,26 @@ class VllmTestRunner(BaseRunner):
main function to run vllm test
"""
self.prepare()
with working_directory(self.work_directory):
if self.test_type == TestInpuType.TEST_PLAN:
if self.num_shards > 1:
run_test_plan(
self.test_plan,
"vllm",
sample_vllm_test_library(),
self.shard_id,
self.num_shards,
)
try:
with working_directory(self.work_directory):
if self.test_type == TestInpuType.TEST_PLAN:
if self.num_shards > 1:
run_test_plan(
self.test_plan,
"vllm",
sample_vllm_test_library(),
self.shard_id,
self.num_shards,
)
else:
run_test_plan(
self.test_plan, "vllm", sample_vllm_test_library()
)
else:
run_test_plan(self.test_plan, "vllm", sample_vllm_test_library())
else:
raise ValueError(f"Unknown test type {self.test_type}")
raise ValueError(f"Unknown test type {self.test_type}")
finally:
# double check the torches are not overridden by other packages
check_versions()
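
The change above wraps the test run in `try`/`finally` so that the version check runs whether the test plan succeeds, fails, or the test type is rejected. A self-contained sketch of that control flow; the `check_versions` and `run` functions here are simplified stand-ins that only record calls, not the real implementations.

```python
calls = []


def check_versions():
    # Stand-in for the real check: it only records that it ran.
    calls.append("check_versions")


def run(test_type: str) -> None:
    try:
        if test_type != "test_plan":
            raise ValueError(f"Unknown test type {test_type}")
        # Stand-in for run_test_plan(...).
        calls.append("run_test_plan")
    finally:
        # Runs on success and on failure alike.
        check_versions()


if __name__ == "__main__":
    run("test_plan")
    try:
        run("bogus")
    except ValueError as err:
        print(f"rejected: {err}")
    print(calls)
```

The exception still propagates to the caller after the `finally` block has run, so a failing test plan is not masked by the check.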
def _install_wheels(self, params: VllmTestParameters):
logger.info("Running vllm test with inputs: %s", params)


@ -89,7 +89,7 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
export USE_MKLDNN=1
export USE_MKLDNN_ACL=1
export ACL_ROOT_DIR=/ComputeLibrary
export ACL_ROOT_DIR=/acl
fi
if [[ "$BUILD_ENVIRONMENT" == *riscv64* ]]; then


@ -35,10 +35,11 @@ fi
print_cmake_info
if [[ ${BUILD_ENVIRONMENT} == *"distributed"* ]]; then
USE_OPENMP=1 WERROR=1 python setup.py bdist_wheel
# Needed for inductor benchmarks, as lots of HF networks make `torch.distributed` calls
USE_DISTRIBUTED=1 USE_OPENMP=1 WERROR=1 python setup.py bdist_wheel
else
# NB: we always build with distributed; USE_DISTRIBUTED turns off all
# backends (specifically the gloo backend), so test that this case works too
# Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel --plat-name macosx_11_0_arm64
fi
if which sccache > /dev/null; then


@ -13,13 +13,9 @@ if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available(
fi
popd
python -mpip install -r requirements.txt
# enable debug asserts in serialization
export TORCH_SERIALIZATION_DEBUG=1
python -mpip install --no-input -r requirements.txt
setup_test_python() {
# The CircleCI worker hostname doesn't resolve to an address.
# This environment variable makes ProcessGroupGloo default to
@ -181,9 +177,6 @@ checkout_install_torchbench() {
popd
pip install -r .ci/docker/ci_commit_pins/huggingface-requirements.txt
# https://github.com/pytorch/pytorch/issues/160689 to remove torchao because
# its current version 0.12.0 doesn't work with transformers 4.54.0
pip uninstall -y torchao
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze


@ -778,11 +778,6 @@ test_single_dynamo_benchmark() {
}
test_inductor_micro_benchmark() {
# torchao requires cuda 8.0 or above for bfloat16 support
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;8.6"
fi
install_torchao
TEST_REPORTS_DIR=$(pwd)/test/test-reports
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
test_inductor_set_cpu_affinity
@ -1664,37 +1659,50 @@ elif [[ "${TEST_CONFIG}" == *operator_benchmark* ]]; then
elif [[ "${TEST_CONFIG}" == *all* ]]; then
TEST_MODE="all"
fi
test_operator_benchmark cpu ${TEST_MODE}
fi
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
install_torchao
fi
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then
test_inductor_halide
elif [[ "${TEST_CONFIG}" == *inductor-triton-cpu* ]]; then
test_inductor_triton_cpu
elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then
install_torchao
test_inductor_micro_benchmark
elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then
install_torchvision
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
install_torchao
fi
id=$((SHARD_NUMBER-1))
test_dynamo_benchmark huggingface "$id"
elif [[ "${TEST_CONFIG}" == *timm* ]]; then
install_torchvision
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
install_torchao
fi
id=$((SHARD_NUMBER-1))
test_dynamo_benchmark timm_models "$id"
elif [[ "${TEST_CONFIG}" == cachebench ]]; then
install_torchaudio
install_torchvision
install_torchao
PYTHONPATH=/torchbench test_cachebench
elif [[ "${TEST_CONFIG}" == verify_cachebench ]]; then
install_torchaudio
install_torchvision
install_torchao
PYTHONPATH=/torchbench test_verify_cachebench
elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
install_torchaudio
install_torchvision
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
install_torchao
fi
id=$((SHARD_NUMBER-1))
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
@ -1714,12 +1722,18 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
fi
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then
install_torchvision
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
install_torchao
fi
PYTHONPATH=/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"
if [[ "$SHARD_NUMBER" -eq "1" ]]; then
test_inductor_aoti
fi
elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
install_torchao
fi
test_inductor_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
if [[ "${BUILD_ENVIRONMENT}" != linux-jammy-py3.9-gcc11-build ]]; then


@ -1,9 +1,9 @@
set WIN_DRIVER_VN=528.89
set "DRIVER_DOWNLOAD_LINK=https://ossci-windows.s3.amazonaws.com/%WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe" & REM @lint-ignore
curl --retry 3 -kL %DRIVER_DOWNLOAD_LINK% --output %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe
set WIN_DRIVER_VN=580.88
set "DRIVER_DOWNLOAD_LINK=https://ossci-windows.s3.amazonaws.com/%WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe" & REM @lint-ignore
curl --retry 3 -kL %DRIVER_DOWNLOAD_LINK% --output %WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe
if errorlevel 1 exit /b 1
start /wait %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe -s -noreboot
start /wait %WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe -s -noreboot
if errorlevel 1 exit /b 1
del %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe || ver > NUL
del %WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe || ver > NUL


@ -189,8 +189,7 @@ pip install requests ninja typing-extensions
retry pip install -r "${pytorch_rootdir}/requirements.txt" || true
retry brew install libomp
# For USE_DISTRIBUTED=1 on macOS, this enables gloo, which needs libuv, which
# is built as part of tensorpipe submodule
# For USE_DISTRIBUTED=1 on macOS, need libuv, which is built as part of tensorpipe submodule
export USE_DISTRIBUTED=1
export USE_MKLDNN=OFF


@ -1 +1 @@
3f90600fc287b276979ff2c8550a61d5d896bb8d
fa5142928ee157aa65137c4ecff2fe9b1a9e0648


@ -1 +1 @@
7f1de94a4c2d14f59ad4ca84538c36084ea6b2c8
08ae0af1395c8d8471f4025deb6af9aef90b342f


@ -1 +1 @@
51c87b6ead6b7e098ada95d6a7609ee873b854cf
f32431e593d0e9db86c502d3872dd67ee40a005f


@ -1 +1 @@
4172235ab78b09989fb56edaf734dbee283dda3e
cc99baf14dacc2497d0c5ed84e076ef2c37f6a4d


@ -38,60 +38,60 @@ CPU_AARCH64_ARCH = ["cpu-aarch64"]
CPU_S390X_ARCH = ["cpu-s390x"]
CUDA_AARCH64_ARCHES = ["13.0-aarch64"]
CUDA_AARCH64_ARCHES = ["12.6-aarch64", "12.8-aarch64", "13.0-aarch64"]
PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"12.6": (
"nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'"
"nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | "
"nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | "
"nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | "
"nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | "
"nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | "
"nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | "
"nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | "
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | "
"nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | "
"nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'"
),
"12.8": (
"nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'"
"nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | "
"nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | "
"nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | "
"nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | "
"nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | "
"nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | "
"nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | "
"nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | "
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | "
"nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'"
),
"13.0": (
"nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'"
"nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | "
"nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | "
"nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | "
"nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | "
"nvidia-cublas==13.0.0.19; platform_system == 'Linux' | "
"nvidia-cufft==12.0.0.15; platform_system == 'Linux' | "
"nvidia-curand==10.4.0.35; platform_system == 'Linux' | "
"nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | "
"nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | "
"nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | "
"nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | "
"nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx==13.0.39; platform_system == 'Linux' | "
"nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | "
"nvidia-cufile==1.15.0.42; platform_system == 'Linux'"
),
"xpu": (
"intel-cmplr-lib-rt==2025.2.1 | "
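
Each value in the requirements table above is one string holding several pinned requirements separated by ` | `, each with an optional environment marker after a `;`. A small sketch of how such a string could be split back into (requirement, marker) pairs; `split_requirements` is a hypothetical helper, and it assumes that `|` never appears inside a requirement or marker.

```python
from typing import List, Tuple


def split_requirements(spec: str) -> List[Tuple[str, str]]:
    """Split a ' | '-joined requirement string into (requirement, marker) pairs."""
    entries = []
    for item in spec.split("|"):
        item = item.strip()
        if not item:
            # Tolerate a trailing separator.
            continue
        requirement, _, marker = item.partition(";")
        entries.append((requirement.strip(), marker.strip()))
    return entries


if __name__ == "__main__":
    spec = (
        "nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | "
        "nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux'"
    )
    for requirement, marker in split_requirements(spec):
        print(requirement, "->", marker or "(no marker)")
```

Seen this way, dropping `platform_machine == 'x86_64'` from the marker means the pinned packages apply to every Linux install of the wheel, including aarch64.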

91
.github/scripts/prepare_vllm_wheels.sh vendored Executable file

@ -0,0 +1,91 @@
#!/usr/bin/env bash
set -eux
torch_version=$(unzip -p torch-* '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
nightly=$(echo ${torch_version} | cut -d'.' -f4)
# Copied from .ci/manywheel/build_common.sh
make_wheel_record() {
fpath=$1
if echo $fpath | grep RECORD >/dev/null 2>&1; then
echo "$fpath,,"
else
fhash=$(openssl dgst -sha256 -binary $fpath | openssl base64 | sed -e 's/+/-/g' | sed -e 's/\//_/g' | sed -e 's/=//g')
fsize=$(ls -nl $fpath | awk '{print $5}')
echo "$fpath,sha256=$fhash,$fsize"
fi
}
change_wheel_version() {
local package=$1
local wheel=$2
local f_version=$3
local t_version=$4
# Extract the wheel
${PYTHON_EXECUTABLE} -mwheel unpack $wheel
mv "${package}-${f_version}" "${package}-${t_version}"
# Change the version from f_version to t_version in the dist-info dir
pushd "${package}-${t_version}"
mv "${package}-${f_version}.dist-info" "${package}-${t_version}.dist-info"
pushd "${package}-${t_version}.dist-info"
sed -i "s/${package}-${f_version}.dist-info/${package}-${t_version}.dist-info/g" RECORD
# Update the version in METADATA and its SHA256 hash
sed -i "s/Version: ${f_version}/Version: ${t_version}/g" METADATA
# then add PyTorch nightly dependency of vLLM
if [[ "${package}" == vllm ]] || [[ "${package}" == xformers ]]; then
sed -i "/License-File/a\Requires-Dist: torch==${torch_version}" METADATA
fi
sed -i '/METADATA,sha256/d' RECORD
popd
make_wheel_record "${package}-${t_version}.dist-info/METADATA" >> "${package}-${t_version}.dist-info/RECORD"
popd
# Repack the wheel
${PYTHON_EXECUTABLE} -mwheel pack "${package}-${t_version}"
# Clean up
rm -rf "${package}-${t_version}"
}
repackage_wheel() {
local package=$1
pushd $package
local orig_wheel=$(find . -name *${package//-/_}*)
local orig_version=$(unzip -p $orig_wheel '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
local version=""
if [[ "${package}" == vllm ]]; then
# Copied from vllm/.buildkite/scripts/upload-wheels.sh
version=1.0.0
else
version=$(echo $orig_version | tr '.+' '.' | cut -d'.' -f1-3)
fi
local nightly_version=$version.$nightly
# Use nightly version
change_wheel_version ${package//-/_} $orig_wheel $orig_version $nightly_version
# Clean up
rm "${orig_wheel}"
auditwheel repair --plat $PLATFORM *.whl \
--exclude libc10* --exclude libtorch* --exclude libcu* --exclude libnv*
local repair_wheel=$(find wheelhouse -name *${PLATFORM}*)
local repair_wheel=$(basename ${repair_wheel})
popd
cp ${package}/wheelhouse/${repair_wheel} .
rm -rf $package
}
pushd externals/vllm/wheels
for package in xformers flashinfer-python vllm; do
repackage_wheel $package
done
popd
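
Two pieces of the script above are easy to misread: the `RECORD` line that `make_wheel_record` emits (a URL-safe, unpadded base64 SHA-256 digest plus the file size) and the way `repackage_wheel` derives the nightly version. The Python sketch below mirrors both under stated assumptions: it takes file contents as bytes instead of reading from disk, the function names are hypothetical, and the version strings in the demo are made-up examples rather than real releases.

```python
import base64
import hashlib
import re


def record_entry(path: str, data: bytes) -> str:
    """Build the RECORD line for one file, as make_wheel_record does."""
    if "RECORD" in path:
        # RECORD lists itself without a hash or size.
        return f"{path},,"
    digest = hashlib.sha256(data).digest()
    # URL-safe alphabet ('-' and '_') with '=' padding removed.
    fhash = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return f"{path},sha256={fhash},{len(data)}"


def nightly_version(package: str, orig_version: str, torch_version: str) -> str:
    """Combine a base version with the fourth dot-field of the torch version."""
    fields = torch_version.split(".")
    # Mirrors `cut -d'.' -f4`; empty if there is no fourth field.
    nightly = fields[3] if len(fields) > 3 else ""
    if package == "vllm":
        base = "1.0.0"
    else:
        # Mirrors `tr '.+' '.' | cut -d'.' -f1-3`.
        base = ".".join(re.split(r"[.+]", orig_version)[:3])
    return f"{base}.{nightly}"


if __name__ == "__main__":
    print(record_entry("pkg-1.0.dist-info/METADATA", b"Version: 1.0\n"))
    # Made-up version strings, for illustration only.
    print(nightly_version("xformers", "0.0.33+5d4d9b8.d20250911",
                          "2.10.0.dev20250911+cu128"))
```

Note that the fourth dot-field keeps any local suffix of the torch version (such as `+cu128` in the made-up example), so that suffix carries over into the repackaged wheel's version.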


@ -47,12 +47,11 @@ jobs:
matrix:
include: [
{ name: "manylinux2_28-builder", tag: "cuda13.0", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cuda12.9", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cuda12.8", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cuda12.6", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda13.0", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.9", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.8", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.6", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm6.3", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm6.4", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cpu", runner: "linux.9xlarge.ephemeral" },


@ -59,20 +59,6 @@ jobs:
run: |
set -eux
# Keep PyTorch nightly wheel here so that we can install it later during
# vLLM build process
mkdir -p "${RUNNER_TEMP}/artifacts/"
container_name=$(docker run \
--tty \
--detach \
-e PLATFORM \
-v "${GITHUB_WORKSPACE}:/pytorch" \
-v "${RUNNER_TEMP}/artifacts:/artifacts" \
-w /artifacts/ \
"${MANYLINUX_IMAGE}"
)
# Determine python executable for given version (copied from build-triton-wheel)
case $PY_VERS in
3.10)
@ -102,6 +88,21 @@ jobs:
;;
esac
# Keep PyTorch nightly wheel here so that we can install it later during
# vLLM build process
mkdir -p "${RUNNER_TEMP}/artifacts/"
container_name=$(docker run \
--tty \
--detach \
-e PLATFORM \
-e PYTHON_EXECUTABLE="${PYTHON_EXECUTABLE}" \
-v "${GITHUB_WORKSPACE}:/pytorch" \
-v "${RUNNER_TEMP}/artifacts:/artifacts" \
-w /artifacts/ \
"${MANYLINUX_IMAGE}"
)
docker exec -t "${container_name}" "${PYTHON_EXECUTABLE}" -mpip install \
--pre torch torchvision torchaudio \
--index-url "https://download.pytorch.org/whl/nightly/${BUILD_DEVICE}"
@ -113,7 +114,6 @@ jobs:
--index-url "https://download.pytorch.org/whl/nightly/${BUILD_DEVICE}"
# Save this for later
echo "PYTHON_EXECUTABLE=${PYTHON_EXECUTABLE}" >> "$GITHUB_ENV"
echo "container_name=${container_name}" >> "$GITHUB_ENV"
- name: Build vLLM wheel
@ -131,36 +131,7 @@ jobs:
set -eux
# Get these wheels ready, the vllm renaming logic is copied from its .buildkite/scripts/upload-wheels.sh
docker exec -t "${container_name}" bash -c "
set -eux
nightly=\$(unzip -p torch-* '**/METADATA' | grep '^Version: ' | cut -d' ' -f2 | cut -d'.' -f4)
pushd externals/vllm/wheels
for package in xformers flashinfer-python vllm; do
pushd \$package
auditwheel repair --plat \$PLATFORM *.whl \
--exclude libc10* --exclude libtorch* --exclude libcu* --exclude libnv*
repair_wheel=\$(find wheelhouse -name *\${PLATFORM}*)
repair_wheel=\$(basename \${repair_wheel})
popd
cp \${package}/wheelhouse/\${repair_wheel} .
version=\$(unzip -p \$repair_wheel '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
if [[ \$package == vllm ]]; then
new_wheel=\${repair_wheel/\$version/1.0.0.\$nightly}
else
major_version=\$(echo \$version | tr '.+' '.' | cut -d'.' -f1-3)
new_wheel=\${repair_wheel/\$version/\$major_version.\$nightly}
fi
mv -- \$repair_wheel \$new_wheel
rm -rf \$package
done
popd
"
docker exec -t "${container_name}" bash -c /pytorch/.github/scripts/prepare_vllm_wheels.sh
docker exec -t "${container_name}" chown -R 1000:1000 /artifacts
- uses: actions/upload-artifact@50769540e7f4bd5e21e526ee35c689e35e0d6874 # v4.4.0


@ -112,6 +112,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_10-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_10-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -132,7 +224,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -223,6 +315,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -243,7 +427,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -334,6 +518,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -354,7 +630,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -445,6 +721,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -465,7 +833,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -556,6 +924,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -576,7 +1036,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -667,6 +1127,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.14"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_14-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.14"
build_name: manywheel-py3_14-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.14"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_14-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.14"
build_name: manywheel-py3_14-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -687,7 +1239,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -778,6 +1330,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14t-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.14t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_14t-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.14t"
build_name: manywheel-py3_14t-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14t-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.14t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_14t-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.14t"
build_name: manywheel-py3_14t-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14t-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -798,7 +1442,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
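
Note on the hunks above: each changed PYTORCH_EXTRA_INSTALL_REQUIREMENTS value appears to keep the same package pins and only drop the `and platform_machine == 'x86_64'` clause from every marker. A minimal sketch for checking that on a pair of values, assuming the entries are separated by " | " as shown in the diff (the helper name `split_requirements` is hypothetical, and how the build scripts actually consume this variable is not shown here):

```python
def split_requirements(value: str) -> list[tuple[str, str]]:
    """Split a '|'-separated requirements string into (requirement, marker) pairs."""
    pairs = []
    for entry in value.split("|"):
        entry = entry.strip()
        if not entry:
            continue
        # Everything before the first ';' is the pinned requirement,
        # everything after it is the environment marker.
        requirement, _, marker = entry.partition(";")
        pairs.append((requirement.strip(), marker.strip()))
    return pairs


# Shortened copies of an old/new value pair from the diff (two entries only).
old = (
    "nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
    "nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64'"
)
new = (
    "nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | "
    "nvidia-nccl-cu13==2.27.7; platform_system == 'Linux'"
)

old_pairs = split_requirements(old)
new_pairs = split_requirements(new)

# True when the pins are identical and only the markers differ.
same_pins = [req for req, _ in old_pairs] == [req for req, _ in new_pairs]
markers_changed = {m for _, m in old_pairs} != {m for _, m in new_pairs}
print(same_pins, markers_changed)
```

Running the same comparison on the x86_64 cu12 hunks would report differing pins as well, since those lines also move nvidia-nvshmem-cu12 from 3.3.20 to 3.3.24.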

@@ -60,7 +60,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_8-test: # Testing

@@ -127,7 +127,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_6-test: # Testing
@@ -193,7 +193,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_8-test: # Testing
@@ -259,7 +259,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda13_0-test: # Testing
@@ -719,7 +719,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_6-test: # Testing
@@ -785,7 +785,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_8-test: # Testing
@@ -851,7 +851,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda13_0-test: # Testing
@@ -1311,7 +1311,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_6-test: # Testing
@@ -1377,7 +1377,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_8-test: # Testing
@@ -1443,7 +1443,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda13_0-test: # Testing
@@ -1903,7 +1903,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_6-test: # Testing
@@ -1969,7 +1969,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_8-test: # Testing
@@ -2035,7 +2035,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda13_0-test: # Testing
@@ -2495,7 +2495,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_6-test: # Testing
@@ -2561,7 +2561,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_8-test: # Testing
@@ -2627,7 +2627,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda13_0-test: # Testing
@@ -3087,7 +3087,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda12_6-test: # Testing
@@ -3153,7 +3153,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda12_8-test: # Testing
@@ -3219,7 +3219,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda13_0-test: # Testing
@ -3679,7 +3679,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14t-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda12_6-test: # Testing
@ -3745,7 +3745,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14t-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda12_8-test: # Testing
@ -3811,7 +3811,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14t-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda13_0-test: # Testing


@ -35,6 +35,8 @@ jobs:
needs:
- get-default-label-prefix
with:
# More memory is needed to build torchao
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
@ -43,6 +45,7 @@ jobs:
{ include: [
{ config: "inductor-micro-benchmark", shard: 1, num_shards: 1, runner: "linux.aws.a100", owners: ["oncall:pt2"] },
]}
build-additional-packages: "vision audio fbgemm torchao"
secrets: inherit
test:


@ -137,7 +137,6 @@ jobs:
docker-image: ${{ needs.build.outputs.docker-image }}
test-matrix: ${{ needs.build.outputs.test-matrix }}
timeout-minutes: 720
# disable monitor in perf tests, next step is to enable it
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4
@ -154,7 +153,6 @@ jobs:
docker-image: ${{ needs.build.outputs.docker-image }}
test-matrix: ${{ needs.build.outputs.test-matrix }}
timeout-minutes: 1440
# disable monitor in perf tests, next step is to enable it
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4
@ -173,7 +171,6 @@ jobs:
docker-image: ${{ needs.build.outputs.docker-image }}
test-matrix: ${{ needs.build.outputs.test-matrix }}
timeout-minutes: 720
# disable monitor in perf tests for more investigation
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4


@ -36,6 +36,8 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-default-label-prefix
with:
# More memory is needed to build torchao
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
@ -128,6 +130,8 @@ jobs:
needs:
- get-default-label-prefix
with:
# More memory is needed to build torchao
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks


@ -3,18 +3,10 @@ name: inductor-rocm
on:
push:
branches:
#- main
- main
- release/*
tags:
- ciflow/inductor-rocm/*
schedule:
# We have several schedules so jobs can check github.event.schedule to activate only for a fraction of the runs.
# Also run less frequently on weekends.
- cron: 45 0,8,16 * * 1-5
- cron: 45 4 * * 0,6
- cron: 45 4,12,20 * * 1-5
- cron: 45 12 * * 0,6
- cron: 29 8 * * * # about 1:29am PDT, for mem leak check and rerun disabled tests
workflow_dispatch:
concurrency:


@ -33,6 +33,8 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build torchao
runner: linux.2xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
@ -45,6 +47,7 @@ jobs:
{ config: "inductor_cpp_wrapper", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_cpp_wrapper", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
]}
build-additional-packages: "vision audio torchao"
secrets: inherit
inductor-test:


@ -49,6 +49,8 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build torchao
runner: linux.2xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'


@ -70,4 +70,5 @@ jobs:
build-environment: linux-noble-rocm-py3.12-mi300
docker-image: ${{ needs.linux-noble-rocm-py3_12-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-noble-rocm-py3_12-build.outputs.test-matrix }}
tests-to-include: "inductor/test_ck_backend"
secrets: inherit


@ -3,19 +3,13 @@ name: rocm
on:
push:
branches:
# - main
- main
- release/*
tags:
- ciflow/rocm/*
workflow_dispatch:
schedule:
# We have several schedules so jobs can check github.event.schedule to activate only for a fraction of the runs.
# Also run less frequently on weekends.
- cron: 45 0,8,16 * * 1-5
- cron: 45 4 * * 0,6
- cron: 45 4,12,20 * * 1-5
- cron: 45 12 * * 0,6
- cron: 29 8 * * * # about 1:29am PDT, for mem leak check and rerun disabled tests
- cron: 29 8 * * * # about 1:29am PDT
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}


@ -239,6 +239,8 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build torchao
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.9-gcc11
docker-image-name: ci-image:pytorch-linux-jammy-py3-gcc11-inductor-benchmarks
@ -246,6 +248,7 @@ jobs:
{ include: [
{ config: "verify_cachebench", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
build-additional-packages: "vision audio torchao"
secrets: inherit
verify-cachebench-cpu-test:


@ -2,6 +2,9 @@ name: vllm-test
on:
push:
branches:
- main
- release/*
tags:
- ciflow/vllm/*
workflow_dispatch:
@ -45,14 +48,18 @@ jobs:
{ config: "vllm_basic_models_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_entrypoints_test", shard: 1, num_shards: 1,runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_regression_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_280_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_multi_model_processor_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_pytorch_compilation_unit_tests", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_multi_model_test_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu"},
{ config: "vllm_languagde_model_test_extended_generation_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu"},
{ config: "vllm_distributed_test_2_gpu_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_test", shard: 0, num_shards: 4, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_test", shard: 1, num_shards: 4, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_test", shard: 2, num_shards: 4, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_test", shard: 3, num_shards: 4, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_tp_test_distributed", shard: 1, num_shards: 1, runner: "linux.aws.h100.4"},
{ config: "vllm_lora_tp_test_distributed", shard: 1, num_shards: 1, runner: "linux.g6.12xlarge.nvidia.gpu"},
{ config: "vllm_distributed_test_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.12xlarge.nvidia.gpu"}
]}
secrets: inherit

.gitignore

@ -259,6 +259,9 @@ gen
.pytest_cache
aten/build/*
# Linker scripts for prioritized text optimization
cmake/linker_script.ld
# Bram
plsdontbreak


@ -22,6 +22,7 @@ COMMON_COPTS = [
"-DHAVE_SHM_UNLINK=1",
"-D_FILE_OFFSET_BITS=64",
"-DUSE_FBGEMM",
"-DUSE_DISTRIBUTED",
"-DAT_PER_OPERATOR_HEADERS",
"-DATEN_THREADING=NATIVE",
"-DNO_CUDNN_DESTROY_HANDLE",


@ -181,9 +181,8 @@ elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(ppc64le)")
set(CPU_POWER ON)
endif()
# For non-supported platforms, turn USE_DISTRIBUTED off by default.
# NB: USE_DISTRIBUTED simply disables the backend; distributed code
# still gets built
# For non-supported platforms, turn USE_DISTRIBUTED off by default. It is not
# tested and likely won't work without additional changes.
if(NOT LINUX AND NOT WIN32)
set(USE_DISTRIBUTED
OFF
@ -234,6 +233,7 @@ cmake_dependent_option(INSTALL_TEST "Install test binaries if BUILD_TEST is on"
option(USE_CPP_CODE_COVERAGE "Compile C/C++ with code coverage flags" OFF)
option(USE_COLORIZE_OUTPUT "Colorize output during compilation" ON)
option(USE_ASAN "Use Address+Undefined Sanitizers" OFF)
option(USE_LSAN "Use Leak Sanitizer" OFF)
option(USE_TSAN "Use Thread Sanitizer" OFF)
option(USE_CUDA "Use CUDA" ON)
option(USE_XPU "Use XPU" ON)
@ -262,11 +262,11 @@ option(USE_PYTORCH_METAL "Use Metal for PyTorch iOS build" OFF)
option(USE_PYTORCH_METAL_EXPORT "Export Metal models on MacOSX desktop" OFF)
option(USE_NATIVE_ARCH "Use -march=native" OFF)
cmake_dependent_option(USE_MPS "Use MPS for macOS build" ON "MPS_FOUND" OFF)
option(USE_DISTRIBUTED "Enable default distributed backends" ON)
option(USE_DISTRIBUTED "Use distributed" ON)
cmake_dependent_option(USE_NCCL "Use NCCL" ON
"USE_DISTRIBUTED;USE_CUDA OR USE_ROCM;UNIX;NOT APPLE" OFF)
cmake_dependent_option(USE_XCCL "Use XCCL" ON
"USE_DISTRIBUTED;USE_XPU;UNIX;NOT APPLE" OFF)
"USE_XPU;UNIX;NOT APPLE" OFF)
cmake_dependent_option(USE_RCCL "Use RCCL" ON USE_NCCL OFF)
cmake_dependent_option(USE_RCCL "Use RCCL" ON "USE_NCCL;NOT WIN32" OFF)
cmake_dependent_option(USE_STATIC_NCCL "Use static NCCL" OFF "USE_NCCL" OFF)
@ -379,6 +379,13 @@ cmake_dependent_option(BUILD_BUNDLE_PTXAS "Bundle PTX into torch/bin fodler"
OFF "USE_CUDA" OFF)
cmake_dependent_option(USE_KLEIDIAI "Use KleidiAI for the ARM CPU & AARCH64 architecture." ON
"CPU_AARCH64" OFF)
# prioritized text linker, ON by default for AArch64+Linux, option visible to all AArch64, x86 and ppc64le.
set(USE_PRIORITIZED_TEXT_DEFAULT OFF)
if(LINUX AND CPU_AARCH64)
set(USE_PRIORITIZED_TEXT_DEFAULT ON)
endif()
cmake_dependent_option(USE_PRIORITIZED_TEXT_FOR_LD "Use prioritized text linker for ld."
"${USE_PRIORITIZED_TEXT_DEFAULT}" "CPU_INTEL OR CPU_AARCH64 OR CPU_POWER" OFF)
option(USE_MIMALLOC "Use mimalloc" OFF)
# Enable third party mimalloc library to improve memory allocation performance
@ -431,10 +438,11 @@ if(WIN32)
PATH_SUFFIXES lib
NO_DEFAULT_PATH)
if(NOT libuv_tmp_LIBRARY)
set(USE_DISTRIBUTED OFF)
set(USE_GLOO OFF)
message(
WARNING
"Libuv is not installed in current conda env. Set USE_GLOO to OFF. "
"Libuv is not installed in current conda env. Set USE_DISTRIBUTED to OFF. "
"Please run command 'conda install -c conda-forge libuv=1.39' to install libuv."
)
else()
@ -656,6 +664,11 @@ endif(MSVC)
string(APPEND CMAKE_CUDA_FLAGS " -Xfatbin -compress-all")
# Set linker max-page-size to 64KiB on AArch64 Linux
if(LINUX AND CPU_AARCH64)
add_link_options_if_supported("-z,max-page-size=0x10000")
endif()
# Set INTERN_BUILD_MOBILE for all mobile builds. Components that are not
# applicable to mobile are disabled by this variable. Setting
# `BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN` environment variable can force it
@ -889,9 +902,9 @@ IF(USE_FBGEMM_GENAI AND USE_ROCM AND NOT "gfx942" IN_LIST PYTORCH_ROCM_ARCH)
set(USE_FBGEMM_GENAI off)
endif()
# Set USE_FBGEMM_GENAI to ON for CUDA build on SM100
if(USE_CUDA AND "$ENV{TORCH_CUDA_ARCH_LIST}" MATCHES "10.0a")
message(WARNING "Setting USE_FBGEMM_GENAI to ON for CUDA build on SM100")
# Set USE_FBGEMM_GENAI to ON for CUDA build on SM100.
if(USE_CUDA AND "$ENV{TORCH_CUDA_ARCH_LIST}" MATCHES "10.0" AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
message(STATUS "Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a")
set(USE_FBGEMM_GENAI ON)
endif()
@ -1420,3 +1433,57 @@ if(BUILD_BUNDLE_PTXAS AND USE_CUDA)
install(PROGRAMS "${PROJECT_BINARY_DIR}/ptxas"
DESTINATION "${CMAKE_INSTALL_BINDIR}")
endif()
if(USE_PRIORITIZED_TEXT_FOR_LD)
add_compile_options(
$<$<COMPILE_LANGUAGE:C,CXX>:-ffunction-sections>
$<$<COMPILE_LANGUAGE:C,CXX>:-fdata-sections>
)
set(LINKER_SCRIPT_FILE_OUT "${CMAKE_SOURCE_DIR}/cmake/linker_script.ld")
set(LINKER_SCRIPT_FILE_IN "${CMAKE_SOURCE_DIR}/cmake/prioritized_text.txt")
add_custom_command(
OUTPUT "${LINKER_SCRIPT_FILE_OUT}"
COMMAND ${Python_EXECUTABLE} ${CMAKE_SOURCE_DIR}/tools/setup_helpers/generate_linker_script.py --filein "${LINKER_SCRIPT_FILE_IN}" --fout "${LINKER_SCRIPT_FILE_OUT}"
DEPENDS ${CMAKE_SOURCE_DIR}/tools/setup_helpers/generate_linker_script.py "${LINKER_SCRIPT_FILE_IN}"
COMMENT "Generating prioritized text linker files"
VERBATIM
)
add_custom_target(generate_linker_script DEPENDS "${LINKER_SCRIPT_FILE_OUT}")
if(BUILD_PYTHON)
set(LINKER_OPT_TARGETS torch_python)
endif()
if(NOT BUILD_LIBTORCHLESS)
list(APPEND LINKER_OPT_TARGETS torch_cpu c10)
if(USE_CUDA)
list(APPEND LINKER_OPT_TARGETS torch_cuda c10_cuda)
endif()
if(USE_XPU)
list(APPEND LINKER_OPT_TARGETS torch_xpu c10_xpu)
endif()
if(USE_ROCM)
list(APPEND LINKER_OPT_TARGETS torch_hip c10_hip)
endif()
endif()
foreach(tgt IN LISTS LINKER_OPT_TARGETS)
if(TARGET ${tgt})
add_dependencies("${tgt}" generate_linker_script)
target_link_options_if_supported(${tgt} "-T,${LINKER_SCRIPT_FILE_OUT}")
set_property(TARGET ${tgt} APPEND PROPERTY LINK_DEPENDS "${LINKER_SCRIPT_FILE_OUT}")
else()
message(WARNING "Requested target '${tgt}' for linker script optimization was not found.")
endif()
endforeach()
else()
if(LINUX AND CPU_AARCH64)
message(WARNING [[
It is strongly recommend to enable linker script optimization for all AArch64 Linux builds.
To do so please export USE_PRIORITIZED_TEXT_FOR_LD=1
]])
endif()
endif()


@ -50,6 +50,7 @@ Following is the Release Compatibility Matrix for PyTorch releases:
| PyTorch version | Python | C++ | Stable CUDA | Experimental CUDA | Stable ROCm |
| --- | --- | --- | --- | --- | --- |
| 2.9 | >=3.10, <=(3.14, 3.14t experimental) | C++17 | CUDA 12.6 (CUDNN 9.10.2.21), CUDA 12.8 (CUDNN 9.10.2.21) | CUDA 13.0 (CUDNN 9.13.0.50) | ROCm 6.4 |
| 2.8 | >=3.9, <=3.13, (3.13t experimental) | C++17 | CUDA 12.6 (CUDNN 9.10.2.21), CUDA 12.8 (CUDNN 9.10.2.21) | CUDA 12.9 (CUDNN 9.10.2.21) | ROCm 6.4 |
| 2.7 | >=3.9, <=3.13, (3.13t experimental) | C++17 | CUDA 11.8 (CUDNN 9.1.0.70), CUDA 12.6 (CUDNN 9.5.1.17) | CUDA 12.8 (CUDNN 9.7.1.26) | ROCm 6.3 |
| 2.6 | >=3.9, <=3.13, (3.13t experimental) | C++17 | CUDA 11.8, CUDA 12.4 (CUDNN 9.1.0.70) | CUDA 12.6 (CUDNN 9.5.1.17) | ROCm 6.2.4 |


@ -265,6 +265,14 @@ IF(USE_FBGEMM_GENAI)
"${FBGEMM_GENAI_SRCS}/cutlass_extensions/**/*.cu")
list(FILTER fbgemm_genai_native_cuda_cu INCLUDE REGEX ${FBGEMM_CUTLASS_KERNELS_REGEX})
# PyTorch is not built for 10.0a in CI, due to lack of portability,
# so we need to explicitly build these files for 10.0a.
foreach(cu_file ${fbgemm_genai_native_cuda_cu})
_BUILD_FOR_ADDITIONAL_ARCHS(
"${cu_file}"
"100a")
endforeach()
file(GLOB_RECURSE fbgemm_genai_native_cuda_cpp
"${FBGEMM_GENAI_SRCS}/common/*.cpp"
)


@ -133,12 +133,12 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
"resize_ called on tensor with symbolic shape")
TORCH_CHECK(
sparse_dim + dense_dim == static_cast<int64_t>(size.size()),
"number of dimensions must be sparse_dim (",
"'len(size) == sparse_dim + dense_dim' is not satisfied: len(size) = ",
size.size(),
", sparse_dim = ",
sparse_dim,
") + dense_dim (",
dense_dim,
"), but got ",
size.size());
", dense_dim = ",
dense_dim);
if (nnz() > 0) {
[[maybe_unused]] auto constexpr alt_options_msg =
"You could try the following options:\n\
@ -254,12 +254,12 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
"resize_and_clear_ called on tensor with symbolic shape")
TORCH_CHECK(
sparse_dim + dense_dim == static_cast<int64_t>(size.size()),
"number of dimensions must be sparse_dim (",
"'len(size) == sparse_dim + dense_dim' is not satisfied: len(size) = ",
size.size(),
", sparse_dim = ",
sparse_dim,
") + dense_dim (",
dense_dim,
"), but got ",
size.size());
", dense_dim = ",
dense_dim);
set_sizes_and_strides(size, std::vector<int64_t>(size.size()));
sparse_dim_ = sparse_dim;


@ -64,6 +64,7 @@ constexpr DynamicTypeBits kDynamicClassTypeBit = DYNAMIC_TYPE_BIT(10);
_(ScalarType, kDynamicIntTypeBit, 1) \
_(Layout, kDynamicIntTypeBit, 1) \
_(SymInt, kDynamicIntTypeBit, 1) \
_(SymBool, kDynamicIntTypeBit, 1) \
_(MemoryFormat, kDynamicIntTypeBit, 1)
#define FORWARD_DECL_TYPE(NAME, _, __) struct NAME ## Type;


@ -644,6 +644,8 @@ inline void bgemm_internal_cublas_half_helper(CUDABLAS_BGEMM_ARGTYPES_AND_C_DTYP
void * beta_ptr = &fbeta;
#ifdef USE_ROCM
int flag = 0;
rocblas_datatype c_type = std::is_same<C_Dtype, float>::value ? rocblas_datatype_f32_r : rocblas_datatype_f16_r;
rocblas_datatype d_type = c_type;
#if USE_GEMM_FLAGS_FP16_ALT_IMPL
flag = at::ROCmBackwardPassGuard::is_backward_pass() ? rocblas_gemm_flags_fp16_alt_impl : 0;
#endif
@ -652,8 +654,8 @@ inline void bgemm_internal_cublas_half_helper(CUDABLAS_BGEMM_ARGTYPES_AND_C_DTYP
hipOperationToRocOperation(opb), (int)m, (int)n, (int)k,
(void*)alpha_ptr, a, rocblas_datatype_f16_r, (int)lda, stridea,
b, rocblas_datatype_f16_r, (int)ldb, strideb,
(void*)beta_ptr, c, rocblas_datatype_f16_r, (int)ldc, stridec,
c, rocblas_datatype_f16_r, (int)ldc, stridec,
(void*)beta_ptr, c, c_type, (int)ldc, stridec,
c, d_type, (int)ldc, stridec,
(int) num_batches, rocblas_datatype_f32_r, rocblas_gemm_algo_standard,
0, flag)));
#else
@ -1096,6 +1098,8 @@ inline void gemm_internal_cublas_half_helper(CUDABLAS_GEMM_ARGTYPES_AND_C_DTYPE(
GEMM_CHECK_ARGVALUES(at::Half);
#ifdef USE_ROCM
int flag = 0;
rocblas_datatype c_type = std::is_same<C_Dtype, float>::value ? rocblas_datatype_f32_r : rocblas_datatype_f16_r;
rocblas_datatype d_type = c_type;
#if USE_GEMM_FLAGS_FP16_ALT_IMPL
flag = at::ROCmBackwardPassGuard::is_backward_pass() ? rocblas_gemm_flags_fp16_alt_impl : 0;
#endif
@ -1115,10 +1119,10 @@ inline void gemm_internal_cublas_half_helper(CUDABLAS_GEMM_ARGTYPES_AND_C_DTYPE(
ldb,
beta_ptr,
c,
rocblas_datatype_f16_r,
c_type,
ldc,
c,
rocblas_datatype_f16_r,
d_type,
ldc,
rocblas_datatype_f32_r,
rocblas_gemm_algo_standard,
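The ROCm hunks above choose the C/D matrix datatype from the C_Dtype template parameter instead of hard-coding f16. A minimal sketch of that selection, assuming a stand-in enum (DemoDatatype) and a placeholder half type (DemoHalf) that are illustrative only and are not part of rocBLAS or ATen:

```cpp
#include <type_traits>

// Hypothetical stand-in for rocblas_datatype; the real enum lives in rocBLAS.
enum class DemoDatatype { f16_r, f32_r };

// Hypothetical placeholder for at::Half, used only to instantiate the template.
struct DemoHalf {};

// Pick the C/D matrix datatype from the output template parameter:
// float output -> f32, anything else -> f16, mirroring the ternary in the diff.
template <typename C_Dtype>
constexpr DemoDatatype select_c_type() {
  return std::is_same<C_Dtype, float>::value ? DemoDatatype::f32_r
                                              : DemoDatatype::f16_r;
}
```

Because the selection depends only on a template parameter, it can be evaluated at compile time; the diff keeps it as a runtime ternary, which behaves the same way.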


@ -45,6 +45,24 @@ struct OffsetCalculator {
C10_HOST_DEVICE offset_type get(index_t linear_idx) const {
offset_type offsets;
#if defined(USE_ROCM)
if ((dims > 0) && (dims <= 2)) {
auto divmod = sizes_[0].divmod(linear_idx);
#pragma unroll
for (int arg = 0; arg < NARGS; arg++)
offsets[arg] = divmod.mod * strides_[0][arg];
if (dims >= 2) {
divmod = sizes_[1].divmod(divmod.div);
#pragma unroll
for (int arg = 0; arg < NARGS; arg++)
offsets[arg] += divmod.mod * strides_[1][arg];
}
// [...]
return offsets;
}
#endif
#pragma unroll
for (int arg = 0; arg < NARGS; arg++) {
offsets[arg] = 0;
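The ROCm fast path above handles one- and two-dimensional iteration by peeling the linear index one dimension at a time. A rough sketch of the two-dimensional case, assuming plain `/` and `%` in place of the precomputed divmod helper and a hypothetical free function offsets_2d that does not exist in the source:

```cpp
#include <array>
#include <cstdint>

// Hypothetical simplified version of the 2-D case: dimension 0 is the
// fastest-moving one, and each of the NARGS operands has its own strides.
template <int NARGS>
std::array<uint32_t, NARGS> offsets_2d(
    uint32_t linear_idx,
    const std::array<uint32_t, 2>& sizes,
    const std::array<std::array<uint32_t, NARGS>, 2>& strides) {
  std::array<uint32_t, NARGS> offsets{};

  // Peel off the index along dimension 0.
  const uint32_t idx0 = linear_idx % sizes[0];
  const uint32_t rest = linear_idx / sizes[0];
  for (int arg = 0; arg < NARGS; ++arg) {
    offsets[arg] = idx0 * strides[0][arg];
  }

  // Peel off the index along dimension 1 and accumulate.
  const uint32_t idx1 = rest % sizes[1];
  for (int arg = 0; arg < NARGS; ++arg) {
    offsets[arg] += idx1 * strides[1][arg];
  }
  return offsets;
}
```

With sizes {4, 3} and a linear index of 6, the indices are (2, 1), so each operand's offset is 2 times its dimension-0 stride plus 1 times its dimension-1 stride.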


@ -14,6 +14,7 @@
#include <c10/util/accumulate.h>
#include <c10/util/irange.h>
#include <c10/macros/Macros.h>
#include <algorithm>
#include <limits>
#include <utility>
@ -300,67 +301,50 @@ struct ConvParams {
bool allow_tf32{};
bool is_strided() const {
bool is_strided = false;
for (const auto& s : stride) {
is_strided |= (s != 1);
}
return is_strided;
return std::any_of(
stride.cbegin(), stride.cend(), [](const T& s) { return s != 1; });
}
bool is_dilated() const {
bool is_dilated = false;
for (const auto& d : dilation) {
is_dilated |= (d != 1);
}
return is_dilated;
return std::any_of(
dilation.cbegin(), dilation.cend(), [](const T& d) { return d != 1; });
}
bool is_padded() const {
bool is_padded = false;
for (auto p : padding) {
is_padded |= (p != 0);
}
return is_padded;
return std::any_of(
padding.cbegin(), padding.cend(), [](const T& p) { return p != 0; });
}
bool is_output_padding_neg() const {
bool is_non_neg = false;
for (const auto& p : output_padding) {
is_non_neg |= (p < 0);
}
return is_non_neg;
return std::any_of(
output_padding.cbegin(),
output_padding.cend(),
[](const T& p) { return p < 0; });
}
bool is_output_padding_big() const {
bool is_big = false;
// Revisit this with std::views::zip at C++20.
for (auto i: c10::irange(output_padding.size())) {
is_big |= (output_padding[i] >= stride[i]);
if (output_padding[i] >= stride[i]) {
return true;
}
}
return is_big;
return false;
}
bool is_padding_neg() const {
bool is_non_neg = false;
for (const auto& p : padding) {
is_non_neg |= (p < 0);
}
return is_non_neg;
return std::any_of(
padding.cbegin(), padding.cend(), [](const T& p) { return p < 0; });
}
bool is_dilation_neg() const {
bool is_non_neg = false;
for (const auto& p : dilation) {
is_non_neg |= (p < 0);
}
return is_non_neg;
return std::any_of(
dilation.cbegin(), dilation.cend(), [](const T& d) { return d < 0; });
}
bool is_stride_nonpos() const {
bool is_nonpos = false;
for (const auto& s : stride) {
is_nonpos |= (s <= 0);
}
return is_nonpos;
return std::any_of(
stride.cbegin(), stride.cend(), [](const T& s) { return s <= 0; });
}
void view1d_as_2d() {
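The ConvParams hunk above replaces OR-accumulating loops with std::any_of. A small sketch comparing the two forms, assuming a hypothetical struct (StrideOnlyParams) that models only the stride field with a fixed int64_t element type rather than the template parameter T:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for ConvParams; only the stride field is modelled.
struct StrideOnlyParams {
  std::vector<int64_t> stride;

  // Old style: OR-accumulate over every element, with no early exit.
  bool is_strided_loop() const {
    bool is_strided = false;
    for (const auto& s : stride) {
      is_strided |= (s != 1);
    }
    return is_strided;
  }

  // New style: std::any_of stops at the first element that is not 1.
  bool is_strided_any_of() const {
    return std::any_of(
        stride.cbegin(), stride.cend(), [](int64_t s) { return s != 1; });
  }
};
```

Both forms return false for an empty range, so the rewrite should not change results; the difference is readability and the short-circuit on the first match.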


@ -18,6 +18,7 @@
#include <ATen/ops/is_set_to_native.h>
#include <ATen/ops/size_native.h>
#include <ATen/ops/stride_native.h>
#include <ATen/ops/sym_is_contiguous_native.h>
#include <ATen/ops/sym_numel_native.h>
#include <ATen/ops/sym_size_native.h>
#include <ATen/ops/sym_storage_offset_native.h>
@ -57,6 +58,12 @@ c10::SymInt sym_size(const Tensor& self, int64_t dim) {
return self.sym_size(dim);
}
c10::SymBool sym_is_contiguous(
const Tensor& self,
c10::MemoryFormat memory_format) {
return self.sym_is_contiguous(memory_format);
}
c10::SymInt sym_stride(const Tensor& self, int64_t dim) {
return self.sym_stride(dim);
}


@ -1080,16 +1080,6 @@ static bool _scaled_mm_allowed_device(bool sm90_only=false, bool sm100_only=fals
#endif
}
static bool _grouped_mm_allowed_device() {
#ifdef USE_ROCM
return false;
#else
auto dprops = at::cuda::getCurrentDeviceProperties();
// CUDA capability 8.0 and greater
return dprops->major >= 8;
#endif
}
#ifdef USE_ROCM
static bool _scaled_mm_is_fnuz() {
return at::detail::getCUDAHooks().isGPUArch({"gfx942"});
@ -1786,14 +1776,19 @@ Tensor _grouped_mm_cuda(const Tensor& mat_a, const Tensor& mat_b,
const std::optional<at::Tensor>& offs,
const std::optional<at::Tensor>& bias,
std::optional<c10::ScalarType> out_dtype) {
#ifndef USE_ROCM
_grouped_mm_validate_inputs(mat_a, mat_b, offs, bias, out_dtype);
bool a_b_and_out_are_bf16 = (
mat_a.dtype() == at::kBFloat16 &&
mat_b.dtype() == at::kBFloat16 &&
out_dtype.value_or(at::kBFloat16) == at::kBFloat16
);
#ifndef USE_ROCM
bool use_fast_path = _scaled_mm_allowed_device(/*sm90_only*/true, /*sm100_only*/true) && a_b_and_out_are_bf16;
#else
// _scaled_mm_allowed_device is used here within _grouped_mm_cuda which seems incorrect since scale is not used.
// the _grouped_mm_fallback should be safe for any ROCm GPU since it's just calling typical mm/bmm
bool use_fast_path = false;
#endif
const auto out_dtype_ = _resolve_grouped_mm_out_dtype(mat_a, mat_b, out_dtype);
Tensor out = create_grouped_gemm_output_tensor(mat_a, mat_b, offs, out_dtype_);
if (use_fast_path) {
@ -1803,9 +1798,6 @@ std::optional<c10::ScalarType> out_dtype) {
_grouped_mm_fallback(mat_a, mat_b, offs, bias, out_dtype, out);
}
return out;
#else
TORCH_CHECK(false, "grouped gemm is not supported on ROCM")
#endif
}
Tensor _bmm_dtype_cuda(const Tensor& batch1, const Tensor& batch2, const at::ScalarType out_dtype) {


@ -482,7 +482,7 @@ auto build_graph(
auto scaled_dot_product_flash_attention_options =
fe::graph::SDPA_attributes()
.set_name("CUDNN_SDPA")
.set_is_inference(return_softmaxstats == false)
.set_generate_stats(return_softmaxstats)
.set_causal_mask(is_causal)
.set_attn_scale(attn_scale);
if (use_ragged_in_dense(q, k, v, o, attn_bias.has_value())) {
@ -702,7 +702,7 @@ auto build_graph_nestedtensor(
auto scaled_dot_product_flash_attention_options =
fe::graph::SDPA_attributes()
.set_name("CUDNN_SDPA_NESTEDTENSOR")
.set_is_inference(return_softmaxstats == false)
.set_generate_stats(return_softmaxstats)
.set_causal_mask(is_causal)
.set_attn_scale(attn_scale)
.set_seq_len_q(SEQ_LEN_Q_)


@ -39,6 +39,13 @@ struct lerp_alpha_functor {
}
};
struct native_dropout_mask_and_scale_functor {
template <typename TI, typename TA>
inline TA operator()(const TI a, const TI b, const TA scale) {
return static_cast<TA>(a) * static_cast<TA>(b) * scale;
}
};
struct fmax_functor {
template <typename T>
inline T operator()(const T a, const T b) {
@ -427,6 +434,10 @@ REGISTER_BINARY_ALPHA_OP(lerp_alpha, uchar, uchar, uchar);
REGISTER_BINARY_ALPHA_OP(lerp_alpha, char, char, char);
REGISTER_BINARY_ALPHA_OP(lerp_alpha, bool, bool, bool);
REGISTER_BINARY_ALPHA_OP(native_dropout_mask_and_scale, float, float, float);
REGISTER_BINARY_ALPHA_OP(native_dropout_mask_and_scale, bfloat, bfloat, bfloat);
REGISTER_BINARY_ALPHA_OP(native_dropout_mask_and_scale, half, half, half);
REGISTER_BINARY_ALPHA_OP(add_alpha, bfloat, bfloat, bfloat);
REGISTER_BINARY_ALPHA_OP(sub_alpha, bfloat, bfloat, bfloat);
REGISTER_BINARY_ALPHA_OP(lerp_alpha, bfloat, bfloat, bfloat);

View File

@ -168,6 +168,10 @@ static void lerp_scalar_mps_kernel(at::TensorIteratorBase& iter, const Scalar& w
lib.exec_binary_kernel(iter, "lerp_alpha", weight);
}
static void native_dropout_mask_and_scale_mps_kernel(at::TensorIteratorBase& iter, const Scalar& scale) {
lib.exec_binary_kernel(iter, "native_dropout_mask_and_scale", scale);
}
static void mul_mps_kernel(TensorIteratorBase& iter) {
lib.exec_binary_kernel(iter, "mul");
}

View File

@ -0,0 +1,45 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/TensorOperators.h>
#include <ATen/mps/MPSGeneratorImpl.h>
#include <ATen/native/Distributions.h>
#include <ATen/native/mps/OperationUtils.h>
#include <ATen/native/mps/operations/BinaryKernel.h>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/NativeFunctions.h>
#else
#include <ATen/ops/bernoulli.h>
#include <ATen/ops/empty_like.h>
#include <ATen/ops/native_dropout_backward_native.h>
#include <ATen/ops/native_dropout_native.h>
#include <ATen/ops/ones_like.h>
#endif
namespace at::native {
static Tensor native_dropout_mask_and_scale(const Tensor& input, const Tensor& mask, float scale) {
auto output = at::empty_like(input);
mps::binary_op_kernel("native_dropout_mask_and_scale", input, mask, output, scale);
return output;
}
std::tuple<Tensor, Tensor> native_dropout_mps(const Tensor& input, double p, std::optional<bool> train) {
if (input.numel() == 0 || !train.value_or(false) || p == 0) {
return {input.clone(), at::ones_like(input, input.options().dtype(c10::kBool))};
}
float p_comp = 1.0f - p;
Tensor mask = at::empty_like(input, input.options().dtype(c10::kBool));
mask.bernoulli_(p_comp);
auto scale = p_comp == 0 ? 0.0f : 1.0f / p_comp;
Tensor output = native_dropout_mask_and_scale(input, mask, scale);
return {std::move(output), std::move(mask)};
}
Tensor native_dropout_backward_mps(const Tensor& grad, const Tensor& mask, double scale) {
auto grad_float = isFloatingType(grad.scalar_type()) ? grad : grad.to(c10::kFloat);
return native_dropout_mask_and_scale(grad_float, mask, scale);
}
} // namespace at::native
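
For readers who want to check the arithmetic of the new MPS dropout path, the mask-and-scale step can be sketched on the CPU as below. This is only an illustration under stated assumptions: `dropout_scale` and `mask_and_scale` are hypothetical helper names, not ATen functions, and the mask is passed in explicitly instead of being sampled with `bernoulli_`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for the mask-and-scale step (illustrative names,
// not part of ATen). Mirrors the diff: scale = 1 / (1 - p), or 0 when p == 1.
inline float dropout_scale(double p) {
  const float keep_prob = 1.0f - static_cast<float>(p);
  return keep_prob == 0.0f ? 0.0f : 1.0f / keep_prob;
}

// out[i] = input[i] * mask[i] * scale, the same product the Metal functor computes.
inline std::vector<float> mask_and_scale(
    const std::vector<float>& input,
    const std::vector<bool>& mask,
    float scale) {
  std::vector<float> out(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    out[i] = input[i] * (mask[i] ? 1.0f : 0.0f) * scale;
  }
  return out;
}
```

Kept elements are scaled up by `1 / (1 - p)` so that the expected value of the output matches the input; dropped elements become zero.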

View File

@ -288,6 +288,7 @@
dispatch:
CPU: native_dropout_cpu
CUDA: native_dropout_cuda
MPS: native_dropout_mps
NestedTensorCPU, NestedTensorHPU, NestedTensorCUDA: native_dropout_nested
tags: [nondeterministic_seeded, core]
autogen: native_dropout.out
@ -296,6 +297,7 @@
dispatch:
CPU, NestedTensorCPU, NestedTensorHPU, NestedTensorCUDA: native_dropout_backward
CUDA: native_dropout_backward_cuda
MPS: native_dropout_backward_mps
autogen: native_dropout_backward.out
tags: pointwise
@ -5511,6 +5513,13 @@
tags: core
manual_cpp_binding: True
- func: sym_is_contiguous(Tensor self, MemoryFormat memory_format=contiguous_format) -> SymBool
variants: function
device_check: NoCheck
device_guard: False
tags: core
manual_cpp_binding: True
- func: sym_numel(Tensor self) -> SymInt
variants: function
device_check: NoCheck

View File

@ -391,13 +391,13 @@ void _validate_sparse_coo_tensor_args(
int64_t sparse_dim = indices.size(0);
int64_t dense_dim = values.dim() - 1;
TORCH_CHECK(
static_cast<int64_t>(size.size()) == sparse_dim + dense_dim,
"number of dimensions must be sparse_dim (",
sparse_dim,
") + dense_dim (",
dense_dim,
"), but got ",
size.size());
sparse_dim + dense_dim == static_cast<int64_t>(size.size()),
"'len(size) == sparse_dim + dense_dim' is not satisfied: len(size) = ",
size.size(),
", sparse_dim = ",
sparse_dim,
", dense_dim = ",
dense_dim);
if (check_pinning) {
TORCH_CHECK(
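
The reworded check above states the invariant first and then reports all three quantities. A standalone sketch of the same check, assuming a hypothetical `check_sparse_dims` helper that throws `std::invalid_argument` in place of `TORCH_CHECK`, might look like this:

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical standalone version of the dimension check; not the ATen code.
// A sparse COO tensor must satisfy len(size) == sparse_dim + dense_dim.
inline void check_sparse_dims(
    const std::vector<int64_t>& size,
    int64_t sparse_dim,
    int64_t dense_dim) {
  if (sparse_dim + dense_dim != static_cast<int64_t>(size.size())) {
    std::ostringstream msg;
    msg << "'len(size) == sparse_dim + dense_dim' is not satisfied: len(size) = "
        << size.size() << ", sparse_dim = " << sparse_dim
        << ", dense_dim = " << dense_dim;
    throw std::invalid_argument(msg.str());
  }
}
```

Reporting every operand of the failed condition lets the caller see which of the three values is off without re-deriving it from the tensors.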

View File

@ -948,7 +948,6 @@ def define_buck_targets(
[
("torch/csrc/api/include", "torch/**/*.h"),
("", "torch/csrc/**/*.h"),
("", "torch/csrc/**/*.hpp"),
("", "torch/nativert/**/*.h"),
("", "torch/headeronly/**/*.h"),
("", "torch/script.h"),
@ -2034,7 +2033,6 @@ def define_buck_targets(
("", "caffe2/utils/*.h"),
("", "caffe2/core/*.h"),
("", "torch/csrc/*.h"),
("", "torch/csrc/*.hpp"),
("", "torch/csrc/api/include/torch/*.h"),
("", "torch/csrc/autograd/*.h"),
("", "torch/csrc/autograd/*/*.h"),

View File

@ -313,8 +313,15 @@ void TensorImpl::throw_data_ptr_access_error() const {
c10::SymBool TensorImpl::sym_is_contiguous_custom(
at::MemoryFormat memory_format) const {
if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
this, memory_format);
// To reduce BC breakage and avoid forcing every custom implementation to
// provide sym_is_contiguous, call is_contiguous when the tensor does not
// have symbolic sizes/strides.
if (C10_UNLIKELY(has_symbolic_sizes_strides_)) {
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(
this, memory_format);
} else {
return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
this, memory_format);
}
}
return sym_is_contiguous_default(memory_format);
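
The dispatch rule in this hunk can be sketched in isolation as follows. Everything here is a simplified assumption for illustration: `Interpreter` and `query_contiguity` are hypothetical names, and `std::optional<bool>` merely stands in for a symbolic boolean whose value may not be known yet (it is not `c10::SymBool`).

```cpp
#include <optional>

// Hypothetical interpreter interface for illustration only; not the c10 vtable.
struct Interpreter {
  int plain_calls = 0;
  int symbolic_calls = 0;

  // Concrete query: always yields a definite answer.
  bool is_contiguous() {
    ++plain_calls;
    return true;
  }

  // Symbolic query: modeled as "unknown" until shapes are resolved.
  std::optional<bool> sym_is_contiguous() {
    ++symbolic_calls;
    return std::nullopt;
  }
};

// Only take the symbolic path when the tensor really has symbolic
// sizes/strides, so existing is_contiguous overrides keep working.
inline std::optional<bool> query_contiguity(
    Interpreter& interp,
    bool has_symbolic_sizes_strides) {
  if (has_symbolic_sizes_strides) {
    return interp.sym_is_contiguous();
  }
  return interp.is_contiguous();
}
```

The point of the split is backward compatibility: code that only implements the concrete query is never asked the symbolic one unless symbolic shapes are actually present.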

View File

@ -60,6 +60,10 @@ struct NoopPyInterpreterVTable final : public PyInterpreterVTable {
bool is_contiguous(const TensorImpl* self, at::MemoryFormat) const override {
PANIC(is_contiguous);
}
c10::SymBool sym_is_contiguous(const TensorImpl* self, at::MemoryFormat)
const override {
PANIC(sym_is_contiguous);
}
bool is_strides_like(const TensorImpl* self, at::MemoryFormat)
const override {
PANIC(is_strides_like);

View File

@ -168,6 +168,9 @@ struct C10_API PyInterpreterVTable {
virtual bool is_contiguous(const TensorImpl* self, at::MemoryFormat)
const = 0;
virtual c10::SymBool sym_is_contiguous(
const TensorImpl* self,
at::MemoryFormat) const = 0;
virtual bool is_strides_like(const TensorImpl* self, at::MemoryFormat)
const = 0;
virtual bool is_non_overlapping_and_dense(const TensorImpl* self) const = 0;

View File

@ -78,6 +78,18 @@ int device_count_impl(bool fail_if_no_driver) {
"would like to use GPUs, turn off ASAN.");
break;
#endif // C10_ASAN_ENABLED
#if defined(_WIN32) && CUDA_VERSION >= 13000
// Workaround for CUDA-13.0 error handling on Windows, see
// https://github.com/pytorch/pytorch/issues/162333#issuecomment-3267929585
case cudaErrorNotSupported:
if (!fail_if_no_driver) {
TORCH_WARN(
"cudaGetDeviceCount() returned cudaErrorNotSupported, "
"likely using older driver or on CPU machine");
count = 0;
break;
}
#endif
default:
TORCH_CHECK(
false,

View File

@ -196,20 +196,25 @@ TTarget* assign_ptr_(TTarget* rhs) {
}
}
// Increment needs to be acquire-release to make use_count() and
// unique() reliable.
// The only requirement for refcount increment is that it happens-before
// decrement, so no additional memory ordering is needed.
inline uint32_t atomic_refcount_increment(std::atomic<uint32_t>& refcount) {
return refcount.fetch_add(1, std::memory_order_acq_rel) + 1;
return refcount.fetch_add(1, std::memory_order_relaxed) + 1;
}
// weak_use_count() is only used for testing, so we don't need it to
// be reliable. Relaxed should be fine.
inline uint32_t atomic_weakcount_increment(std::atomic<uint32_t>& weakcount) {
return weakcount.fetch_add(1, std::memory_order_relaxed) + 1;
}
// Both decrements need to be acquire-release for correctness. See
// e.g. std::shared_ptr implementation.
// The requirement is that all modifications to the managed object happen-before
// invocation of the managed object destructor, and that allocation of the
// managed object storage happens-before deallocation of the storage.
//
// To get this ordering, all non-final decrements must synchronize-with the
// final decrement. So all non-final decrements have to store-release while the
// final decrement has to load-acquire, either directly or with the help of
// fences. But it's easiest just to have all decrements be acq-rel. And it turns
// out, on modern architectures and chips, it's also fastest.
inline uint32_t atomic_refcount_decrement(std::atomic<uint32_t>& refcount) {
return refcount.fetch_sub(1, std::memory_order_acq_rel) - 1;
}
@ -332,7 +337,7 @@ class intrusive_ptr final {
intrusive_ptr() noexcept
: intrusive_ptr(NullType::singleton(), raw::DontIncreaseRefcount{}) {}
intrusive_ptr(std::nullptr_t) noexcept
/* implicit */ intrusive_ptr(std::nullptr_t) noexcept
: intrusive_ptr(NullType::singleton(), raw::DontIncreaseRefcount{}) {}
// This constructor will not increase the ref counter for you.
@ -445,14 +450,14 @@ class intrusive_ptr final {
if (target_ == NullType::singleton()) {
return 0;
}
return target_->refcount_.load(std::memory_order_acquire);
return target_->refcount_.load(std::memory_order_relaxed);
}
uint32_t weak_use_count() const noexcept {
if (target_ == NullType::singleton()) {
return 0;
}
return target_->weakcount_.load(std::memory_order_acquire);
return target_->weakcount_.load(std::memory_order_relaxed);
}
bool unique() const noexcept {
@ -851,14 +856,14 @@ class weak_intrusive_ptr final {
return 0;
}
return target_->refcount_.load(
std::memory_order_acquire); // refcount, not weakcount!
std::memory_order_relaxed); // refcount, not weakcount!
}
uint32_t weak_use_count() const noexcept {
if (target_ == NullType::singleton()) {
return 0;
}
return target_->weakcount_.load(std::memory_order_acquire);
return target_->weakcount_.load(std::memory_order_relaxed);
}
bool expired() const noexcept {
@ -866,18 +871,22 @@ class weak_intrusive_ptr final {
}
intrusive_ptr<TTarget, NullType> lock() const noexcept {
if (expired()) {
if (target_ == NullType::singleton()) {
return intrusive_ptr<TTarget, NullType>();
} else {
auto refcount = target_->refcount_.load(std::memory_order_seq_cst);
auto refcount = target_->refcount_.load(std::memory_order_relaxed);
do {
if (refcount == 0) {
// Object already destructed, no strong references left anymore.
// Return nullptr.
return intrusive_ptr<TTarget, NullType>();
}
} while (
!target_->refcount_.compare_exchange_weak(refcount, refcount + 1));
} while (!target_->refcount_.compare_exchange_weak(
refcount,
refcount + 1,
std::memory_order_acquire,
std::memory_order_relaxed));
return intrusive_ptr<TTarget, NullType>(
target_, raw::DontIncreaseRefcount{});
}
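
The memory-ordering choices in this file can be summarized with a small self-contained sketch. It is not the `intrusive_ptr` implementation: `RefCounted`, `incref`, `decref`, and `try_lock` are hypothetical names used only to show relaxed increments, acquire-release decrements, and a compare-exchange loop for the weak-to-strong upgrade.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for a refcounted control block; not c10's actual type.
struct RefCounted {
  std::atomic<uint32_t> refcount{1};
};

// Increment: relaxed is enough, because the caller already holds a reference,
// so the object cannot be destroyed concurrently.
inline uint32_t incref(RefCounted& rc) {
  return rc.refcount.fetch_add(1, std::memory_order_relaxed) + 1;
}

// Decrement: acq_rel, so that writes made before any non-final decrement
// (release side) are visible to the thread doing the final one (acquire side).
// Returns true when the caller dropped the last reference.
inline bool decref(RefCounted& rc) {
  return rc.refcount.fetch_sub(1, std::memory_order_acq_rel) == 1;
}

// Weak-to-strong upgrade: succeeds only while the count is still non-zero.
inline bool try_lock(RefCounted& rc) {
  uint32_t count = rc.refcount.load(std::memory_order_relaxed);
  do {
    if (count == 0) {
      return false;  // no strong references left
    }
  } while (!rc.refcount.compare_exchange_weak(
      count,
      count + 1,
      std::memory_order_acquire,
      std::memory_order_relaxed));
  return true;
}
```

On a failed compare-exchange, `count` is reloaded with the current value, so the zero check runs again before the next attempt and a destroyed object is never resurrected.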

View File

@ -540,9 +540,11 @@ if(NOT INTERN_BUILD_MOBILE AND NOT BUILD_LITE_INTERPRETER)
${TORCH_SRC_DIR}/csrc/utils/byte_order.cpp
)
append_filelist("libtorch_distributed_base_sources" TORCH_SRCS)
if(NOT WIN32)
append_filelist("libtorch_distributed_extra_sources" TORCH_SRCS)
if(USE_DISTRIBUTED)
append_filelist("libtorch_distributed_base_sources" TORCH_SRCS)
if(NOT WIN32)
append_filelist("libtorch_distributed_extra_sources" TORCH_SRCS)
endif()
endif()
endif()
@ -550,6 +552,11 @@ if(USE_CUDA OR USE_ROCM)
append_filelist("libtorch_cuda_core_sources" Caffe2_GPU_HIP_JIT_FUSERS_SRCS)
endif()
if(USE_CUDA)
# eventually do rocm
append_filelist("libtorch_nativert_cuda_sources" Caffe2_GPU_SRCS)
endif()
if(USE_CUDA)
list(APPEND Caffe2_GPU_CU_SRCS ${Caffe2_GPU_HIP_JIT_FUSERS_SRCS})
add_library(caffe2_nvrtc SHARED ${ATen_NVRTC_STUB_SRCS})
@ -566,30 +573,32 @@ if(USE_CUDA)
list(APPEND Caffe2_GPU_SRCS
${TORCH_SRC_DIR}/csrc/cuda/nccl.cpp)
endif()
append_filelist("libtorch_cuda_distributed_base_sources" Caffe2_GPU_SRCS)
if(NOT WIN32)
append_filelist("libtorch_cuda_distributed_extra_sources" Caffe2_GPU_SRCS)
set_source_files_properties(
${TORCH_SRC_DIR}/csrc/distributed/c10d/ProcessGroupNCCL.cpp
${TORCH_SRC_DIR}/csrc/distributed/c10d/cuda/utils.cpp
${TORCH_SRC_DIR}/csrc/distributed/c10d/intra_node_comm.cpp
${TORCH_SRC_DIR}/csrc/distributed/c10d/symm_mem/CudaDMAConnectivity.cpp
${TORCH_SRC_DIR}/csrc/distributed/c10d/symm_mem/CUDASymmetricMemory.cu
${TORCH_SRC_DIR}/csrc/distributed/c10d/symm_mem/CUDASymmetricMemoryOps.cu
${TORCH_SRC_DIR}/csrc/distributed/c10d/symm_mem/CUDASymmetricMemoryUtils.cpp
${TORCH_SRC_DIR}/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu
${TORCH_SRC_DIR}/csrc/distributed/c10d/symm_mem/cuda_mem_pool.cpp
PROPERTIES COMPILE_FLAGS "-DPYTORCH_C10_DRIVER_API_SUPPORTED=1"
)
endif()
if(USE_DISTRIBUTED)
append_filelist("libtorch_cuda_distributed_base_sources" Caffe2_GPU_SRCS)
if(NOT WIN32)
append_filelist("libtorch_cuda_distributed_extra_sources" Caffe2_GPU_SRCS)
set_source_files_properties(
${TORCH_SRC_DIR}/csrc/distributed/c10d/ProcessGroupNCCL.cpp
${TORCH_SRC_DIR}/csrc/distributed/c10d/cuda/utils.cpp
${TORCH_SRC_DIR}/csrc/distributed/c10d/intra_node_comm.cpp
${TORCH_SRC_DIR}/csrc/distributed/c10d/symm_mem/CudaDMAConnectivity.cpp
${TORCH_SRC_DIR}/csrc/distributed/c10d/symm_mem/CUDASymmetricMemory.cu
${TORCH_SRC_DIR}/csrc/distributed/c10d/symm_mem/CUDASymmetricMemoryOps.cu
${TORCH_SRC_DIR}/csrc/distributed/c10d/symm_mem/CUDASymmetricMemoryUtils.cpp
${TORCH_SRC_DIR}/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu
${TORCH_SRC_DIR}/csrc/distributed/c10d/symm_mem/cuda_mem_pool.cpp
PROPERTIES COMPILE_FLAGS "-DPYTORCH_C10_DRIVER_API_SUPPORTED=1"
)
endif()
set(ASYNC_MM_FILE "${TORCH_SRC_DIR}/csrc/distributed/c10d/cuda/AsyncMM.cu")
# Disable the warning to make cutlass warp-specialized cooperative kernel build for gcc-9
if(CMAKE_COMPILER_IS_GNUCXX)
set_source_files_properties(${ASYNC_MM_FILE} PROPERTIES COMPILE_FLAGS "-Wno-unused-but-set-variable")
endif()
if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.0 AND CUDA_NVCC_FLAGS MATCHES ".*compute_90.*")
set_source_files_properties(${ASYNC_MM_FILE} PROPERTIES COMPILE_FLAGS "-gencode arch=compute_90a,code=sm_90a")
set(ASYNC_MM_FILE "${TORCH_SRC_DIR}/csrc/distributed/c10d/cuda/AsyncMM.cu")
# Disable the warning to make cutlass warp-specialized cooperative kernel build for gcc-9
if(CMAKE_COMPILER_IS_GNUCXX)
set_source_files_properties(${ASYNC_MM_FILE} PROPERTIES COMPILE_FLAGS "-Wno-unused-but-set-variable")
endif()
if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.0 AND CUDA_NVCC_FLAGS MATCHES ".*compute_90.*")
set_source_files_properties(${ASYNC_MM_FILE} PROPERTIES COMPILE_FLAGS "-gencode arch=compute_90a,code=sm_90a")
endif()
endif()
set_source_files_properties(
${TORCH_ROOT}/aten/src/ATen/cuda/detail/LazyNVRTC.cpp
@ -622,9 +631,11 @@ if(USE_ROCM)
list(APPEND Caffe2_HIP_SRCS
${TORCH_SRC_DIR}/csrc/cuda/nccl.cpp)
endif()
append_filelist("libtorch_cuda_distributed_base_sources" Caffe2_HIP_SRCS)
if(NOT WIN32)
append_filelist("libtorch_cuda_distributed_extra_sources" Caffe2_HIP_SRCS)
if(USE_DISTRIBUTED)
append_filelist("libtorch_cuda_distributed_base_sources" Caffe2_HIP_SRCS)
if(NOT WIN32)
append_filelist("libtorch_cuda_distributed_extra_sources" Caffe2_HIP_SRCS)
endif()
endif()
# caffe2_nvrtc's stubs to driver APIs are useful for HIP.
# See NOTE [ ATen NVRTC Stub and HIP ]
@ -1345,10 +1356,12 @@ if(BUILD_TEST)
add_subdirectory(${TORCH_ROOT}/test/cpp/jit ${CMAKE_BINARY_DIR}/test_jit)
add_subdirectory(${TORCH_ROOT}/test/cpp/nativert ${CMAKE_BINARY_DIR}/test_nativert)
add_subdirectory(${TORCH_ROOT}/test/inductor ${CMAKE_BINARY_DIR}/test_inductor)
add_subdirectory(${TORCH_ROOT}/test/cpp/c10d ${CMAKE_BINARY_DIR}/test_cpp_c10d)
if(NOT WIN32)
add_subdirectory(${TORCH_ROOT}/test/cpp/dist_autograd ${CMAKE_BINARY_DIR}/dist_autograd)
add_subdirectory(${TORCH_ROOT}/test/cpp/rpc ${CMAKE_BINARY_DIR}/test_cpp_rpc)
if(USE_DISTRIBUTED)
add_subdirectory(${TORCH_ROOT}/test/cpp/c10d ${CMAKE_BINARY_DIR}/test_cpp_c10d)
if(NOT WIN32)
add_subdirectory(${TORCH_ROOT}/test/cpp/dist_autograd ${CMAKE_BINARY_DIR}/dist_autograd)
add_subdirectory(${TORCH_ROOT}/test/cpp/rpc ${CMAKE_BINARY_DIR}/test_cpp_rpc)
endif()
endif()
if(NOT NO_API)
add_subdirectory(${TORCH_ROOT}/test/cpp/api ${CMAKE_BINARY_DIR}/test_api)
@ -1453,40 +1466,46 @@ if(BUILD_LITE_INTERPRETER)
endif()
endif()
if(USE_GLOO AND USE_C10D_GLOO)
target_compile_definitions(torch_cpu PUBLIC USE_C10D_GLOO)
endif()
if(USE_UCC AND USE_C10D_UCC)
target_compile_definitions(torch_cpu PUBLIC USE_C10D_UCC)
if(USE_CUDA)
target_compile_definitions(torch_cuda PUBLIC USE_C10D_UCC)
# Pass USE_DISTRIBUTED to torch_cpu, as some code in jit/pickler.cpp and
# jit/unpickler.cpp needs to be compiled only when USE_DISTRIBUTED is set
if(USE_DISTRIBUTED)
target_compile_definitions(torch_cpu PUBLIC USE_DISTRIBUTED)
if(USE_GLOO AND USE_C10D_GLOO)
target_compile_definitions(torch_cpu PUBLIC USE_C10D_GLOO)
endif()
endif()
if(USE_NCCL AND USE_C10D_NCCL)
if(USE_ROCM)
target_compile_definitions(torch_hip PUBLIC USE_C10D_NCCL)
else()
target_compile_definitions(torch_cuda PUBLIC USE_C10D_NCCL)
if(USE_UCC AND USE_C10D_UCC)
target_compile_definitions(torch_cpu PUBLIC USE_C10D_UCC)
if(USE_CUDA)
target_compile_definitions(torch_cuda PUBLIC USE_C10D_UCC)
endif()
endif()
endif()
if(USE_MPI AND USE_C10D_MPI)
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang" OR CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
set_source_files_properties(
"${TORCH_SRC_DIR}/csrc/distributed/c10d/ProcessGroupMPI.cpp"
PROPERTIES COMPILE_FLAGS -Wno-deprecated-declarations)
if(USE_NCCL AND USE_C10D_NCCL)
if(USE_ROCM)
target_compile_definitions(torch_hip PUBLIC USE_C10D_NCCL)
else()
target_compile_definitions(torch_cuda PUBLIC USE_C10D_NCCL)
endif()
endif()
if(USE_MPI AND USE_C10D_MPI)
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang" OR CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
set_source_files_properties(
"${TORCH_SRC_DIR}/csrc/distributed/c10d/ProcessGroupMPI.cpp"
PROPERTIES COMPILE_FLAGS -Wno-deprecated-declarations)
endif()
target_compile_definitions(torch_cpu PUBLIC USE_C10D_MPI)
endif()
# Pass USE_RPC in order to reduce use of
# #if defined(USE_DISTRIBUTED) && !defined(_WIN32)
# need to be removed when RPC is supported
if(NOT WIN32)
target_compile_definitions(torch_cpu PUBLIC USE_RPC)
endif()
# Pass USE_TENSORPIPE to torch_cpu as some parts of rpc/utils.cpp
# can only be compiled when USE_TENSORPIPE is set.
if(USE_TENSORPIPE)
target_compile_definitions(torch_cpu PUBLIC USE_TENSORPIPE)
endif()
target_compile_definitions(torch_cpu PUBLIC USE_C10D_MPI)
endif()
# Pass USE_RPC in order to reduce use of
# #if defined(USE_DISTRIBUTED) && !defined(_WIN32)
# need to be removed when RPC is supported
if(NOT WIN32)
target_compile_definitions(torch_cpu PUBLIC USE_RPC)
endif()
# Pass USE_TENSORPIPE to torch_cpu as some parts of rpc/utils.cpp
# can only be compiled when USE_TENSORPIPE is set.
if(USE_TENSORPIPE)
target_compile_definitions(torch_cpu PUBLIC USE_TENSORPIPE)
endif()
if(NOT INTERN_BUILD_MOBILE)
@ -1830,6 +1849,12 @@ if(BUILD_TEST)
target_link_libraries(${test_name}_${CPU_CAPABILITY} Sanitizer::undefined)
endif()
endif()
if(USE_LSAN AND TARGET Sanitizer::leak)
target_link_libraries(${test_name}_${CPU_CAPABILITY} Sanitizer::leak)
endif()
if(USE_TSAN AND TARGET Sanitizer::thread)
target_link_libraries(${test_name}_${CPU_CAPABILITY} Sanitizer::thread)
endif()
else()
add_executable(${test_name}_${CPU_CAPABILITY} "${test_src}")
target_link_libraries(${test_name}_${CPU_CAPABILITY} torch_library sleef gtest_main)

View File

@ -108,24 +108,32 @@ if(CAFFE2_CMAKE_BUILDING_WITH_MAIN_REPO AND NOT INTERN_BUILD_MOBILE)
enable_ubsan()
endif()
if(USE_ASAN OR USE_TSAN)
if(USE_ASAN OR USE_LSAN OR USE_TSAN)
find_package(Sanitizer REQUIRED)
if(USE_ASAN)
if(TARGET Sanitizer::address)
list(APPEND Caffe2_DEPENDENCY_LIBS Sanitizer::address)
else()
message(WARNING "Not ASAN found. Suppress this warning with -DUSE_ASAN=OFF.")
message(WARNING "ASAN not found. Suppress this warning with -DUSE_ASAN=OFF.")
caffe2_update_option(USE_ASAN OFF)
endif()
if(TARGET Sanitizer::undefined)
list(APPEND Caffe2_DEPENDENCY_LIBS Sanitizer::undefined)
endif()
endif()
if(USE_LSAN)
if(TARGET Sanitizer::leak)
list(APPEND Caffe2_DEPENDENCY_LIBS Sanitizer::leak)
else()
message(WARNING "LSAN not found. Suppress this warning with -DUSE_LSAN=OFF.")
caffe2_update_option(USE_LSAN OFF)
endif()
endif()
if(USE_TSAN)
if(TARGET Sanitizer::thread)
list(APPEND Caffe2_DEPENDENCY_LIBS Sanitizer::thread)
else()
message(WARNING "Not TSAN found. Suppress this warning with -DUSE_TSAN=OFF.")
message(WARNING "TSAN not found. Suppress this warning with -DUSE_TSAN=OFF.")
caffe2_update_option(USE_TSAN OFF)
endif()
endif()
@ -1126,7 +1134,7 @@ if(USE_CUDA AND CUDA_VERSION VERSION_LESS 13.0)
include_directories(SYSTEM ${CUB_INCLUDE_DIRS})
endif()
if(USE_TENSORPIPE)
if(USE_DISTRIBUTED AND USE_TENSORPIPE)
if(MSVC)
message(WARNING "Tensorpipe cannot be used on Windows.")
else()

View File

@ -66,6 +66,7 @@ function(caffe2_print_configuration_summary)
message(STATUS " LAPACK : ${LAPACK_INFO}")
endif()
message(STATUS " USE_ASAN : ${USE_ASAN}")
message(STATUS " USE_LSAN : ${USE_LSAN}")
message(STATUS " USE_TSAN : ${USE_TSAN}")
message(STATUS " USE_CPP_CODE_COVERAGE : ${USE_CPP_CODE_COVERAGE}")
message(STATUS " USE_CUDA : ${USE_CUDA}")
@ -157,6 +158,7 @@ function(caffe2_print_configuration_summary)
if(${USE_KLEIDIAI})
message(STATUS " USE_KLEIDIAI : ${USE_KLEIDIAI}")
endif()
message(STATUS " USE_PRIORITIZED_TEXT_FOR_LD : ${USE_PRIORITIZED_TEXT_FOR_LD}")
message(STATUS " USE_UCC : ${USE_UCC}")
if(${USE_UCC})
message(STATUS " USE_SYSTEM_UCC : ${USE_SYSTEM_UCC}")
@ -191,11 +193,13 @@ function(caffe2_print_configuration_summary)
message(STATUS " USE_PYTORCH_QNNPACK : ${USE_PYTORCH_QNNPACK}")
message(STATUS " USE_XNNPACK : ${USE_XNNPACK}")
message(STATUS " USE_DISTRIBUTED : ${USE_DISTRIBUTED}")
message(STATUS " USE_MPI : ${USE_MPI}")
message(STATUS " USE_GLOO : ${USE_GLOO}")
message(STATUS " USE_GLOO_WITH_OPENSSL : ${USE_GLOO_WITH_OPENSSL}")
message(STATUS " USE_GLOO_IBVERBS : ${USE_GLOO_IBVERBS}")
message(STATUS " USE_TENSORPIPE : ${USE_TENSORPIPE}")
if(${USE_DISTRIBUTED})
message(STATUS " USE_MPI : ${USE_MPI}")
message(STATUS " USE_GLOO : ${USE_GLOO}")
message(STATUS " USE_GLOO_WITH_OPENSSL : ${USE_GLOO_WITH_OPENSSL}")
message(STATUS " USE_GLOO_IBVERBS : ${USE_GLOO_IBVERBS}")
message(STATUS " USE_TENSORPIPE : ${USE_TENSORPIPE}")
endif()
if(NOT "${SELECTED_OP_LIST}" STREQUAL "")
message(STATUS " SELECTED_OP_LIST : ${SELECTED_OP_LIST}")
endif()

View File

@ -482,6 +482,7 @@ function(torch_update_find_cuda_flags)
endfunction()
include(CheckCXXCompilerFlag)
include(CheckLinkerFlag)
##############################################################################
# Check if given flag is supported and append it to provided outputvar
@ -511,3 +512,22 @@ function(target_compile_options_if_supported target flag)
target_compile_options(${target} PRIVATE ${flag})
endif()
endfunction()
# Check if a global link option is supported
function(add_link_options_if_supported flag)
check_linker_flag(C "LINKER:${flag}" _supported)
if("${_supported}")
add_link_options("LINKER:${flag}")
else()
message(WARNING "Attempted to use unsupported link option : ${flag}.")
endif()
endfunction()
function(target_link_options_if_supported tgt flag)
check_linker_flag(C "LINKER:${flag}" _supported)
if("${_supported}")
target_link_options("${tgt}" PRIVATE "LINKER:${flag}")
else()
message(WARNING "Attempted to use unsupported link option : ${flag}.")
endif()
endfunction()

View File

@ -2,6 +2,10 @@
Since PyTorch 2.1, the community has made significant progress in streamlining the process of integrating new accelerators into the PyTorch ecosystem. These improvements include, but are not limited to: refinements to the `PrivateUse1` Dispatch Key, the introduction and enhancement of core subsystem extension mechanisms, and the device-agnostic refactoring of key modules (e.g., `torch.accelerator`, `memory management`). Taken together, these advances provide the foundation for a **robust**, **flexible**, and **developer-friendly** pathway for accelerator integration.
```{note}
This guide is a work in progress. For more details, please refer to the [roadmap](https://github.com/pytorch/pytorch/issues/158917).
```
## Why Does This Matter?
This integration pathway offers several major benefits:
@ -10,16 +14,6 @@ This integration pathway offers several major benefits:
* **Future-proofing**: This is the default integration path for all future PyTorch features, meaning that as new modules and features are added, they will automatically support scaling to new accelerators if this path is followed.
* **Autonomy**: Vendors maintain full control over their accelerator integration timelines, enabling fast iteration cycles and reducing reliance on upstream coordination.
## About This Document
This guide aims to provide a **comprehensive overview of the modern integration pathway** for new accelerator in PyTorch. It walks through the full integration surface, from low-level device primitives to higher-level domain modules like compilation and quantization. The structure follows a **modular and scenario-driven approach**, where each topic is paired with corresponding code examples from [torch_openreg][OpenReg URL], an official reference implementation.
The goal is to help developers:
* Understand the full scope of accelerator integration;
* Follow best practices to quickly launch new accelerators;
* Avoid common pitfalls through clear, targeted examples.
## Target Audience
This document is intended for:
@ -27,20 +21,22 @@ This document is intended for:
* **Accelerator Developers** who are integrating accelerators into PyTorch;
* **Advanced PyTorch Users** interested in the inner workings of key modules.
## Quick Overview
## About This Document
This document outlines the key processes and practical scenarios involved in integrating new devices into PyTorch, providing developers with a comprehensive and detailed guide for bringing up new backends. The discussion is structured around four major axes:
This guide aims to provide a **comprehensive overview of the modern integration pathway** for new accelerators in PyTorch. It walks through the full integration surface, from low-level device primitives to higher-level domain modules like compilation and quantization. The structure follows a **modular and scenario-driven approach**, where each topic is paired with corresponding code examples from [torch_openreg][OpenReg URL], an official reference implementation. The series is structured around four major axes:
* **Runtime**: Covers core components such as Event, Stream, Memory, Generator, Guard, Hooks, as well as the supporting C++ scaffolding.
* **Operators**: Covers the minimum necessary set of operators, forward and backward operators, fallback operators, fallthroughs, STUBs, and so on, in both C++ and Python implementations.
* **Python Frontend**: Focuses on Python bindings for modules and device-agnostic APIs.
* **High-level Modules**: Explores integration with major subsystems such as `AMP`, `Compiler`, `ONNX`, and `Distributed`.
Next, we will officially embark on the integration journey for a new PyTorch accelerator.
The goal is to help developers:
```{note}
This guide is a work in progress. For more details, please refer to the [roadmap](https://github.com/pytorch/pytorch/issues/158917).
```
* Understand the full scope of accelerator integration;
* Follow best practices to quickly launch new accelerators;
* Avoid common pitfalls through clear, targeted examples.
Next, we will delve into each chapter of this guide. Each chapter focuses on a key aspect of integration, providing detailed explanations and illustrative examples. Since some chapters build upon previous ones, readers are encouraged to follow the sequence to achieve a more coherent understanding.
```{toctree}
:glob:

View File

@ -3333,6 +3333,13 @@ def coverage_post_process(app, exception):
if not isinstance(app.builder, CoverageBuilder):
return
if not torch.distributed.is_available():
raise RuntimeError(
"The coverage tool cannot run with a version "
"of PyTorch that was built with USE_DISTRIBUTED=0 "
"as this module's API changes."
)
# These are all the modules that have "automodule" in an rst file
# These modules are the ones for which coverage is checked
# Here, we make sure that no module is missing from that list

View File

@ -5,7 +5,7 @@
# Tensor Parallelism - torch.distributed.tensor.parallel
Tensor Parallelism (TP) is built on top of the PyTorch DistributedTensor
(DTensor)[https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/README.md]
([DTensor](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/README.md))
and provides different parallelism styles: Colwise, Rowwise, and Sequence Parallelism.
:::{warning}
@ -89,4 +89,4 @@ Parallelized cross-entropy loss computation (loss parallelism), is supported via
```
:::{warning}
The loss_parallel API is experimental and subject to change.
:::
:::

View File

@ -22,18 +22,22 @@ The following is a sample archive. We will walk through the archive folder by fo
├── data
│ ├── aotinductor
│ │ └── model1
│ │ ├── aotinductor_pickle_data.json
│ │ ├── cf5ez6ifexr7i2hezzz4s7xfusj4wtisvu2gddeamh37bw6bghjw.cpp
│ │ ├── cf5ez6ifexr7i2hezzz4s7xfusj4wtisvu2gddeamh37bw6bghjw.so
│ │ ├── cf5ez6ifexr7i2hezzz4s7xfusj4wtisvu2gddeamh37bw6bghjw.kernel_metadata.json
│ │ ├── cf5ez6ifexr7i2hezzz4s7xfusj4wtisvu2gddeamh37bw6bghjw.kernel.cpp
│ │ ├── cf5ez6ifexr7i2hezzz4s7xfusj4wtisvu2gddeamh37bw6bghjw.wrapper_metadata.json
│ │ ├── cf5ez6ifexr7i2hezzz4s7xfusj4wtisvu2gddeamh37bw6bghjw.wrapper.cpp
│ │ ├── cf5ez6ifexr7i2hezzz4s7xfusj4wtisvu2gddeamh37bw6bghjw.wrapper.so
│ │ ├── cg7domx3woam3nnliwud7yvtcencqctxkvvcafuriladwxw4nfiv.cubin
│ │ └── cubaaxppb6xmuqdm4bej55h2pftbce3bjyyvljxbtdfuolmv45ex.cubin
│ ├── weights
│ │ ├── model1_model_param_config.json
│ │ ├── model1_weights_config.json
│ │ ├── model2_weights_config.json
│ │ ├── weight_0
│ │ ├── weight_1
│ │ ├── weight_2
│ └── constants
│ │ ├── model1_model_constants_config.json
│ │ ├── model1_constants_config.json
│ │ ├── model2_constants_config.json
│ │ ├── tensor_0
│ │ ├── tensor_1
│ │ ├── custom_obj_0
@ -67,11 +71,12 @@ example, compilation artifacts for the `model1` model on A100 and H100 will be
saved in `model1-a100` and `model1-h100` folders separately.
The folder typically contains
* `<uuid>.so`: Dynamic library compiled from <uuid>.cpp.
* `<uuid>.cpp`: AOTInductor generated cpp wrapper file.
* `<uuid>.wrapper.so`: Dynamic library compiled from <uuid>.cpp.
* `<uuid>.wrapper.cpp`: AOTInductor generated cpp wrapper file.
* `<uuid>.kernel.cpp`: AOTInductor generated cpp kernel file.
* `*.cubin`: Triton kernels compiled from triton codegen kernels
* `<uuid>.wrapper_metadata.json`: Metadata which was passed in from the `aot_inductor.metadata` inductor config
* (optional) `<uuid>.json`: External fallback nodes for custom ops to be executed by `ProxyExecutor`, serialized according to the `ExternKernelNode` struct. If the model doesn't use custom ops/ProxyExecutor, this file is omitted.
* `<uuid>_metadata.json`: Metadata which was passed in from the `aot_inductor.metadata` inductor config
### Weights
@ -79,16 +84,16 @@ Path: `/data/weights/*`
Model parameters and buffers are saved in the `/data/weights/` folder. Each
tensor is saved as a separate file. The file contains only the raw data blob;
tensor metadata are saved separately in the
`<model_name>_model_param_config.json`.
tensor metadata and mapping from model weight FQN to saved raw data blob are saved separately in the
`<model_name>_weights_config.json`.
### Constants
Path: `/data/constants/*`
TensorConstants, non-persistent buffers and TorchBind objects are saved in the
`/data/constants/` folder. Metadata is saved separately in the
`<model_name>_model_constants_config.json`
`/data/constants/` folder. Metadata and mapping from model constant FQN to saved raw data blob are saved separately in the
`<model_name>_constants_config.json`
### Sample Inputs

View File

@ -14,6 +14,12 @@
```{eval-rst}
.. autofunction:: flex_attention
```
```{eval-rst}
.. autoclass:: AuxOutput
```
```{eval-rst}
.. autoclass:: AuxRequest
```
## BlockMask Utilities

View File

@ -102,6 +102,7 @@ also be interested in reading our [development wiki](https://github.com/pytorch/
onnx_export
onnx_ops
onnx_verification
onnx_testing
```
### Deprecated APIs

View File

@ -0,0 +1,9 @@
# torch.onnx.testing
```{eval-rst}
.. automodule:: torch.onnx.testing
```
```{eval-rst}
.. autofunction:: torch.onnx.testing.assert_onnx_program
```

View File

@ -227,9 +227,6 @@
# Static link mimalloc into C10, and use mimalloc in alloc_cpu & alloc_free.
# By default, It is only enabled on Windows.
#
# USE_PRIORITIZED_TEXT_FOR_LD
# Uses prioritized text form cmake/prioritized_text.txt for LD
#
# BUILD_LIBTORCH_WHL
# Builds libtorch.so and its dependencies as a wheel
#
@ -323,7 +320,6 @@ from tools.setup_helpers.env import (
IS_LINUX,
IS_WINDOWS,
)
from tools.setup_helpers.generate_linker_script import gen_linker_script
def str2bool(value: str | None) -> bool:
@ -1627,26 +1623,6 @@ def main() -> None:
if BUILD_PYTHON_ONLY:
install_requires += [f"{LIBTORCH_PKG_NAME}=={TORCH_VERSION}"]
if str2bool(os.getenv("USE_PRIORITIZED_TEXT_FOR_LD")):
gen_linker_script(
filein="cmake/prioritized_text.txt", fout="cmake/linker_script.ld"
)
linker_script_path = os.path.abspath("cmake/linker_script.ld")
os.environ["LDFLAGS"] = os.getenv("LDFLAGS", "") + f" -T{linker_script_path}"
os.environ["CFLAGS"] = (
os.getenv("CFLAGS", "") + " -ffunction-sections -fdata-sections"
)
os.environ["CXXFLAGS"] = (
os.getenv("CXXFLAGS", "") + " -ffunction-sections -fdata-sections"
)
elif platform.system() == "Linux" and platform.processor() == "aarch64":
print_box(
"""
WARNING: we strongly recommend enabling linker script optimization for ARM + CUDA.
To do so please export USE_PRIORITIZED_TEXT_FOR_LD=1
"""
)
# Parse the command line and check the arguments before we proceed with
# building deps and setup. We need to set values so `--help` works.
dist = Distribution()

View File

@ -1,4 +1,4 @@
if(NOT WIN32)
if(USE_DISTRIBUTED AND NOT WIN32)
set(DIST_AUTOGRAD_TEST_DIR "${TORCH_ROOT}/test/cpp/dist_autograd")
set(DIST_AUTOGRAD_TEST_SOURCES
${TORCH_ROOT}/test/cpp/common/main.cpp

View File

@ -40,26 +40,30 @@ set(NATIVERT_TEST_SRCS
${TORCH_ROOT}/torch/nativert/graph/passes/pass_manager/GraphPasses.cpp
${TORCH_ROOT}/torch/nativert/graph/passes/pass_manager/PassManager.cpp
${TORCH_ROOT}/torch/nativert/kernels/KernelHandlerRegistry.cpp
${TORCH_ROOT}/torch/nativert/kernels/TritonKernel.cpp
${TORCH_ROOT}/torch/nativert/executor/triton/CpuTritonKernelManager.cpp
${TORCH_ROOT}/torch/nativert/kernels/TritonKernel.cpp
${TORCH_ROOT}/torch/nativert/executor/DelegateExecutor.cpp
)
if(USE_CUDA)
list(APPEND NATIVERT_TEST_SRCS ${TORCH_ROOT}/torch/nativert/executor/triton/CudaTritonKernelManager.cpp)
endif(MSVC)
endif()
add_executable(test_nativert
${TORCH_ROOT}/test/cpp/common/main.cpp
${NATIVERT_TEST_SRCS}
)
if(MSVC)
target_compile_definitions(test_nativert PRIVATE NATIVERT_MSVC_TEST)
endif()
# TODO temporary until we can delete the old gtest polyfills.
target_compile_definitions(test_nativert PRIVATE USE_GTEST)
set(NATIVERT_TEST_DEPENDENCIES torch gtest_main)
target_link_libraries(test_nativert PRIVATE ${CMAKE_DL_LIBS})
target_link_libraries(test_nativert PRIVATE ${NATIVERT_TEST_DEPENDENCIES})
target_link_libraries(test_nativert PRIVATE fmt::fmt-header-only)
target_include_directories(test_nativert PRIVATE ${ATen_CPU_INCLUDE})

View File

@ -6,9 +6,20 @@ using namespace ::testing;
using namespace torch::nativert;
TEST(TritonKernelManagerRegistrationTests, TestRegister) {
#ifndef USE_CUDA
EXPECT_TRUE(create_cuda_triton_kernel_manager == nullptr);
EXPECT_TRUE(TritonKernelManagerRegistry()->Has(at::kCPU));
#ifdef USE_CUDA
#ifdef USE_ROCM
EXPECT_TRUE(TritonKernelManagerRegistry()->Has(at::kHIP));
EXPECT_FALSE(TritonKernelManagerRegistry()->Has(at::kCUDA));
#else
EXPECT_FALSE(create_cuda_triton_kernel_manager == nullptr);
EXPECT_TRUE(TritonKernelManagerRegistry()->Has(at::kCUDA));
EXPECT_FALSE(TritonKernelManagerRegistry()->Has(at::kHIP));
#endif // USE_ROCM
#else
EXPECT_FALSE(TritonKernelManagerRegistry()->Has(at::kCUDA));
EXPECT_FALSE(TritonKernelManagerRegistry()->Has(at::kHIP));
#endif // USE_CUDA
}

View File

@ -28,7 +28,11 @@ from torch.testing._internal.common_fsdp import (
patch_reduce_scatter,
reduce_scatter_with_assert,
)
from torch.testing._internal.common_utils import run_tests, skipIfRocm, TEST_HPU
from torch.testing._internal.common_utils import (
run_tests,
skipIfRocmVersionLessThan,
TEST_HPU,
)
device_type = torch.device(get_devtype())
@ -86,7 +90,7 @@ class TestFullyShardMixedPrecisionTraining(FSDPTest):
use_shard_placement_fn_vals.append(True)
return use_shard_placement_fn_vals
@skipIfRocm # regressed in ROCm 6.4, but ROCm 6.5 fixes it
@skipIfRocmVersionLessThan((7, 0))
@skip_if_lt_x_gpu(2)
@requires_nccl_version((2, 10), "Need NCCL 2.10+ for bf16 collectives")
def test_compute_dtype(self):
@ -166,7 +170,7 @@ class TestFullyShardMixedPrecisionTraining(FSDPTest):
self.assertEqual(fsdp_loss, ref_loss)
check_sharded_parity(self, ref_model, model)
@skipIfRocm # regressed in ROCm 6.4, but ROCm 6.5 fixes it
@skipIfRocmVersionLessThan((7, 0))
@skip_if_lt_x_gpu(2)
@requires_nccl_version((2, 10), "Need NCCL 2.10+ for bf16 collectives")
def test_reduce_dtype(self):

View File

@ -7,7 +7,6 @@ import torch.nn as nn
from torch.distributed._tools.mem_tracker import MemTracker
from torch.testing._internal.common_utils import (
run_tests,
skipIfRocm,
skipIfTorchDynamo,
TEST_CUDA,
TEST_XPU,
@ -34,7 +33,6 @@ class TestMemTracker(TestCase):
@unittest.skipIf(
not TEST_CUDA and not TEST_XPU, "Neither CUDA or XPU is not available"
)
@skipIfRocm()
def test_accelerator_tracker_equivalence(
self,
):

View File

@ -0,0 +1,331 @@
#!/usr/bin/env python3
# Owner(s): ["oncall: r2p"]
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
import os
import signal
from unittest.mock import MagicMock, patch

from torch.distributed.elastic.multiprocessing.api import (
    _terminate_process_handler,
    PContext,
    SignalException,
)
from torch.testing._internal.common_utils import run_tests, TestCase


class SignalHandlingTest(TestCase):
    def setUp(self):
        # Save original environment variable if it exists
        self.original_signals_env = os.environ.get(
            "TORCHELASTIC_SIGNALS_TO_HANDLE", None
        )

    def tearDown(self):
        # Restore original environment variable
        if self.original_signals_env is not None:
            os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"] = self.original_signals_env
        elif "TORCHELASTIC_SIGNALS_TO_HANDLE" in os.environ:
            del os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"]
    def test_terminate_process_handler(self):
        """Test that the terminate process handler raises SignalException with the correct signal."""
        signum = signal.SIGTERM
        with self.assertRaises(SignalException) as cm:
            _terminate_process_handler(signum, None)

        self.assertEqual(cm.exception.sigval, signal.SIGTERM)
        # The signal is represented as a number in the string representation
        self.assertIn(f"Process {os.getpid()} got signal: {signum}", str(cm.exception))
    @patch("torch.distributed.elastic.multiprocessing.api.threading")
    @patch("torch.distributed.elastic.multiprocessing.api.signal")
    @patch("torch.distributed.elastic.multiprocessing.api.logger")
    def test_start_registers_default_signals(
        self, mock_logger, mock_signal, mock_threading
    ):
        """Test that the start method registers the default signals."""
        # Setup
        mock_threading.current_thread.return_value = (
            mock_threading.main_thread.return_value
        )
        mock_pcontext = MagicMock(spec=PContext)
        # Mock the _stdout_tail and _stderr_tail attributes
        mock_pcontext._stdout_tail = MagicMock()
        mock_pcontext._stderr_tail = MagicMock()

        # Remove environment variable if it exists to test default behavior
        if "TORCHELASTIC_SIGNALS_TO_HANDLE" in os.environ:
            del os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"]

        # Call the start method
        PContext.start(mock_pcontext)

        # Verify that the signal handler was registered for the default signals
        expected_signals = ["SIGTERM", "SIGINT", "SIGHUP", "SIGQUIT"]

        # Count the number of calls to signal.signal
        signal_calls = 0
        for call in mock_signal.signal.call_args_list:
            args, _ = call
            sig, handler = args
            signal_calls += 1
            # Verify the handler is our _terminate_process_handler
            self.assertEqual(handler, _terminate_process_handler)

        # Verify we registered the expected number of signals
        self.assertEqual(signal_calls, len(expected_signals))

        # Verify _start was called
        mock_pcontext._start.assert_called_once()
        # Verify _stdout_tail.start() and _stderr_tail.start() were called
        mock_pcontext._stdout_tail.start.assert_called_once()
        mock_pcontext._stderr_tail.start.assert_called_once()
    @patch("torch.distributed.elastic.multiprocessing.api.threading")
    @patch("torch.distributed.elastic.multiprocessing.api.signal")
    @patch("torch.distributed.elastic.multiprocessing.api.logger")
    def test_start_registers_custom_signals(
        self, mock_logger, mock_signal, mock_threading
    ):
        """Test that the start method registers custom signals from the environment variable."""
        # Setup
        mock_threading.current_thread.return_value = (
            mock_threading.main_thread.return_value
        )
        mock_pcontext = MagicMock(spec=PContext)
        # Mock the _stdout_tail and _stderr_tail attributes
        mock_pcontext._stdout_tail = MagicMock()
        mock_pcontext._stderr_tail = MagicMock()

        # Set custom signals in the environment variable
        os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"] = "SIGTERM,SIGUSR1,SIGUSR2"

        # Call the start method
        PContext.start(mock_pcontext)

        # Verify that the signal handler was registered for the custom signals
        expected_signals = ["SIGTERM", "SIGUSR1", "SIGUSR2"]

        # Count the number of calls to signal.signal
        signal_calls = 0
        for call in mock_signal.signal.call_args_list:
            args, _ = call
            sig, handler = args
            signal_calls += 1
            # Verify the handler is our _terminate_process_handler
            self.assertEqual(handler, _terminate_process_handler)

        # Verify we registered the expected number of signals
        self.assertEqual(signal_calls, len(expected_signals))

        # Verify _start was called
        mock_pcontext._start.assert_called_once()
    @patch("torch.distributed.elastic.multiprocessing.api.threading")
    @patch("torch.distributed.elastic.multiprocessing.api.signal")
    @patch("torch.distributed.elastic.multiprocessing.api.logger")
    def test_start_handles_invalid_signals(
        self, mock_logger, mock_signal, mock_threading
    ):
        """Test that the start method handles invalid signals gracefully."""
        # Setup
        mock_threading.current_thread.return_value = (
            mock_threading.main_thread.return_value
        )
        mock_pcontext = MagicMock(spec=PContext)
        # Mock the _stdout_tail and _stderr_tail attributes
        mock_pcontext._stdout_tail = MagicMock()
        mock_pcontext._stderr_tail = MagicMock()

        # Set invalid signals in the environment variable
        os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"] = "SIGTERM,INVALID_SIGNAL"

        # Mock the signal module to not have the INVALID_SIGNAL attribute
        # but have SIGTERM
        mock_signal.SIGTERM = signal.SIGTERM
        # Remove INVALID_SIGNAL attribute if it exists
        if hasattr(mock_signal, "INVALID_SIGNAL"):
            delattr(mock_signal, "INVALID_SIGNAL")

        # Call the start method
        PContext.start(mock_pcontext)

        # Verify that the warning was logged for the invalid signal
        # The exact message may vary, so let's check if warning was called with INVALID_SIGNAL
        warning_calls = [
            call
            for call in mock_logger.warning.call_args_list
            if "INVALID_SIGNAL" in str(call)
        ]
        self.assertTrue(len(warning_calls) > 0, "Expected warning about INVALID_SIGNAL")

        # Verify _start was called
        mock_pcontext._start.assert_called_once()
    @patch("torch.distributed.elastic.multiprocessing.api.threading")
    @patch("torch.distributed.elastic.multiprocessing.api.signal")
    @patch("torch.distributed.elastic.multiprocessing.api.logger")
    def test_start_handles_windows_signals(
        self, mock_logger, mock_signal, mock_threading
    ):
        """Test that the start method handles Windows-specific signal behavior."""
        # Setup
        mock_threading.current_thread.return_value = (
            mock_threading.main_thread.return_value
        )
        mock_pcontext = MagicMock(spec=PContext)
        # Mock the _stdout_tail and _stderr_tail attributes
        mock_pcontext._stdout_tail = MagicMock()
        mock_pcontext._stderr_tail = MagicMock()

        # Set signals including ones not supported on Windows
        os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"] = "SIGTERM,SIGHUP,SIGUSR1"

        # Mock signal attributes
        mock_signal.SIGTERM = signal.SIGTERM
        mock_signal.SIGHUP = signal.SIGHUP
        mock_signal.SIGUSR1 = signal.SIGUSR1

        # Mock IS_WINDOWS to be True
        with patch("torch.distributed.elastic.multiprocessing.api.IS_WINDOWS", True):
            # Mock signal.signal to raise RuntimeError for Windows-unsupported signals
            def signal_side_effect(sig, handler):
                if sig in [signal.SIGHUP, signal.SIGUSR1]:
                    raise RuntimeError("Signal not supported on Windows")

            mock_signal.signal.side_effect = signal_side_effect

            # Call the start method
            PContext.start(mock_pcontext)

            # Verify that the info was logged for the unsupported signals
            # Check if any info calls contain the expected messages
            info_calls = [str(call) for call in mock_logger.info.call_args_list]
            sighup_logged = any(
                "SIGHUP" in call and "Windows" in call for call in info_calls
            )
            sigusr1_logged = any(
                "SIGUSR1" in call and "Windows" in call for call in info_calls
            )

            self.assertTrue(
                sighup_logged,
                f"Expected SIGHUP Windows message in info calls: {info_calls}",
            )
            self.assertTrue(
                sigusr1_logged,
                f"Expected SIGUSR1 Windows message in info calls: {info_calls}",
            )

            # Verify _start was called
            mock_pcontext._start.assert_called_once()
    @patch("torch.distributed.elastic.multiprocessing.api.threading")
    @patch("torch.distributed.elastic.multiprocessing.api.logger")
    def test_start_not_main_thread(self, mock_logger, mock_threading):
        """Test that the start method warns when not called from the main thread."""
        # Setup
        mock_threading.current_thread.return_value = MagicMock()  # Not the main thread
        mock_threading.main_thread.return_value = MagicMock()
        mock_pcontext = MagicMock(spec=PContext)
        # Mock the _stdout_tail and _stderr_tail attributes
        mock_pcontext._stdout_tail = MagicMock()
        mock_pcontext._stderr_tail = MagicMock()

        # Call the start method
        PContext.start(mock_pcontext)

        # Verify that the warning was logged
        mock_logger.warning.assert_called_with(
            "Failed to register signal handlers since torchelastic is running on a child thread. "
            "This could lead to orphaned worker processes if the torchrun is terminated."
        )

        # Verify _start was called
        mock_pcontext._start.assert_called_once()
    @patch("torch.distributed.elastic.multiprocessing.api.threading")
    @patch("torch.distributed.elastic.multiprocessing.api.signal")
    @patch("torch.distributed.elastic.multiprocessing.api.logger")
    def test_start_supports_sigusr1_and_sigusr2(
        self, mock_logger, mock_signal, mock_threading
    ):
        """Test that the start method properly supports SIGUSR1 and SIGUSR2 signals."""
        # Setup
        mock_threading.current_thread.return_value = (
            mock_threading.main_thread.return_value
        )
        mock_pcontext = MagicMock(spec=PContext)
        # Mock the _stdout_tail and _stderr_tail attributes
        mock_pcontext._stdout_tail = MagicMock()
        mock_pcontext._stderr_tail = MagicMock()

        # Set environment variable to include SIGUSR1 and SIGUSR2
        os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"] = "SIGUSR1,SIGUSR2"

        # Mock signal attributes to have SIGUSR1 and SIGUSR2
        mock_signal.SIGUSR1 = signal.SIGUSR1
        mock_signal.SIGUSR2 = signal.SIGUSR2

        # Call the start method
        PContext.start(mock_pcontext)

        # Verify that signal.signal was called for both SIGUSR1 and SIGUSR2
        signal_calls = mock_signal.signal.call_args_list
        registered_signals = [
            call[0][0] for call in signal_calls
        ]  # Extract the signal from each call

        # Verify both SIGUSR1 and SIGUSR2 were registered
        self.assertIn(
            signal.SIGUSR1, registered_signals, "SIGUSR1 should be registered"
        )
        self.assertIn(
            signal.SIGUSR2, registered_signals, "SIGUSR2 should be registered"
        )

        # Verify the correct handler was registered for both signals
        for call in signal_calls:
            sig, handler = call[0]
            if sig in [signal.SIGUSR1, signal.SIGUSR2]:
                self.assertEqual(
                    handler,
                    _terminate_process_handler,
                    f"Signal {sig} should use _terminate_process_handler",
                )

        # Verify that info messages were logged for successful registration
        info_calls = [str(call) for call in mock_logger.info.call_args_list]
        sigusr1_logged = any(
            "SIGUSR1" in call and "Registered signal handler" in call
            for call in info_calls
        )
        sigusr2_logged = any(
            "SIGUSR2" in call and "Registered signal handler" in call
            for call in info_calls
        )

        self.assertTrue(
            sigusr1_logged,
            f"Expected SIGUSR1 registration message in info calls: {info_calls}",
        )
        self.assertTrue(
            sigusr2_logged,
            f"Expected SIGUSR2 registration message in info calls: {info_calls}",
        )

        # Verify _start was called
        mock_pcontext._start.assert_called_once()
        # Verify _stdout_tail.start() and _stderr_tail.start() were called
        mock_pcontext._stdout_tail.start.assert_called_once()
        mock_pcontext._stderr_tail.start.assert_called_once()


if __name__ == "__main__":
    run_tests()
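
The tests above exercise how a comma-separated `TORCHELASTIC_SIGNALS_TO_HANDLE` value is turned into registered handlers, including the case where a name such as `INVALID_SIGNAL` is unknown. The following is a minimal sketch of that parsing step only, assuming unknown names are logged and skipped. `resolve_signals` and `DEFAULT_SIGNALS` are hypothetical names for illustration and are not taken from `torch.distributed.elastic`; the actual `PContext.start` logic may differ.

```python
import logging
import signal

logger = logging.getLogger(__name__)

# Assumption: mirrors the default list the tests expect; hypothetical constant name.
DEFAULT_SIGNALS = "SIGTERM,SIGINT,SIGHUP,SIGQUIT"


def resolve_signals(spec: str = DEFAULT_SIGNALS) -> list:
    """Map a comma-separated list of signal names to signal.Signals members.

    Hypothetical helper: names that the signal module does not define
    (or that are not real signals, e.g. SIG_DFL) are logged and skipped.
    """
    resolved = []
    for name in (part.strip() for part in spec.split(",")):
        if not name:
            continue  # tolerate empty entries such as a trailing comma
        sig = getattr(signal, name, None)
        if isinstance(sig, signal.Signals):
            resolved.append(sig)
        else:
            logger.warning("Ignoring unknown signal name: %s", name)
    return resolved
```

A caller could then loop over the result and call `signal.signal(sig, handler)` for each entry from the main thread, which is the registration step the mocked tests count.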

View File

@ -116,7 +116,6 @@ class DistributedUtilTest(TestCase):
timeout=1,
)
@skipIfRocm
def test_create_store_timeout_on_worker(self):
with self.assertRaises(DistNetworkError):
# use any available port (port 0) since timeout is expected

View File

@ -38,7 +38,6 @@ from torch.testing._internal.common_utils import (
instantiate_parametrized_tests,
parametrize,
run_tests,
skipIfRocm,
TEST_WITH_DEV_DBG_ASAN,
)
@ -514,7 +513,6 @@ class TestFSDPOptimState(FSDPTest):
continue
self.assertEqual(full_osd_value, ref_osd_pg[name])
@skipIfRocm
@skip_if_lt_x_gpu(2)
@parametrize("state_dict_type", STATE_DICT_TYPES)
@parametrize("use_multiple_param_groups", [False, True])

View File

@ -0,0 +1,100 @@
#!/usr/bin/env python3
# Owner(s): ["oncall: r2p"]
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
import os
from unittest.mock import MagicMock, patch

from torch.distributed.launcher.api import launch_agent, LaunchConfig
from torch.testing._internal.common_utils import run_tests, TestCase


class LauncherApiTest(TestCase):
    def setUp(self):
        # Save original environment variable if it exists
        self.original_signals_env = os.environ.get(
            "TORCHELASTIC_SIGNALS_TO_HANDLE", None
        )

    def tearDown(self):
        # Restore original environment variable
        if self.original_signals_env is not None:
            os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"] = self.original_signals_env
        elif "TORCHELASTIC_SIGNALS_TO_HANDLE" in os.environ:
            del os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"]
    @patch("torch.distributed.launcher.api.LocalElasticAgent")
    @patch("torch.distributed.launcher.api.rdzv_registry.get_rendezvous_handler")
    def test_launch_agent_sets_signals_env_var(self, mock_get_handler, mock_agent):
        """Test that launch_agent sets the TORCHELASTIC_SIGNALS_TO_HANDLE environment variable."""
        # Setup
        config = LaunchConfig(
            min_nodes=1,
            max_nodes=1,
            nproc_per_node=1,
            signals_to_handle="SIGTERM,SIGUSR1,SIGUSR2",
        )
        entrypoint = "dummy_script.py"
        args = []

        # Make sure the environment variable doesn't exist before the test
        if "TORCHELASTIC_SIGNALS_TO_HANDLE" in os.environ:
            del os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"]

        # Mock agent.run() to return a MagicMock
        mock_agent_instance = MagicMock()
        mock_agent_instance.run.return_value = MagicMock(
            is_failed=lambda: False, return_values={}
        )
        mock_agent.return_value = mock_agent_instance

        # Call launch_agent
        launch_agent(config, entrypoint, args)

        # Verify that the environment variable was set correctly
        self.assertEqual(
            os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"], "SIGTERM,SIGUSR1,SIGUSR2"
        )
    @patch("torch.distributed.launcher.api.LocalElasticAgent")
    @patch("torch.distributed.launcher.api.rdzv_registry.get_rendezvous_handler")
    def test_launch_agent_default_signals(self, mock_get_handler, mock_agent):
        """Test that launch_agent uses the default signals if not specified."""
        # Setup
        config = LaunchConfig(
            min_nodes=1,
            max_nodes=1,
            nproc_per_node=1,
            # Not specifying signals_to_handle, should use default
        )
        entrypoint = "dummy_script.py"
        args = []

        # Make sure the environment variable doesn't exist before the test
        if "TORCHELASTIC_SIGNALS_TO_HANDLE" in os.environ:
            del os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"]

        # Mock agent.run() to return a MagicMock
        mock_agent_instance = MagicMock()
        mock_agent_instance.run.return_value = MagicMock(
            is_failed=lambda: False, return_values={}
        )
        mock_agent.return_value = mock_agent_instance

        # Call launch_agent
        launch_agent(config, entrypoint, args)

        # Verify that the environment variable was set to the default value
        self.assertEqual(
            os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"],
            "SIGTERM,SIGINT,SIGHUP,SIGQUIT",
        )


if __name__ == "__main__":
    run_tests()
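
Both test classes save and restore `TORCHELASTIC_SIGNALS_TO_HANDLE` by hand in `setUp` and `tearDown`. As a possible alternative, the standard library's `unittest.mock.patch.dict` can scope an environment change to a `with` block and restore the previous state on exit. This is only a sketch of that pattern; `read_signals_env` and `demo` are hypothetical names, and nothing here is taken from the PR.

```python
import os
from typing import Optional, Tuple
from unittest.mock import patch

ENV_VAR = "TORCHELASTIC_SIGNALS_TO_HANDLE"


def read_signals_env() -> Optional[str]:
    """Return the current value of the environment variable, or None if unset."""
    return os.environ.get(ENV_VAR)


def demo() -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Show that patch.dict restores os.environ when the block exits."""
    before = read_signals_env()
    # patch.dict records the mapping's state and restores it on exit,
    # even if the body raises.
    with patch.dict(os.environ, {ENV_VAR: "SIGTERM,SIGUSR1"}):
        inside = read_signals_env()
    after = read_signals_env()
    return before, inside, after
```

Because the restore happens on block exit, a test written this way would not need a matching `tearDown`, at the cost of one extra indentation level around the code under test.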

View File

@ -20,7 +20,15 @@ from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
)
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor import DeviceMesh, DTensor, Partial, Replicate, Shard
from torch.distributed.tensor import (
DeviceMesh,
distribute_module,
distribute_tensor,
DTensor,
Partial,
Replicate,
Shard,
)
from torch.distributed.tensor._dtensor_spec import DTensorSpec, TensorMeta
from torch.distributed.tensor.parallel import (
ColwiseParallel,
@ -88,6 +96,33 @@ aot_eager_graph = aot_autograd(
)
def _apply_sharding(mod: nn.Module, shard_dim: int, device_mesh: DeviceMesh):
    """
    Shards on the given dimension if possible, else replicate
    Args:
        mod: (nn.Module) Module to shard or replicate
        shard_dim: (int) Dimension to shard on if possible
        device_mesh: (DeviceMesh) 1D Device Mesh
    Returns:
        Sharded DTensor
    """

    def shard_module_params(name, module, device_mesh):
        for name, param in module.named_parameters():
            placement = Replicate()
            if shard_dim < len(param.size()):
                placement = Shard(shard_dim)
            dist_param = torch.nn.Parameter(
                distribute_tensor(param, device_mesh, [placement])
            )
            name = name.split(".")[-1]
            module.register_parameter(name, dist_param)

    sharded_mod = distribute_module(mod, device_mesh, shard_module_params)
    return sharded_mod
class TestDTensorCompile(torch._dynamo.test_case.TestCase):
def setUp(self):
super(
@ -148,7 +183,7 @@ class TestDTensorCompile(torch._dynamo.test_case.TestCase):
)
torch.utils._pytree.register_constant(DeviceMesh)
ep = torch.export.export_for_training(
ep = torch.export.export(
Foo(), (torch.randn(4, 4, dtype=torch.float64),), strict=False
)
self.assertExpectedInline(
@ -167,6 +202,8 @@ def forward(self, b_buffer, x):
return (view_as_1,)""", # noqa: B950
)
# During tracing, sharding propagation cache is skipped, so an extra dry run for
# add is performed in _propagate_tensor_meta_non_cached, hence add_1 instead of add
self.assertExpectedInline(
str(ep.run_decompositions({}).graph_module.code).strip(),
"""\
@ -174,8 +211,8 @@ def forward(self, b_parametrizations_buffer_original0, x):
_assert_tensor_metadata = torch.ops.aten._assert_tensor_metadata.default(x, None, None, torch.float64, device = device(type='cpu'), layout = torch.strided); _assert_tensor_metadata = None
_to_copy = torch.ops.aten._to_copy.default(x, dtype = torch.float64, layout = torch.strided, device = device(type='cuda', index=0)); x = None
view = torch.ops.aten.view.default(_to_copy, [4, 4]); _to_copy = None
add = torch.ops.aten.add.Tensor(b_parametrizations_buffer_original0, view); b_parametrizations_buffer_original0 = view = None
view_1 = torch.ops.aten.view.default(add, [4, 4]); add = None
add_1 = torch.ops.aten.add.Tensor(b_parametrizations_buffer_original0, view); b_parametrizations_buffer_original0 = view = None
view_1 = torch.ops.aten.view.default(add_1, [4, 4]); add_1 = None
return (view_1,)""", # noqa: B950
)
@ -269,7 +306,9 @@ def forward(self, b_parametrizations_buffer_original0, x):
.to_local()[0]
)
x = DTensor.from_local(torch.rand(4, 4), mesh, [Shard(0)], run_check=False)
x = DTensor.from_local(
torch.rand(4, 4, requires_grad=True), mesh, [Shard(0)], run_check=False
)
torch._dynamo.mark_dynamic(x, 0)
ref = fn(x)
@ -290,7 +329,9 @@ def forward(self, b_parametrizations_buffer_original0, x):
for t in torch.tensor_split(x, 2)
]
x = DTensor.from_local(torch.rand(4, 4), mesh, [Shard(0)], run_check=False)
x = DTensor.from_local(
torch.rand(4, 4, requires_grad=True), mesh, [Shard(0)], run_check=False
)
ref = fn(x)
opt_fn = torch.compile(fn, backend="aot_eager", fullgraph=True, dynamic=True)
@ -317,6 +358,30 @@ def forward(self, b_parametrizations_buffer_original0, x):
res = opt_fn(x)
self.assertEqual(res, ref)
def test_dtensor_dynamic_cat(self):
mesh = DeviceMesh(self.device_type, torch.arange(self.world_size))
# test passing in tuple of DTensors as
def fn(x, y):
return (
torch.cat((x, y), dim=0)
.redistribute(device_mesh=x.device_mesh, placements=[Replicate()])
.to_local()[0]
)
x = DTensor.from_local(
torch.rand(4, 4, requires_grad=True), mesh, [Shard(0)], run_check=False
)
y = DTensor.from_local(
torch.rand(4, 4, requires_grad=True), mesh, [Shard(0)], run_check=False
)
torch._dynamo.mark_dynamic(x, 0)
ref = fn(x, y)
opt_fn = torch.compile(fn, backend="aot_eager", fullgraph=True)
res = opt_fn(x, y)
self.assertEqual(res, ref)
def test_dtensor_attribute_access_on_intermediate(self):
mesh = DeviceMesh(self.device_type, torch.arange(self.world_size))
@ -1150,6 +1215,29 @@ class TestDTensorCompileE2E(DTensorTestBase):
self.assertEqual(x_ref.grad, x.grad)
self.assertEqual(y_ref.grad, y.grad)
@with_comms
def test_compile_embedding_redistribute(self):
mesh = self.build_device_mesh()
class Network(nn.Module):
def __init__(self, embedding, mesh):
super().__init__()
self.mesh = mesh
self.embedding = _apply_sharding(embedding, 0, self.mesh)
def forward(self, x):
x = self.embedding(x)
x = x.redistribute(self.mesh, [Shard(1)])
return x
embedding = torch.nn.Embedding(10, 20, device=self.device_type)
inp = torch.randint(0, 10, (8,), device=self.device_type)
ref_out = embedding(inp)
sharded_net = torch.compile(Network(embedding, mesh))
replicated_inp = DTensor.from_local(inp, mesh, [Replicate()], run_check=False)
output = sharded_net(replicated_inp)
self.assertEqual(output.full_tensor(), ref_out)
if __name__ == "__main__":
run_tests()

View File

@ -1,41 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates
# Owner(s): ["oncall: distributed"]
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed.tensor import DTensor
from torch.distributed.tensor.placement_types import Shard
from torch.testing._internal.common_utils import run_tests, TestCase
from torch.testing._internal.distributed.fake_pg import FakeStore
class TestFakeDTensor(TestCase):
def test_fake_dtensor_operations(self):
# Use FakeTensorMode to handle CUDA tensors without actual CUDA
fake_mode = FakeTensorMode()
world_size = 4
fake_store = FakeStore()
torch.distributed.init_process_group(
"fake", store=fake_store, rank=0, world_size=world_size
)
device_mesh = torch.distributed.device_mesh.init_device_mesh(
"cuda",
(2, world_size // 2),
)
# Create fake CUDA tensor using FakeTensorMode
with fake_mode:
x = torch.randn(1, 1, device="cuda")
x = DTensor.from_local(x, device_mesh, [Shard(0), Shard(1)])
# Test basic DTensor operations
self.assertIsInstance(x, DTensor)
# Test sum operation
r = x.sum(1)
self.assertIsInstance(r, DTensor)
if __name__ == "__main__":
run_tests()

View File

@ -24,7 +24,7 @@ from torch.distributed.tensor.parallel import (
RowwiseParallel,
SequenceParallel,
)
from torch.testing._internal.common_utils import run_tests, skipIfRocm
from torch.testing._internal.common_utils import run_tests
from torch.testing._internal.distributed._tensor.common_dtensor import (
DTensorTestBase,
skip_unless_torch_gpu,
@ -695,7 +695,6 @@ class DistMathOpsTest(DTensorTestBase):
self.assertEqual(grad1_norm.device_mesh, mesh_y)
@with_comms
@skipIfRocm
def test_foreach_add_different_mesh(self):
mesh_shape = (2, self.world_size // 2)
mesh_2d = init_device_mesh(

View File

@ -1,6 +1,8 @@
# Copyright (c) Meta Platforms, Inc. and affiliates
# Owner(s): ["oncall: distributed"]
import itertools
import torch
from torch.distributed.tensor import (
DeviceMesh,
@ -93,6 +95,19 @@ class DistTensorOpsTest(DTensorTestBase):
dst_tensor.copy_(src_tensor)
self.assertEqual(dst_dtensor.full_tensor(), dst_tensor)
# as a pointwise op, need to keep Partial placements without redistribute
src_tensor = torch.randn((64, 1))
dst_tensor = torch.zeros(16, 32, 64, 128)
src_specs = [[Partial()]]
dst_specs = [[Partial()]]
for dst_spec, src_spec in zip(dst_specs, src_specs):
src_dtensor = DTensor.from_local(src_tensor, device_mesh, src_spec)
dst_dtensor = DTensor.from_local(dst_tensor, device_mesh, dst_spec)
dst_dtensor.copy_(src_dtensor)
dst_tensor.copy_(src_tensor)
self.assertEqual(dst_dtensor.placements, (Partial(),))
self.assertEqual(dst_dtensor._local_tensor, dst_tensor)
@with_comms
def test_contiguous(self):
device_mesh = self.build_device_mesh()
@ -776,6 +791,36 @@ class DistTensorOpsTest(DTensorTestBase):
dim=split_dim,
)
    @with_comms
    def test_unbind(self):
        device_mesh = self.build_device_mesh()
        shard_dims = [0, 1]
        unbind_dims = [0, 1]
        local_tensor = torch.randn(4, 8, requires_grad=True)
        for shard_dim, unbind_dim in itertools.product(shard_dims, unbind_dims):
            dist_tensor = distribute_tensor(
                local_tensor, device_mesh, (Shard(shard_dim),)
            )

            if shard_dim == unbind_dim:
                with self.assertRaisesRegex(
                    RuntimeError, "Sharding propagation failed"
                ):
                    dist_tensor.unbind(dim=unbind_dim)
            else:
                unbinded_dist_tensors = dist_tensor.unbind(dim=unbind_dim)
                new_shard_dim = shard_dim if shard_dim < unbind_dim else shard_dim - 1
                self.assertTrue(
                    all(
                        elem.placements[0].is_shard(dim=new_shard_dim)
                        for elem in unbinded_dist_tensors
                    )
                )
                for x, y in zip(
                    unbinded_dist_tensors, local_tensor.unbind(dim=unbind_dim)
                ):
                    self.assertEqual(x.full_tensor(), y)


if __name__ == "__main__":
    run_tests()
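
The `test_unbind` case relies on a small index rule: unbinding removes one dimension, so a shard dimension that sits after the unbound one shifts down by one, while unbinding along the sharded dimension itself is rejected. A minimal plain-Python sketch of that rule follows. `shard_dim_after_unbind` is a hypothetical helper, and it raises `ValueError` for illustration, whereas the test expects DTensor to raise a `RuntimeError` matching "Sharding propagation failed".

```python
def shard_dim_after_unbind(shard_dim: int, unbind_dim: int) -> int:
    """Return the shard dimension of each output after unbinding `unbind_dim`.

    Hypothetical helper illustrating the index arithmetic checked in the test;
    it does not call into DTensor.
    """
    if shard_dim == unbind_dim:
        # The sharded dimension disappears, so no output can stay sharded on it.
        raise ValueError("cannot unbind along the sharded dimension")
    # Dimensions before the removed one keep their index; later ones shift down.
    return shard_dim if shard_dim < unbind_dim else shard_dim - 1
```

For the 2D tensor in the test, both valid combinations, `(shard_dim=0, unbind_dim=1)` and `(shard_dim=1, unbind_dim=0)`, leave the outputs sharded on dimension 0.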

View File

@ -33,7 +33,6 @@ from torch.testing._internal.common_distributed import (
from torch.testing._internal.common_utils import (
run_tests,
skip_but_pass_in_sandcastle_if,
skipIfRocm,
TEST_WITH_DEV_DBG_ASAN,
)
@ -319,7 +318,6 @@ class ProcessGroupNCCLOpTest(MultiProcContinuousTest):
@requires_nccl()
@skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "NCCL test requires 2+ GPUs")
@skipIfRocm()
def test_nccl_watchdog_cudagraph(self):
# test that the watchdog does not crash graphs with disallowed event query
pg = self.pg

View File

@ -29,7 +29,6 @@ from torch.testing._internal.common_distributed import (
requires_accelerator_dist_backend,
)
from torch.testing._internal.common_fsdp import get_devtype
from torch.testing._internal.common_utils import skipIfRocm
from torch.testing._internal.inductor_utils import HAS_GPU
@ -368,7 +367,6 @@ class TestComputeCommReorderingMultiProc(DynamoDistributedMultiProcTestCase):
self.assertTrue(same(out, correct))
@unittest.skipIf(not HAS_GPU, "Inductor+gpu needs triton and recent GPU arch")
@skipIfRocm
# TODO: somehow inductor bg compile threads are causing hangs at exit with distributed work dtor
@patch.object(torch._inductor.config, "compile_threads", 1)
@patch.object(

View File

@ -8,11 +8,7 @@ from dataclasses import dataclass
import torch
from torch.multiprocessing.reductions import reduce_tensor
from torch.testing._internal.common_distributed import MultiProcContinuousTest
from torch.testing._internal.common_utils import (
requires_cuda_p2p_access,
run_tests,
skipIfRocm,
)
from torch.testing._internal.common_utils import requires_cuda_p2p_access, run_tests
# So that tests are written in device-agnostic way
@ -63,7 +59,6 @@ class CupyAsTensorTest(MultiProcContinuousTest):
def device(self) -> torch.device:
return torch.device(device_type, self.rank)
@skipIfRocm
def test_cupy_as_tensor(self) -> None:
"""
Test that torch.as_tensor works for cupy array interface

View File

@ -246,14 +246,16 @@ class DeviceMeshTest(DTensorTestBase):
@with_comms
def test_device_mesh_init_backend(self):
mesh = DeviceMesh(self.device_type, [1], _init_backend=False)
mesh = DeviceMesh(
self.device_type, torch.arange(10), _init_backend=False, _rank=5
)
with self.assertRaisesRegex(RuntimeError, "process groups not initialized!"):
mesh.get_group()
# coordinates should always been populated when init_backend is False, as whenever
# we call init_backend we should make sure the default pg already created
mesh.get_coordinate()
self.assertEqual(mesh.get_coordinate(), [5])
def test_fake_pg_device_mesh(self):
fake_store = FakeStore()
@ -823,6 +825,15 @@ class TestDeviceMeshGetItem(DTensorTestBase):
):
mesh_3d["cp", "dp"]
@with_comms
def test_flatten_mesh_1d(self):
mesh_shape = (4,)
mesh_dim_names = ("default",)
mesh_1d = init_device_mesh(
self.device_type, mesh_shape, mesh_dim_names=mesh_dim_names
)
mesh_1d._flatten()
@with_comms
def test_flatten_mesh_3d(self):
mesh_shape = (2, 2, 2)
@ -831,6 +842,13 @@ class TestDeviceMeshGetItem(DTensorTestBase):
self.device_type, mesh_shape, mesh_dim_names=mesh_dim_names
)
# Test flatten into an existing mesh_dim_name inside the mesh
with self.assertRaisesRegex(
RuntimeError,
"already exists for submesh of the DeviceMesh",
):
mesh_3d._flatten("dp")
# Test flatten contiguous dims
dp_cp_mesh = mesh_3d["dp", "cp"]
flattened_dp_cp_mesh = dp_cp_mesh._flatten()
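
Reviewer note (not part of the diff): the updated `test_device_mesh_init_backend` builds a 1-D mesh from `torch.arange(10)` with `_rank=5` and expects `get_coordinate()` to return `[5]`. A minimal sketch of that arithmetic follows, assuming the mesh is a row-major `arange` layout. `rank_coordinate` is a hypothetical helper written for illustration; it is not the `DeviceMesh` implementation, which may instead look up where the rank appears in an arbitrary mesh tensor.

```python
from typing import List, Sequence


def rank_coordinate(mesh_shape: Sequence[int], rank: int) -> List[int]:
    """Return the coordinate of `rank` in a row-major arange mesh.

    Hypothetical helper: assumes the mesh holds ranks 0..N-1 in
    row-major order, as torch.arange(N).reshape(mesh_shape) would.
    """
    size = 1
    for dim in mesh_shape:
        size *= dim
    if not 0 <= rank < size:
        raise ValueError(f"rank {rank} is not in a mesh of {size} ranks")

    coord = []
    # Peel off the fastest-varying (last) dimension first.
    for dim in reversed(mesh_shape):
        coord.append(rank % dim)
        rank //= dim
    return coord[::-1]


if __name__ == "__main__":
    print(rank_coordinate((10,), 5))       # 1-D case from the test
    print(rank_coordinate((2, 2, 2), 5))   # 3-D mesh shape used elsewhere in the file
```

Under this assumption the 1-D case reduces to the rank itself, which matches the `[5]` asserted in the test.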


@ -45,6 +45,8 @@ from torch.testing._internal.common_utils import (
parametrize,
requires_cuda,
skipIfRocm,
TEST_XPU,
xfailIf,
)
from torch.testing._internal.inductor_utils import HAS_GPU
from torch.utils._python_dispatch import TorchDispatchMode
@ -266,6 +268,7 @@ class TestCollectivesMultiProc(DynamoDistributedMultiProcTestCase):
@unittest.skipIf(not HAS_GPU, "Inductor+gpu needs triton and recent GPU arch")
@skip_if_lt_x_gpu(2)
@xfailIf(TEST_XPU) # https://github.com/intel/torch-xpu-ops/issues/1728
@skipIfRocm
def test_eager_async_allreduce_inductor_wait(self):
import torch.distributed as dist
@ -1528,7 +1531,8 @@ class TestCollectivesInductor(DynamoDistributedSingleProcTestCase):
@unittest.skipIf(not HAS_GPU, "Inductor+gpu needs triton and recent GPU arch")
@unittest.skipIf(not SM80OrLater, "bfloat16")
def test_all_gather_bucket(self):
@parametrize("bucket_mode", ["all", "all_custom_ops"])
def test_all_gather_bucket(self, bucket_mode):
def func(x, w, ag_0, ag_1, ag_2, ag_3, *, tag, ranks, group_size):
# do some unrelated matmuls
y = torch.mm(x, w)
@ -1576,7 +1580,7 @@ class TestCollectivesInductor(DynamoDistributedSingleProcTestCase):
with (
torch._inductor.config.patch(
{
"bucket_all_gathers_fx": "all",
"bucket_all_gathers_fx": bucket_mode,
"reorder_for_compute_comm_overlap": False,
"runtime_estimations_mms_benchmark": True,
}
@ -1595,7 +1599,9 @@ class TestCollectivesInductor(DynamoDistributedSingleProcTestCase):
# We want to make sure no unnecessary copy is made.
(
FileCheck()
.check_count(".all_gather_into_tensor_out.default(", 2, exactly=True)
.check("= torch.ops._c10d_functional.all_gather_into_tensor")
.check("torch.ops._c10d_functional.all_gather_into_tensor_out.default(")
.check("= torch.ops._c10d_functional.all_gather_into_tensor")
.run(code)
)
out = compiled(*inputs, **self.get_world_trs())
@ -1656,7 +1662,8 @@ class TestCollectivesInductor(DynamoDistributedSingleProcTestCase):
@unittest.skipIf(not HAS_GPU, "Inductor+gpu needs triton and recent GPU arch")
@unittest.skipIf(not SM80OrLater, "bfloat16")
def test_reduce_scatter_bucket(self):
@parametrize("bucket_mode", ["all", "all_custom_ops"])
def test_reduce_scatter_bucket(self, bucket_mode):
def func(x, w, rs_0, rs_1, tag, ranks, group_size):
# do some unrelated matmuls
y = torch.mm(x, w)
@ -1697,7 +1704,7 @@ class TestCollectivesInductor(DynamoDistributedSingleProcTestCase):
with torch._inductor.config.patch(
{
"bucket_reduce_scatters_fx": "fsdp",
"bucket_reduce_scatters_fx": bucket_mode,
"reorder_for_compute_comm_overlap": False,
}
):
@ -1723,7 +1730,8 @@ class TestCollectivesInductor(DynamoDistributedSingleProcTestCase):
@unittest.skipIf(not HAS_GPU, "Inductor+gpu needs triton and recent GPU arch")
@unittest.skipIf(not SM80OrLater, "bfloat16")
def test_reorder_peak_memory_bucketed(self):
@parametrize("bucket_mode", ["all", "all_custom_ops"])
def test_reorder_peak_memory_bucketed(self, bucket_mode):
"""
Simulate the case where a bucketing pass ran and grouped several inputs into one bucketed allgather.
Ensure the whole bucketed group including copy-ops get moved together rather than the copy ops preventing the
@ -1837,9 +1845,9 @@ class TestCollectivesInductor(DynamoDistributedSingleProcTestCase):
with (
torch._inductor.config.patch(
{
"bucket_all_gathers_fx": "all",
"bucket_all_gathers_fx": bucket_mode,
"bucket_all_gathers_fx_bucket_size_determinator": lambda _: 2,
"bucket_reduce_scatters_fx": "all",
"bucket_reduce_scatters_fx": bucket_mode,
"bucket_reduce_scatters_fx_bucket_size_determinator": lambda _: 2,
"reorder_for_compute_comm_overlap": True,
"reorder_for_compute_comm_overlap_passes": [


@ -7,7 +7,11 @@
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem
from torch.testing._internal.common_distributed import MultiProcContinuousTest
from torch.distributed.device_mesh import init_device_mesh
from torch.testing._internal.common_distributed import (
MultiProcContinuousTest,
skip_if_lt_x_gpu,
)
from torch.testing._internal.common_utils import (
instantiate_parametrized_tests,
parametrize,
@ -544,17 +548,11 @@ class NVSHMEMAll2AllTest(MultiProcContinuousTest):
# Check data
torch.testing.assert_close(out_expected, out[:out_numel])
@skipIfRocm
@parametrize("align", [1, 8, 16]) # `major_align` of output
def test_shuffle_combine(self, align: int) -> None:
def helper_test_dispatch_combine(self, align: int, group_name) -> None:
"""
Shuffle the tokens, then combine them, and check if the combined data is
exactly the same as the original input data
"""
torch.manual_seed(42 + self.rank)
self._init_device()
group_name = dist.group.WORLD.group_name
symm_mem.enable_symm_mem_for_group(group_name)
dtype = torch.float
@ -628,6 +626,36 @@ class NVSHMEMAll2AllTest(MultiProcContinuousTest):
).to(torch.int64)
torch.testing.assert_close(combine_out_splits_offsets[1], inp_offsets)
@skipIfRocm
@parametrize("align", [1, 8, 16]) # `major_align` of output
def test_dispatch_combine(self, align: int) -> None:
"""
Test dispatch-and-combine over World group
"""
torch.manual_seed(42 + self.rank)
self._init_device()
self.helper_test_dispatch_combine(align, dist.group.WORLD.group_name)
@skipIfRocm
# TODO: FIXIT. Currently, `MultiProcContinuousTest` treats the skip code as a
# failure
@skip_if_lt_x_gpu(4)
def test_dispatch_combine_subgroup(self) -> None:
"""
Test dispatch-and-combine over concurrent subgroups
"""
torch.manual_seed(42 + self.rank)
self._init_device()
symm_mem.enable_symm_mem_for_group(dist.group.WORLD.group_name)
# Test on two concurrent subgroups
ngroups = 2
subgroup_size = self.world_size // ngroups
dm = init_device_mesh(
device_type, (ngroups, subgroup_size), mesh_dim_names=("dp", "ep")
)
subgroup = dm.get_group("ep")
self.helper_test_dispatch_combine(align=8, group_name=subgroup.group_name)
if __name__ == "__main__":
run_tests()
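
Reviewer note (not part of the diff): `test_dispatch_combine_subgroup` splits the world into `ngroups = 2` subgroups through a `(ngroups, subgroup_size)` mesh with dims `("dp", "ep")`. The sketch below shows which ranks would share an `"ep"` group, assuming ranks are laid out row-major over the mesh, which is how `init_device_mesh` arranges them for a plain shape tuple. `mesh_groups` is a hypothetical helper written for illustration; it does not call into `torch.distributed`.

```python
from typing import List


def mesh_groups(world_size: int, ngroups: int, dim: str) -> List[List[int]]:
    """List the rank groups along one dim of a 2-D (dp, ep) mesh.

    Hypothetical helper: assumes ranks 0..world_size-1 fill a
    (ngroups, subgroup_size) grid in row-major order.
    """
    if world_size % ngroups != 0:
        raise ValueError("world_size must be divisible by ngroups")
    subgroup_size = world_size // ngroups

    # Row i holds ranks [i * subgroup_size, (i + 1) * subgroup_size).
    rows = [
        list(range(i * subgroup_size, (i + 1) * subgroup_size))
        for i in range(ngroups)
    ]
    if dim == "ep":
        # "ep" is the last mesh dim, so each row is one group.
        return rows
    if dim == "dp":
        # "dp" is the first mesh dim, so each column is one group.
        return [[row[j] for row in rows] for j in range(subgroup_size)]
    raise ValueError(f"unknown mesh dim: {dim!r}")


if __name__ == "__main__":
    print(mesh_groups(4, 2, "ep"))
    print(mesh_groups(4, 2, "dp"))
```

With four ranks, the two `"ep"` groups are disjoint and can run dispatch-and-combine concurrently, which is the scenario the new test exercises.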


@ -7,11 +7,7 @@
import torch
from torch.multiprocessing.reductions import reduce_tensor
from torch.testing._internal.common_distributed import MultiProcContinuousTest
from torch.testing._internal.common_utils import (
requires_cuda_p2p_access,
run_tests,
skipIfRocm,
)
from torch.testing._internal.common_utils import requires_cuda_p2p_access, run_tests
# So that tests are written in device-agnostic way
@ -34,7 +30,6 @@ class P2PIpcTest(MultiProcContinuousTest):
def device(self) -> torch.device:
return torch.device(device_type, self.rank)
@skipIfRocm
def test_p2p_ipc(self) -> None:
"""
Test that cross-process P2P access works, by reducing a tensor,


@ -0,0 +1,90 @@
#!/usr/bin/env python3
# Owner(s): ["oncall: r2p"]
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
import os
from unittest.mock import MagicMock, patch
import torch.distributed.run as run
from torch.distributed.launcher.api import launch_agent, LaunchConfig
from torch.testing._internal.common_utils import run_tests, TestCase
class RunTest(TestCase):
def setUp(self):
# Save original environment variable if it exists
self.original_signals_env = os.environ.get(
"TORCHELASTIC_SIGNALS_TO_HANDLE", None
)
def tearDown(self):
# Restore original environment variable
if self.original_signals_env is not None:
os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"] = self.original_signals_env
elif "TORCHELASTIC_SIGNALS_TO_HANDLE" in os.environ:
del os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"]
def test_signals_to_handle_default(self):
"""Test that the default value for signals_to_handle is correctly set."""
parser = run.get_args_parser()
args = parser.parse_args(["dummy_script.py"])
self.assertEqual(args.signals_to_handle, "SIGTERM,SIGINT,SIGHUP,SIGQUIT")
def test_signals_to_handle_custom(self):
"""Test that a custom value for signals_to_handle is correctly parsed."""
parser = run.get_args_parser()
args = parser.parse_args(
["--signals-to-handle=SIGTERM,SIGUSR1,SIGUSR2", "dummy_script.py"]
)
self.assertEqual(args.signals_to_handle, "SIGTERM,SIGUSR1,SIGUSR2")
def test_config_from_args_signals_to_handle(self):
"""Test that the signals_to_handle argument is correctly passed to LaunchConfig."""
parser = run.get_args_parser()
args = parser.parse_args(
["--signals-to-handle=SIGTERM,SIGUSR1,SIGUSR2", "dummy_script.py"]
)
config, _, _ = run.config_from_args(args)
self.assertEqual(config.signals_to_handle, "SIGTERM,SIGUSR1,SIGUSR2")
@patch("torch.distributed.launcher.api.LocalElasticAgent")
@patch("torch.distributed.launcher.api.rdzv_registry.get_rendezvous_handler")
def test_launch_agent_sets_environment_variable(self, mock_get_handler, mock_agent):
"""Test that launch_agent sets the TORCHELASTIC_SIGNALS_TO_HANDLE environment variable."""
# Setup
config = LaunchConfig(
min_nodes=1,
max_nodes=1,
nproc_per_node=1,
signals_to_handle="SIGTERM,SIGUSR1,SIGUSR2",
)
entrypoint = "dummy_script.py"
args = []
# Make sure the environment variable doesn't exist before the test
if "TORCHELASTIC_SIGNALS_TO_HANDLE" in os.environ:
del os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"]
# Mock agent.run() to return a MagicMock
mock_agent_instance = MagicMock()
mock_agent_instance.run.return_value = MagicMock(
is_failed=lambda: False, return_values={}
)
mock_agent.return_value = mock_agent_instance
# Call launch_agent
launch_agent(config, entrypoint, args)
# Verify that the environment variable was set correctly
self.assertEqual(
os.environ["TORCHELASTIC_SIGNALS_TO_HANDLE"], "SIGTERM,SIGUSR1,SIGUSR2"
)
if __name__ == "__main__":
run_tests()
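
Reviewer note (not part of the diff): the tests above check that `--signals-to-handle` travels as a comma-separated string into `LaunchConfig` and then into the `TORCHELASTIC_SIGNALS_TO_HANDLE` environment variable. The diff does not show how a consumer turns that string back into signal objects. The sketch below is one plausible way to do it with the standard library. `parse_signals` and `signals_from_env` are hypothetical names, and the fallback default used here is an assumption, not the value torchelastic uses.

```python
import os
import signal
from typing import List

# Environment variable name taken from the tests above.
ENV_VAR = "TORCHELASTIC_SIGNALS_TO_HANDLE"


def parse_signals(spec: str) -> List[signal.Signals]:
    """Turn "SIGTERM,SIGINT" into [signal.SIGTERM, signal.SIGINT].

    Hypothetical helper; raises ValueError on an unknown signal name.
    """
    parsed = []
    for name in spec.split(","):
        name = name.strip()
        if not name:
            continue  # tolerate stray commas
        try:
            parsed.append(signal.Signals[name])
        except KeyError:
            raise ValueError(f"unknown signal name: {name!r}") from None
    return parsed


def signals_from_env(default: str = "SIGTERM,SIGINT") -> List[signal.Signals]:
    """Read the signal list from the environment (default is an assumption)."""
    return parse_signals(os.environ.get(ENV_VAR, default))


if __name__ == "__main__":
    os.environ[ENV_VAR] = "SIGTERM,SIGINT"
    print(signals_from_env())
```

Looking names up in `signal.Signals` keeps the parsing platform-aware: a name such as `SIGUSR1` that does not exist on the current platform is rejected rather than silently ignored.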


@ -644,7 +644,7 @@ class SymmMemEmptySetDeviceTest(MultiProcessTestCase):
symm_mem_hdl.barrier()
@runOnRocmArch(MI300_ARCH)
@skipIfRocm
@skip_if_lt_x_gpu(2)
@parametrize("set_device", [True, False])
def test_empty_strided_p2p(self, set_device: bool) -> None:


@ -18,7 +18,7 @@ from torch._dynamo.backends.common import aot_autograd
from torch._dynamo.testing import CompileCounterWithBackend
from torch._higher_order_ops.wrap import tag_activation_checkpoint
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import IS_WINDOWS, skipIfHpu, skipIfRocm
from torch.testing._internal.common_utils import IS_WINDOWS, skipIfHpu
from torch.testing._internal.inductor_utils import HAS_CUDA_AND_TRITON
from torch.testing._internal.triton_utils import requires_cuda_and_triton
from torch.testing._internal.two_tensor import TwoTensor
@ -1364,7 +1364,6 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
self.assertEqual(out, out_compiled)
self.assertEqual(input.grad, input_compiled.grad)
@skipIfRocm
@requires_cuda_and_triton
def test_autocast_flash_attention(self, device):
def fn(primals_1, primals_2, primals_3):


@ -726,14 +726,14 @@ Call to `torch._dynamo.graph_break()`
Unsupported,
lambda: torch.compile(fn, backend="eager", fullgraph=True)(),
"""\
LOAD_BUILD_CLASS bytecode not supported
Explanation: Dynamo does not support tracing classes that are defined in the compiled region.
Hint: Move the class definition out of the compiled region.
Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.
Attempted to call function marked as skipped
Explanation: Dynamo does not know how to trace the builtin `builtins.__build_class__.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
Hint: If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
Hint: If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
Developer debug context:
Developer debug context: module: builtins, qualname: __build_class__, skip reason: <missing reason>
For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0075.html
For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0007.html
from user code:
File "test_error_messages.py", line N, in fn

Some files were not shown because too many files have changed in this diff