Compare commits

333 Commits

Author SHA1 Message Date
46eaea0232 update test_quantization tests to run weekly 2025-09-23 09:02:34 -07:00
6b231af23d [lint][CI] Don't checkout submodules for lintrunner-noclang (#162844)
Shouldn't be needed?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162844
Approved by: https://github.com/huydhn
2025-09-15 17:29:31 +00:00
19a4ef0256 [DeviceMesh] Make CuTe layout as mesh layout to be ready for using in DeviceMesh (#162414)
We create a wrapper class named "_MeshLayout" acting as a layout for device mesh so that we can add new methods more specific to DeviceMesh and keep the core logic of CuTe manipulation inside the pycute module. This PR creates the main body of the code; the next PR will come with the actual implementation and unit tests for the device mesh layout. (The actual implementation can be found in https://github.com/pytorch/pytorch/pull/161016)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162414
Approved by: https://github.com/ezyang, https://github.com/fegin
ghstack dependencies: #162413, #162534
2025-09-15 17:04:41 +00:00
9cd54d3443 Clean up 'torch.onnx' entries from public API allowlist (#162850)
Clean up entries related to 'torch.onnx' from the allowlist, as the APIs in onnx are now properly configured.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162850
Approved by: https://github.com/albanD
2025-09-15 16:14:43 +00:00
0826aafa04 [ROCm/Windows] Support aotriton for scaled_dot_product_attention on Windows. (#162330)
Enables flash attention and/or memory-efficient attention on Windows for scaled_dot_product_attention via aotriton.
Already tested to be working on Windows with TheRock.

Steps to enable: simply set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604
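
A quick smoke test once PyTorch is built with those flags (illustrative; assumes a supported ROCm GPU, which still shows up under the `cuda` device type, and fp16 inputs):

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3))
# restrict SDPA to the flash attention backend to confirm it is actually available
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```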

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162330
Approved by: https://github.com/jeffdaily

Co-authored-by: Scott Todd <scott.todd0@gmail.com>
2025-09-15 16:13:03 +00:00
5dc4e78047 Fix excess refcounting in ObjLoaderFunc (#161528)
expectRef is preferred over expect because it doesn't copy a std::shared_ptr.

Differential Revision: [D81053710](https://our.internmc.facebook.com/intern/diff/D81053710/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161528
Approved by: https://github.com/Skylion007
2025-09-15 16:05:50 +00:00
c9e57d7e9f [CI] Move libtorch-cpu-shared-with-deps-release-build to python 3.10 (#162877)
Related to https://github.com/pytorch/pytorch/pull/162862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162877
Approved by: https://github.com/malfet
2025-09-15 15:27:25 +00:00
70337a066f [easy] Handle Autotuners in get_triton_source_codes_for_gm (#161914)
Some Triton kernels are wrapped in autotuners; in that case, grab the underlying function from the autotuner.
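
A minimal sketch of the unwrapping, assuming Triton's `Autotuner` keeps the wrapped kernel on its `fn` attribute (treat the attribute and import paths as assumptions, not the exact Inductor code):

```python
from triton.runtime.autotuner import Autotuner
from triton.runtime.jit import JITFunction

def get_kernel_source(kernel):
    # autotuned kernels wrap the JIT function, so unwrap before reading the source
    if isinstance(kernel, Autotuner):
        kernel = kernel.fn
    assert isinstance(kernel, JITFunction)
    return kernel.src
```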

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161914
Approved by: https://github.com/oulgen
2025-09-15 15:19:04 +00:00
7d1bcd9aea [easy] Fix unsigned long issue in static cuda launcher (#162920)
Fixes https://github.com/pytorch/pytorch/issues/162430

It's a bit hard to come up with a unit test where the stream exceeds a C++ long, so just using existing unit tests for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162920
Approved by: https://github.com/Skylion007, https://github.com/jansel
2025-09-15 15:00:32 +00:00
09cbf34e93 [BE] Preserve caller source location in the error message (#162808)
Summary:
Currently the C10_CUDA_CHECK only shows source location in CUDAException like below:
```
Exception raised from c10_cuda_check_implementation at fbcode/caffe2/c10/cuda/CUDAException.cpp:44
```
which is not terribly useful.

By checking the original diff D39619861 that introduced c10_cuda_check_implementation, it seems the original macro would show the source location correctly but c10_cuda_check_implementation broke it.

This diff will propagate caller source location to c10_cuda_check_implementation to fix the issue.

Test Plan:
CI

Observed desired error message after the change:
```
CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Device-side assertion tracking was not enabled by user.
Exception raised from operator() at fbcode/sigrid/predictor/aed/AedContainer.cpp:659 (most recent call first):
```

Note the last line reports actual caller location.

Rollback Plan:

Reviewed By: Raymo111

Differential Revision: D81880552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162808
Approved by: https://github.com/janeyx99
2025-09-15 13:29:43 +00:00
456fbeaa6d [xla hash update] update the pinned xla hash (#162947)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162947
Approved by: https://github.com/pytorchbot
2025-09-15 11:42:02 +00:00
a8c80f3fa9 Update slow tests (#162946)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162946
Approved by: https://github.com/pytorchbot
2025-09-15 11:31:37 +00:00
bf6b40da3e fix deterministic scatter_add path for multi-d tensors (#162866)
Previously, for tensors with more than 2 dimensions, `select` didn't work correctly.
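
A minimal repro of the shape this fix targets (assumes a CUDA device; sizes are illustrative):

```python
import torch

torch.use_deterministic_algorithms(True)
x = torch.zeros(2, 3, 4, device="cuda")
index = torch.zeros(2, 3, 4, dtype=torch.long, device="cuda")
src = torch.ones(2, 3, 4, device="cuda")
# the deterministic path previously mishandled `select` for tensors with more than 2 dims
x.scatter_add_(2, index, src)
print(x[..., 0])  # every index points at position 0, so each slice accumulates to 4.0
```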
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162866
Approved by: https://github.com/valentinandrei
2025-09-15 06:50:00 +00:00
814ba34fa6 [2/N] Port 5 _composable distributed test to Intel GPU (#159241)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This is the second PR for _composable cases, the first is https://github.com/pytorch/pytorch/pull/159118.
We enable Intel GPU with the following methods, trying our best to keep the original code style (see the sketch after this list):

- Use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- Enable XPU for some test paths
- Skip some test cases which Intel GPU does not support
- Add "cpu:gloo,xpu:xccl" as the distributed backend

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159241
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-15 06:24:58 +00:00
06bb32d55e Skip empty tests, they don't make sense for numerics (#162932)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162932
Approved by: https://github.com/dcci
2025-09-15 06:20:26 +00:00
b3ad8f4a9c [BUG] Fix nonzero_static crash on CUDA when the input is a empty tensor (#162578)
Fixes #162473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162578
Approved by: https://github.com/ngimel
2025-09-15 05:44:15 +00:00
755cf90672 Redirect all use of filesystem to c10/utils/FileSystem.h (#162914)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162914
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/cyyever
2025-09-15 04:30:41 +00:00
76e5df3866 [BE] Use fmt::format to define Conv key (#162925)
Also use `getArrayRefString` instead of having separate cases for 2D and 3D Conv
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162925
Approved by: https://github.com/Skylion007
ghstack dependencies: #162921
2025-09-15 02:44:12 +00:00
7fe1f5ea49 [BE] Delete [Ventura|Sonoma]Ops header (#162921)
Was a temp solution to make PyTorch+MPS buildable on MacOS-12, but it's no longer needed, as in 2.9+ MPS is only supported on MacOS Sonoma+
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162921
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-09-15 02:44:12 +00:00
e156a07171 [Precompile] [RFC] Implement aot_compile_module (#162171)
This PR adds a new interface _aot_compile to `OptimizedModule`, so that the following is possible:

```
mod = SimpleLinearModule()
inputs = [
    ModelInput(
        args=(torch.randn(3, 3),),
        kwargs={},
        contexts=[torch.no_grad(), eval_mode(model)],
    ),
    ModelInput(
        args=(torch.randn(3, 3),), kwargs={}, contexts=[train_mode(model)]
    ),
]
assert isinstance(model, torch._dynamo.eval_frame.OptimizedModule)
model._aot_compile(
    inputs,
)
```

After this PR, you can AOT precompile NanoGPT and use it to train directly. I'll share my fork of the repo to make this work.

## ModelInput
The `ModelInput` API is a work in progress; for now it represents a set of inputs and contexts to instruct the compiler to compile. Most commonly, this is "compile an eval mode with no grad, and  a training mode with grad", but also contains things like autocasting contexts, etc.

## Dispatch
Dispatching is super simple here: we just iterate through all the precompiled fullgraphs and check guards for each one until there's one that passes. I'm a bit worried that having this in Python code is going to be too expensive. The guard checks happen in C++ anyway, though, so the only Python-bottlenecked step here is the for loop, so perhaps the overhead will not be high. I'll work on measuring this, though.
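
A rough sketch of that loop (illustrative only, not the actual implementation):

```python
def dispatch(precompiled_entries, args, kwargs):
    # each entry pairs a guard-check function (evaluated in C++) with a compiled fullgraph
    for check_guards, compiled_fn in precompiled_entries:
        if check_guards(args, kwargs):
            return compiled_fn(*args, **kwargs)
    raise RuntimeError("no precompiled graph matched the given inputs")
```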

## TODOs

This PR does not support `mod.compile()`, only `torch.compile(mod)`. In order to support `mod.compile()`, we'll need to update torch.nn.Module with an updated implementation — I can add that frontend later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162171
Approved by: https://github.com/zhxchen17
2025-09-14 23:32:28 +00:00
ba5ca31676 [MPS] sparse mps any (#162885)
Add SparseMPS key for any op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162885
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-09-14 18:57:53 +00:00
8e1db46493 [MPS] enable empty like and unsqueeze for SparseMPS (#162910)
Enable empty like and unsqueeze for SparseMPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162910
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-09-14 17:47:06 +00:00
aff2438554 QoL: add pip to requirements-build.txt (#162896)
uv venvs don't come with pip by default, but setup.py, for example, assumes it is available.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162896
Approved by: https://github.com/Skylion007
2025-09-14 17:08:05 +00:00
3f8a2e62ea Fix rebind_unbacked in torch.fx.experimental.symbolic_shapes (#162788)
## Description
Fix float type handling in the `rebind_unbacked` function of `torch.fx.experimental.symbolic_shapes`. [#162480](https://github.com/pytorch/pytorch/issues/162480)

## Issue
When I used AOTInductor to compile YOLOv10, I encountered the bug `'float' object has no attribute 'node'`.
[Torch AOTInductor Ahead-Of-Time Compilation Fail](https://github.com/opendatalab/DocLayout-YOLO/issues/177)

The problem is due to missing float type handling.
https://github.com/pytorch/pytorch/blob/main/torch/fx/experimental/symbolic_shapes.py#L597
```
            if isinstance(u1, int):
                log.info(
                    "rebind_unbacked: discard %s %s %s -> %s",
                    n.target,
                    raw_u0,
                    path,
                    u1,
                )
                continue
```

## Solution
Change the code `if isinstance(u1, int)` to `if isinstance(u1, (int, float))`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162788
Approved by: https://github.com/ezyang
2025-09-14 17:07:14 +00:00
6d64bc3990 [data foundation][vizard] Prevent checking the device type of numpy object in Tensorboard logger (#162888)
Summary:
The check introduced in D82262053 assumed a `torch.Tensor`, but `scalar_value` could be a numpy object.
- Move the `device.type` check into the `make_np` method so it happens only when the value is a `torch.Tensor` (see the sketch below).
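
A sketch of the intended placement (the actual device check from the internal diff isn't shown, so a placeholder `.cpu()` move is used; `make_np` is the existing converter in `torch.utils.tensorboard._convert_np`):

```python
import numpy as np
import torch

def make_np(x):
    if isinstance(x, np.ndarray):
        return x
    if isinstance(x, torch.Tensor):
        # only tensors have a device, so the device.type check belongs here,
        # not on the raw scalar_value (which may already be a numpy object)
        if x.device.type != "cpu":
            x = x.cpu()
        return x.detach().numpy()
    return np.array(x)
```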

Test Plan:
```
vizard launch -j 1x8 --launch=flow --config-path=pkg://vizard_projects.image_classification.configs --config-name=resnet50 ++flow.secure_group=ml_sensors ++flow.entitlement=ai_frameworks_pnb ++max_train_steps_per_epoch=10 ++max_epochs=5 ++log_every_n_steps=10 ++profiler=null ++max_eval_steps_per_epoch=10
```

Rollback Plan:

Differential Revision: D82383428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162888
Approved by: https://github.com/xush6528
2025-09-14 08:09:08 +00:00
972140b7e9 [benchmark] Add HF LLM benchmarks (#156967)
Results in https://docs.google.com/spreadsheets/d/1xXOPg9JjEmPx0zc5QBNdyXQq8-K2_r4ybHaiS-q7pZ0/edit?gid=88695043#gid=88695043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156967
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-14 07:41:06 +00:00
84186c39ed [NVRTC] Enable compiling templated kernels (#162875)
Per NVRTC doc - https://docs.nvidia.com/cuda/nvrtc/index.html#accessing-lowered-names, we can compile a templated kernel (e.g. `kernel<float>`) with the following steps

NVRTC side
- (new) `nvrtcAddNameExpression` -> C++ template e.g. `f<float>`
- `nvrtcCompileProgram`
- (new) `nvrtcGetLoweredName` -> get the mangled name. We need to copy it, since the string is freed when the NVRTC program is destroyed
- `nvrtcDestroyProgram`

CUDA side
- use mangled name instead of normal name -> profit
- `extern "C"` is not even needed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162875
Approved by: https://github.com/msaroufim
2025-09-14 06:17:36 +00:00
74a35c6344 [Triton] [Inductor] Enable TMA store for TMA mm templates (#160480)
Summary:
Adds support for TMA store in all TMA matmul templates (notably persistent_tma, including addmm and scaled_mm). This works by requiring that a template be registered with `tma_store=True` and, when that is met, constructing indices/range_trees that hook into the existing code base's TMA store support.

This also includes a couple of notable changes:
- Adds support in the TMA template for checking the output layout.
- Adds support for "hoisting" the tensor descriptor to the top of the kernel. This is currently only used by template code, but in principle it can be generalized to other implementations.
- Supports considering multiple indices as the "contiguous" index. This is handled by transposing the input data when the alignment is no longer consistent. In general, since the TMA support is derived from the index, it doesn't seem reasonable that the 1D index math forces a certain alignment depending on index ordering, so long as the layout matches.

Test Plan:
Tested with test_max_autotune.py unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160480
Approved by: https://github.com/NikhilAPatel
2025-09-14 04:56:49 +00:00
d2f6daf6a7 [audio hash update] update the pinned audio hash (#162892)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162892
Approved by: https://github.com/pytorchbot
2025-09-14 04:27:37 +00:00
e74b21d66a [vllm hash update] update the pinned vllm hash (#162891)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162891
Approved by: https://github.com/pytorchbot
2025-09-14 04:27:35 +00:00
f01bf0f64b Do not use // but use CleanDiv or FloorDiv instead (#162869)
Summary:
When rewriting sympy expressions in the compiler codebase, we want to generate FloorDiv(a, b) / CleanDiv(a, b) directly and not a//b, since the latter becomes floor(a*pow(b, -1)).

For symnodes we automatically handle that conversion in the symnode op dispatch.
I will follow up with an issue to track all other usages of //.
This was blocking an internal model.
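
A small illustration of the difference (assuming `FloorDiv` is importable from `torch.utils._sympy.functions`, as in current PyTorch):

```python
import sympy
from torch.utils._sympy.functions import FloorDiv

a, b = sympy.symbols("a b", integer=True, positive=True)
print(a // b)          # floor(a/b), i.e. floor(a*pow(b, -1)) -- harder to pattern-match later
print(FloorDiv(a, b))  # stays a structured FloorDiv(a, b) node the compiler can recognize
```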

Test Plan:
add test
run existing tests.
dakechen1993 testing on the model.

Rollback Plan:

Differential Revision: D82362241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162869
Approved by: https://github.com/ezyang
2025-09-14 01:30:33 +00:00
886699bc5c Port shared_ptr optimization in std::shared_ptr to intrusive_ptr (#162784)
Summary:
Please see D21021645 for details about the optimization and why it's beneficial.

A similar change has been added to libstdc++ as well, see dbf8bd3c2f

Rollback Plan:

Reviewed By: yfeldblum

Differential Revision: D81960754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162784
Approved by: https://github.com/swolchok
2025-09-13 21:01:00 +00:00
72b5159782 [flatbuffer] Fix compile error due to discarded result (#162767)
Summary:
One of our builds fails because the return value of fread is discarded. Explicit cast to void fixes the build.

```log
In file included from fbcode/caffe2/torch/csrc/jit/mobile/import.cpp:15:
fbcode/caffe2/torch/csrc/jit/mobile/file_format.h:156:3: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result]
  156 |   fread(data.get(), size, 1, f);
      |   ^~~~~ ~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
...
BUILD FAILED
Failed to build 'fbcode//caffe2:libtorch (cfg:opt-linux-x86_64-clang19-no-san-opt-by-default#fef256f7ee896871)'
```

Test Plan:
No runtime behavior change. CI.

Rollback Plan:

Differential Revision: D82265002

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162767
Approved by: https://github.com/Skylion007
2025-09-13 20:24:43 +00:00
f37eaebed1 Add missing tags parameter to custom_op overload signatures (#162047)
It appears to be an omission in #149782.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162047
Approved by: https://github.com/zou3519, https://github.com/BoyuanFeng

Co-authored-by: Boyuan Feng <fby.1994@gmail.com>
2025-09-13 19:57:23 +00:00
5b9114bf19 Revert "[ROCm/Windows] Support aotriton for scaled_dot_product_attention on Windows. (#162330)"
This reverts commit 62843c14bbf694f5722fd6e1075da4792507fe42.

Reverted https://github.com/pytorch/pytorch/pull/162330 on behalf of https://github.com/atalman due to Sorry reverting looks like broke windows nightlies see https://github.com/pytorch/pytorch/issues/162881 ([comment](https://github.com/pytorch/pytorch/pull/162330#issuecomment-3288544921))
2025-09-13 15:43:50 +00:00
deb7ebe0a3 Revert "[Reland] Use std::string_view in torchgen (#158625)"
This reverts commit 972e409829343cc2062aeee0994a9c1c735d216a.

Reverted https://github.com/pytorch/pytorch/pull/158625 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break a couple of ExecuTorch tests for Vulkan backend ([comment](https://github.com/pytorch/pytorch/pull/158625#issuecomment-3287754275))
2025-09-13 07:52:50 +00:00
9c93dc8123 Revert "Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#160532)"
This reverts commit a956c4ab1cb13079203a8f07eb26218724f54dc8.

Reverted https://github.com/pytorch/pytorch/pull/160532 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160532#issuecomment-3287745165))
2025-09-13 07:42:12 +00:00
31040b6357 Revert "port some distributed tensor test files for Intel GPU (#161703)"
This reverts commit 179f10621b418427fc6e92f58ea2b0bbe4cc9c52.

Reverted https://github.com/pytorch/pytorch/pull/161703 on behalf of https://github.com/huydhn due to Sorry for reverting your change but these tests are failing internally ([comment](https://github.com/pytorch/pytorch/pull/161703#issuecomment-3287720713))
2025-09-13 07:22:14 +00:00
aa41d3e49c Claude loves making these files in top level, ignore them for sanity. (#162806)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162806
Approved by: https://github.com/albanD
2025-09-13 04:59:00 +00:00
f0fcf436c5 [audio hash update] update the pinned audio hash (#162864)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162864
Approved by: https://github.com/pytorchbot
2025-09-13 04:17:21 +00:00
5663910472 [vllm hash update] update the pinned vllm hash (#162751)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162751
Approved by: https://github.com/pytorchbot
2025-09-13 04:16:51 +00:00
da669d51bf fusion of large accumulated reads only at ir level (#161978)
This is to revert some of the changes in https://github.com/pytorch/pytorch/pull/158667

In particular, we only disallow fusion of large accumulated reads at the IR level and not at the scheduler level, as users can create their own custom fusion logic at the scheduler level.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161978
Approved by: https://github.com/yf225
2025-09-13 04:07:25 +00:00
783985e9fe kjt pytree registration (#161114)
Differential Revision: D80656182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161114
Approved by: https://github.com/henryoier
2025-09-13 03:57:43 +00:00
49d30f9a23 Fix boxcox to return same result for same input in one batch (#162772)
Summary:
The SIMD path uses the SLEEF version of `pow`, which is slightly different from `std::pow`. The fix is to use the same vectorized code (with partial load and store) for the trailing data as well, to ensure consistency between results.

Rollback Plan:

Differential Revision: D82265247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162772
Approved by: https://github.com/swolchok
2025-09-13 03:57:35 +00:00
66133b1ab7 Build vLLM aarch64 nightly wheels (#162664)
PyTorch has published its aarch64 nightly wheels for all CUDA version after https://github.com/pytorch/pytorch/pull/162364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162664
Approved by: https://github.com/atalman
2025-09-13 03:43:55 +00:00
543d50db2b Fix torch export with dict input nested in args (#162618)
Investigated together with @pyemma and @taotaohuang001

## Problem
When calling the exported module with a dict nested in the args tuple, it makes the following complaint:
```
Traceback (most recent call last):
  File "/home/chzhu/infinitrain/test_torch_export.py", line 32, in <module>
    print(exported_model({"a2": torch.randn(10), "a1": torch.randn(10)}))
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 848, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 424, in __call__
    raise e
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 411, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1879, in _call_impl
    return inner()
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1806, in inner
    args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
    return fn(*args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_unlift.py", line 81, in _check_input_constraints_pre_hook
    flat_args_with_path = _check_inputs_match(args, kwargs, self._in_spec)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_unlift.py", line 64, in _check_inputs_match
    raise ValueError(  # noqa: B904
ValueError: Trying to flatten user inputs with exported input tree spec:
TreeSpec(tuple, None, [TreeSpec(tuple, None, [TreeSpec(dict, ['a1', 'a2'], [*,
      *])]),
  TreeSpec(dict, [], [])])
but actually got inputs with tree spec of:
TreeSpec(tuple, None, [TreeSpec(tuple, None, [TreeSpec(dict, ['a2', 'a1'], [*,
      *])]),
  TreeSpec(dict, [], [])]).
Please check that the inputs have the same number and type of args and kwargs as the ones you used when tracing.

```

## How to reproduce the issue
```python
import torch

# create a nn.Module with data_batch as input and output as output
class MyModel(torch.nn.Module):
   def __init__(self):
       super(MyModel, self).__init__()
       self.linear = torch.nn.Linear(10, 1)

   def forward(self, data_batch):
       h1 = self.linear(data_batch["a1"])
       h2 = self.linear(data_batch["a2"])
       return h1 + h2

# torch export this module
model = MyModel()
example_args_forward = (
   {
       "a1": torch.randn(10),
       "a2": torch.randn(10),
   },
)
exported_model = torch.export.export(model, example_args_forward, strict=True)

# save the exported model
torch.export.save(exported_model, "exported_model.pt2")

# load the exported model
exported_model = torch.export.load("exported_model.pt2").module()

# run the exported model
print(exported_model({"a2": torch.randn(10), "a1": torch.randn(10)}))

```

## Root Cause
The input spec is encoded as a [TreeSpec](582d278983/torch/utils/_pytree.py (L1059)) in torch export, with (args, kwargs) at the top level. When we call the exported model, a pre-execution [hook](582d278983/torch/export/_unlift.py (L66)) checks that the received inputs' TreeSpec matches the exported TreeSpec, and in a TreeSpec the dict key order is preserved. Something like

TreeSpec(dict, ['a2', 'a1'], [*,*])

To work around this, the input check reorders [kwargs](582d278983/torch/export/_unlift.py (L67)), which is why kwargs can be passed out of order. But a dict nested in the args is not re-ordered, so any re-ordering of its keys throws the error above.

## Solution
Update `eq_spec` to handle the dict case, where we only require that the key sets match, without ordering constraints.
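
A minimal sketch of the relaxed comparison (illustrative only, not the exact `eq_spec` code; it relies on a dict `TreeSpec` storing its keys in `context`):

```python
from torch.utils._pytree import TreeSpec

def dict_specs_match(a: TreeSpec, b: TreeSpec) -> bool:
    if a.type is dict and b.type is dict:
        # same key set is enough; ordering no longer matters
        return set(a.context) == set(b.context) and len(a.children_specs) == len(b.children_specs)
    return a == b
```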
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162618
Approved by: https://github.com/angelayi
2025-09-13 03:24:30 +00:00
7dd5f7b125 Revert "python fastpath for DTensor detach(), confirm that aliasing DTensorSpec is ok (#160580)"
This reverts commit 4b2d297eec425475a82934a52e0edd96805524a1.

Reverted https://github.com/pytorch/pytorch/pull/160580 on behalf of https://github.com/bdhirsh due to this broke shampoo, yanking ([comment](https://github.com/pytorch/pytorch/pull/160580#issuecomment-3287372891))
2025-09-13 02:04:36 +00:00
a956c4ab1c Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#160532)
Summary:

To support exporting a CUDA model on a CPU-only machine under fake tensor mode, users commonly need to move sample inputs to the CUDA device with a .to("cuda:0") or .to("cuda") call.
This diff supports that.
I expect the following pattern to work:
```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])
    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)
```

Test Plan:
CI

Rollback Plan:

Differential Revision: D80181887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160532
Approved by: https://github.com/henryoier, https://github.com/ezyang
2025-09-13 01:50:51 +00:00
0925c644ed [DCP] Decrease checkpoint background process Gloo pg init timeout (#162760)
Summary:
Sometimes checkpoint background process creation times out during gloo pg init.
Attempting to destroy the process during that time can block the trainer thread until the timeout completes.

This diff reduces the pg init timeout from 30m -> 10m to reduce the cleanup time.

Test Plan:
CI

Rollback Plan:

Differential Revision: D81724668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162760
Approved by: https://github.com/meetv18
2025-09-13 01:50:40 +00:00
b2553a6ec4 [AOTI] raise PyTorchStreamWriter open failed error code on windows (#162799)
While debugging the AOTI UT `TestAOTInductorPackage_cpu::test_add`, I found that it didn't output a verbose error code when PyTorchStreamWriter failed to open the file.

This PR adds verbose error code output for debugging. A local test is shown below:
<img width="1124" height="653" alt="image" src="https://github.com/user-attachments/assets/01cb1a51-2982-4106-8b5b-c608ac26a075" />

The error code is 32; we can look up Windows error code 32 at https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-
```
ERROR_SHARING_VIOLATION
32 (0x20)
The process cannot access the file because it is being used by another process.
```

This issue is caused by the file being opened by another process. I fixed the same issue in zip open in PR https://github.com/pytorch/pytorch/pull/162617, but I still have no idea how to open a file with shared access via `std::ofstream`. I will continue researching it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162799
Approved by: https://github.com/jansel
2025-09-13 01:41:14 +00:00
a749c40342 [Bilinear] move check to reset_parameters (#160952)
Fixes #160407

### Summary:
Moved the check to `reset_parameters` to make the `Bilinear` module lazy. Lazy modules have `in_features` initialized to 0 and a pre-forward hook that initializes these to the appropriate shape, then calls `reset_parameters`.

### Impact:
module: nn, linear.py

### Test:

<img width="903" height="182" alt="Screenshot From 2025-08-19 13-27-12" src="https://github.com/user-attachments/assets/bc04b0d6-5174-4dc9-8b21-9e019b3822a5" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160952
Approved by: https://github.com/mikaylagawarecki
2025-09-13 01:17:10 +00:00
595e13feb7 [BE] [Inductor] Update NoValidChoicesError logic (#162814)
Summary: Updates the NoValidChoicesError logic to include some additional context on whether no choices exist or no choices compiled.

Test Plan:
NFC. Depending on CI.

Rollback Plan:

Differential Revision: D82312035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162814
Approved by: https://github.com/mlazos
2025-09-13 00:45:50 +00:00
ddc5107601 An improved heuristic for operator reordering for peak memory + debugging logs (#161810)
Revisiting the idea in https://github.com/pytorch/pytorch/pull/140195

For the lpmf algorithm in the memory reorder pass, in some cases, when all the nodes that can be scheduled are quite large, it is beneficial to switch the scheduling strategy. So instead of using size as the criterion, we choose a node that can unlock more nodes to become schedulable by analyzing their successor nodes.

For an internal use case, we observe up to 20 GiB memory difference and here are the before and after memory snapshot. More information can be found in [D81270682](https://www.internalfb.com/diff/D81270682) (internal only).

<img width="348" height="227" alt="image" src="https://github.com/user-attachments/assets/fb71e840-1508-44ed-bc9d-5eb4d364607d" />

In addition, this adds the functionality to upload the graph to tlparse for offline debugging. The format of the JSON is consistent with the simulator [here](https://fburl.com/code/3l3d3qi4) (internal only).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161810
Approved by: https://github.com/yf225
2025-09-13 00:42:32 +00:00
a94ddd9b00 [OpenReg] Fix the docs of Accelerator Intergration (#162826)
----

- Fixed the redirect link about step 1
- Formatted the autoload and added necessary links
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162826
Approved by: https://github.com/albanD
ghstack dependencies: #161917, #161918, #160101
2025-09-12 23:53:17 +00:00
29f84b0f61 [OpenReg] Improve the Event and Stream capabilities of DeviceGuardImplInterface (#160101)
**Changes:**

- Based on `OpenRegStream` and `OpenRegEvent`, we improve the implementation of Device Guard for `OpenReg`
- Add some related testcases
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160101
Approved by: https://github.com/albanD
ghstack dependencies: #161917, #161918
2025-09-12 23:53:17 +00:00
27daa6af6a [OpenReg] Strengthen Openreg's execution limits to minimize the waste of computing resources (#161918)
Currently, OpenReg supports Linux, Windows, and OS X, ensuring stability and ease of integration with third-party devices across all three platforms. It also doesn't rely on any other accelerators (such as CUDA or MPS).

Therefore, to minimize computational resource usage, `test_openreg` can be added to certain BLOCKLISTS to prevent its execution, limiting OpenReg's execution to only necessary scenarios.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161918
Approved by: https://github.com/albanD
ghstack dependencies: #161917
2025-09-12 23:53:17 +00:00
9b429846e8 [OpenReg] Migrate OpenReg Tests from tests/test_openreg.py into torch_openreg/tests (#161917)
**Background:**

Almost all the tests in `test/test_openreg.py` are designed for `torch_openreg`, so placing these testcases in the test directory is not a good idea. Instead, they should be moved to the `tests` directory under `torch_openreg`, coordinating these tests with their corresponding functional logic.

**How to do:**

So how do we verify the quality of the third-party device integration mechanism?
We will maintain a `test_openreg` entrypoint in `test/run_test.py`.

This entrypoint will install `torch_openreg` and run all the testcases located in `torch_openreg`. As long as all testcases pass, we can guarantee that the out-of-tree backend integration mechanism is available.

**Next:**

We will also improve `torch_openreg's` test coverage in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161917
Approved by: https://github.com/albanD
2025-09-12 23:53:17 +00:00
cdfa298a3b Revert "[MTIA Runtime] Add foreach_div ops to native_functions.yaml (#162732)"
This reverts commit a3f01f6418667f791f36d928f7e912eb89be2e67.

Reverted https://github.com/pytorch/pytorch/pull/162732 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162732#issuecomment-3287163750))
2025-09-12 23:52:43 +00:00
d25c35d2b2 [MPS] Fix [nan]median output for empty tensors (#162846)
It should be `NaN` rather than 0

Added respective checks to `test_empty_tensor`
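
The expected behavior after the fix, for reference (assumes an MPS-capable machine):

```python
import torch

x = torch.empty(0, device="mps")
print(torch.median(x))     # tensor(nan, device='mps:0') instead of 0
print(torch.nanmedian(x))  # tensor(nan, device='mps:0')
```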

Fixes https://github.com/pytorch/pytorch/issues/162798
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162846
Approved by: https://github.com/dcci
2025-09-12 22:26:29 +00:00
ee53ad2dd0 xpu: test py_limited_api with SyclExtension (#162546)
The commit extends the existing CUDA test to cover the XPU SyclExtension case for the same feature, `py_limited_api`. It required a fix for XPU to install some ATen header files (#145902), which was resolved after the merge of #159621.

See: https://github.com/pytorch/pytorch/issues/145902
Requires: https://github.com/pytorch/pytorch/pull/159621
Requires: https://github.com/intel/torch-xpu-ops/pull/1743

CC: @guangyey, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162546
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/janeyx99
2025-09-12 21:57:01 +00:00
0dcd9304aa fix high=0 bug in nll_loss test (#162763)
Minor bug fix for the `nll_loss` test.
Before this PR, it ran `torch.randint(high=0)`, which fails because it would try to generate a number x with low <= x < high, i.e. x >= 0 and x < 0.

The test did not fail on CPU because that line is never reached: the test fails earlier due to an unsupported dtype.
However, as we support TPUs at Google, this line is reached before the dtype check, which triggers the bug.

To my understanding, these OpInfos should be general enough to support different hardware.
Fixing this obvious bug makes them more general across different hardware.
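
The failing call in isolation, for reference (the exact error type and message may vary by backend):

```python
import torch

try:
    torch.randint(high=0, size=(3,))  # needs low (default 0) < high, so high=0 is invalid
except Exception as e:
    print(type(e).__name__, e)
```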

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162763
Approved by: https://github.com/soulitzer
2025-09-12 21:48:18 +00:00
25f1a5d8d1 [inductor][ez] add src_hash property for Templates (#161468)
# why

enable caching/overriding/filtering based on src hash later

# what

- KernelTemplate has a src_hash that is None by default
- sha256 on TritonTemplate of the template src code
- None on ExternKernelChoice to have same API

# testing

n/a (not in use in this change)

Differential Revision: [D81821149](https://our.internmc.facebook.com/intern/diff/D81821149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161468
Approved by: https://github.com/eellison
ghstack dependencies: #161351, #161350, #162293
2025-09-12 21:10:45 +00:00
269c9907a0 [inductor][choices] rename get_mm_configs to get_template_configs (#162293)
# why

- eventually we want all templates to go through this
- we're exposing this through diode as a sort of interface/API
- avoid later renaming

# what

- rename get_mm_configs to get_template_configs
- rename _finalize_mm_configs to _finalize_template_configs

# testing

- lintrunner
- ci

Differential Revision: [D81820641](https://our.internmc.facebook.com/intern/diff/D81820641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162293
Approved by: https://github.com/eellison
ghstack dependencies: #161351, #161350
2025-09-12 21:10:45 +00:00
a326ef37e6 [inductor] leverage template stacking in V.choices.get_mm_configs (#161350)
# why

- now everything is in place to just gather templates and run
  the V.choices.get_mm_configs once per op
- enables any overrides inside V.choices.get_mm_configs to
  have a full view of the options for an op, not just for
  one template

# what

- replace multiple calls to V.choices.get_mm_configs with
  calls to gather the active templates, and then using those
  in a single call

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520571](https://our.internmc.facebook.com/intern/diff/D81520571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161350
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161351
2025-09-12 21:10:38 +00:00
cdb2d1838a [inductor] FlexibleLayout for ExternKernelChoice for mms (#161351)
# why

- if we only use ExternKernelChoice we're not doing any codegen
- if we're not doing any codegen, we can use a FlexibleLayout
  here, and provide deeper passes more chances to change it

# what

- if all the kernel template choices (KTC) are with a ExternKernelChoice
  template, we switch to a FlexibleLayout before generating the choice
- add a test to make sure that works as intended (FlexibleLayout for
  only extern, and FixedLayout if Triton is involved)

- caveats:
    - because CPP, CUTLASS, and CK are not using
       V.choices.get_mm_configs yet, we turn off the optimization
       if either of those backends are in use. This will be relaxed
       once they support this too
    - because Triton templates are still using their own calls
       (not a single call) to get_mm_configs, it's also turned
       off there. The next diff unifies Triton + ATEN to a single
       call to get_mm_configs and that in turn allows the optimization
       there too

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520584](https://our.internmc.facebook.com/intern/diff/D81520584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161351
Approved by: https://github.com/eellison, https://github.com/jansel
2025-09-12 21:10:31 +00:00
f7ea4975ab update the baseline data for the operator benchmark (#162693)
According to the results of the last four operator benchmark runs, we found that five models achieved more than a 30% improvement compared to the baseline. Therefore, we will update the operator benchmark baseline data.
We use the average results from the four runs as the new baseline for the five models.

We also add a pull request trigger for the operator benchmark workflow.

| Benchmarking Framework | Benchmarking Module Name | Case Name | tag | run_backward | baseline (old) | r1 | r2 | r3 | r4 | avg | speedup |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| PyTorch | add | add_M1_N1_K1_cpu | short | FALSE | 3.9497 | 2.57 | 2.54 | 2.38 | 2.31 | 2.45 | 1.61 |
| PyTorch | functional.hardtanh | functional.hardtanh_dims(512,512)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 67.118 | 50.02 | 49.80 | 46.78 | 48.94 | 48.88 | 1.37 |
| PyTorch | relu6 | relu6_dims(512,512)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 68.739 | 51.17 | 51.19 | 48.07 | 50.42 | 50.21 | 1.37 |
| PyTorch | relu6 | relu6_dims(256,1024)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 69.1875 | 51.97 | 52.77 | 50.00 | 51.24 | 51.50 | 1.34 |
| PyTorch | functional.hardtanh | functional.hardtanh_dims(256,1024)_contigFalse_inplaceFalse_dtypetorch.quint8 | short | FALSE | 67.436 | 50.98 | 51.69 | 49.06 | 49.87 | 50.40 | 1.34 |

@chuanqi129 @huydhn @desertfire @jainapurva

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162693
Approved by: https://github.com/huydhn
2025-09-12 20:53:29 +00:00
65d642d6db [ROCm] enable aoti tests, forward fix 162353 (#162827)
Forward fix for tests added by #162353.  Enables aoti tests on rocm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162827
Approved by: https://github.com/dolpm, https://github.com/huydhn
2025-09-12 20:05:50 +00:00
fa4d5e76ea [Inductor] Fix ComboKernels failing due to missing helper functions (#162759)
Fixes: #162756

Differential Revision: [D82257359](https://our.internmc.facebook.com/intern/diff/D82257359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162759
Approved by: https://github.com/eellison, https://github.com/mlazos
2025-09-12 20:01:06 +00:00
38afeb2ba2 Fix markdown link syntax in graph breaks index (#162400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162400
Approved by: https://github.com/Skylion007
2025-09-12 19:29:49 +00:00
53b8bdb977 [MPS] enable cat op for sparse (#162007)
Enable cat op for sparse on MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162007
Approved by: https://github.com/malfet
2025-09-12 19:07:39 +00:00
cad052423b [triton] Update 3.5 pin to 5ae38bdb0dc066c5823e34dc9797afb9de42c866 (#162821)
Include @aakhundov's sam_fast patch, plus NVIDIA's sm88/sm110 patches (thanks @nWEIdia)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162821
Approved by: https://github.com/atalman
2025-09-12 18:34:22 +00:00
b5f4a7dc14 Revert "[DeviceMesh] Make CuTe layout as mesh layout to be ready for using in DeviceMesh (#162414)"
This reverts commit 195ac549d7d6538c4212ca73f69488e990b9527d.

Reverted https://github.com/pytorch/pytorch/pull/162414 on behalf of https://github.com/malfet due to Looks like it broke test_circular_deps on Windows, see d89189f289/1 ([comment](https://github.com/pytorch/pytorch/pull/162414#issuecomment-3286070938))
2025-09-12 16:57:09 +00:00
d89189f289 Fix inconsistent clock types in ProcessGroupNCCL::runHookLoop (#162543)
## Summary
This PR fixes an inconsistency in `ProcessGroupNCCL::runHookLoop` when computing `timeStarted`. Both `timeFinished` and `timeStarted` in `WorkInfo` are expected to use `std::chrono::system_clock`, but previously the code was casting a duration from `steady_clock`.

Reviewers suggested using `steady_clock` consistently for time measurement since it is appropriate for durations (see #153135 ). This PR updates both `timeStarted` and `timeFinished` in `WorkInfo`, and corresponding code in `runHookLoop`, to use `std::chrono::steady_clock`.

## Error message:
```
libcxx/include/__memory/allocator_traits.h:302:5: error: no matching function for call to '__construct_at'
  302 |     std::__construct_at(__p, std::forward<_Args>(__args)...);
      |     ^~~~~~~~~~~~~~~~~~~
libcxx/include/__memory/shared_ptr.h:162:33: note: in instantiation of function template specialization 'std::allocator_traits<std::allocator<c10d::WorkInfo>>::construct<c10d::WorkInfo, c10d::OpType, unsigned long, std::chrono::time_point<std::chrono::system_clock, std::chrono::duration<long long, std::ratio<1, 1000000000>>> &, std::chrono::time_point<std::chrono::system_clock> &, std::chrono::duration<float, std::ratio<1, 1000>>, 0>' requested here
  162 |     allocator_traits<_TpAlloc>::construct(__tmp, __get_elem(), std::forward<_Args>(__args)...);
      |                                 ^
libcxx/include/__memory/shared_ptr.h:736:51: note: in instantiation of function template specialization 'std::__shared_ptr_emplace<c10d::WorkInfo, std::allocator<c10d::WorkInfo>>::__shared_ptr_emplace<c10d::OpType, unsigned long, std::chrono::time_point<std::chrono::system_clock, std::chrono::duration<long long, std::ratio<1, 1000000000>>> &, std::chrono::time_point<std::chrono::system_clock> &, std::chrono::duration<float, std::ratio<1, 1000>>, std::allocator<c10d::WorkInfo>, 0>' requested here
  736 |   ::new ((void*)std::addressof(*__guard.__get())) _ControlBlock(__a, std::forward<_Args>(__args)...);
      |                                                   ^
libcxx/include/__memory/shared_ptr.h:744:15: note: in instantiation of function template specialization 'std::allocate_shared<c10d::WorkInfo, std::allocator<c10d::WorkInfo>, c10d::OpType, unsigned long, std::chrono::time_point<std::chrono::system_clock, std::chrono::duration<long long, std::ratio<1, 1000000000>>> &, std::chrono::time_point<std::chrono::system_clock> &, std::chrono::duration<float, std::ratio<1, 1000>>, 0>' requested here
  744 |   return std::allocate_shared<_Tp>(allocator<__remove_cv_t<_Tp> >(), std::forward<_Args>(__args)...);
      |               ^
torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2674:32: note: in instantiation of function template specialization 'std::make_shared<c10d::WorkInfo, c10d::OpType, unsigned long, std::chrono::time_point<std::chrono::system_clock, std::chrono::duration<long long, std::ratio<1, 1000000000>>> &, std::chrono::time_point<std::chrono::system_clock> &, std::chrono::duration<float, std::ratio<1, 1000>>, 0>' requested here
 2674 |         onCompletionHook_(std::make_shared<WorkInfo>(
      |                                ^
libcxx/include/__memory/construct_at.h:44:58: note: candidate template ignored: substitution failure [with _Tp = c10d::WorkInfo, _Args = <c10d::OpType, unsigned long, std::chrono::time_point<std::chrono::system_clock, std::chrono::duration<long long, std::ratio<1, 1000000000>>> &, std::chrono::time_point<std::chrono::system_clock> &, std::chrono::duration<float, std::ratio<1, 1000>>>]: no matching constructor for initialization of 'c10d::WorkInfo'
   43 | template <class _Tp, class... _Args, class = decltype(::new(std::declval<void*>()) _Tp(std::declval<_Args>()...))>
      |                                                                                    ~~~
   44 | _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _Tp* __construct_at(_Tp* __location, _Args&&... __args) {
      |                                                          ^
1 error generated.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162543
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-09-12 16:50:42 +00:00
d71a6497b7 Fix typo in ONNX export error message (#162819)
Fix another "summit" 😅

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162819
Approved by: https://github.com/cyyever, https://github.com/titaiwangms
2025-09-12 16:34:49 +00:00
a0dca0fc60 Fix protobuf test comparison by parsing proto instead of raw strings (#162644)
The tests were comparing raw exported strings for protobuf comparison, which is not backward/forward compatible across protobuf versions.

This PR parses the strings into protobufs and compares the protobufs directly, similar to what we did in assertImageProto.

Our test failed because we used a different version of protobuf, which output 44100.0 instead of 44100 and caused a mismatch. The two values are equal; they only differ in the exported strings.
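
A sketch of the new comparison style, assuming text-format `Summary` protos (the message type and helper name are illustrative, not the test's actual code):

```python
from google.protobuf import text_format
from tensorboard.compat.proto.summary_pb2 import Summary

def assert_summary_proto_equal(expected_str: str, actual_str: str) -> None:
    expected, actual = Summary(), Summary()
    text_format.Parse(expected_str, expected)
    text_format.Parse(actual_str, actual)
    # "44100" and "44100.0" parse to the same float field, so this no longer fails
    assert expected == actual
```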

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162644
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-09-12 16:26:54 +00:00
e15686b40d Remove actionable label from docathon label sync script (#155713)
Make sure we don't propagate the actionable label in the docathon label sync script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155713
Approved by: https://github.com/clee2000
2025-09-12 15:36:50 +00:00
1e9ddf510f [ROCm] fix hardsigmoid op (#162758)
On ROCm, `std::min` (hipified to `::min`) did not work as expected when input values are >= 2147483648.
It can be fixed by explicitly typing `std::min<opmath_t>`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162758
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-12 15:07:13 +00:00
7357eb66c5 [ROCm][CI] unskip some test_memory_format tests (#162766)
Fixes #70125.

Much of the work was done by #161687.
This PR is additional test cleanup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162766
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-12 15:02:40 +00:00
03798b0f91 [inductor] Fix removal of constexpr args from the launcher signature (#161924)
Fixes the case described below which occurs when:
- A user `torch.compile`s a function that uses a triton kernel.
- `TORCHINDUCTOR_DUMP_LAUNCH_PARAMS=1` .

Problem:

If the user defined triton kernel is not autotuned:

```python
import os
os.environ["TORCHINDUCTOR_DUMP_LAUNCH_PARAMS"] = "1"
@triton.jit
def kernel(..., BLOCK_SIZE: tl.constexpr):
    ...
@torch.compile
def fn(..)
    kernel[..](..., 128)

fn(..)
```

Then In `triton_heuristics. _interpret_args_grid`, `filter_signature` function:

```python
        def filtered_signature() -> list[str]:
                # constexprs are not passed in as args
                return [
                    x
                    for x in self.triton_meta["signature"].keys()
                    if x not in cfg.kwargs.keys()
                ]
```

because `triton.autotune` is not used on the `triton.jit` function, `cfg` above will be empty, and so `BLOCK_SIZE` will not be removed from the signature even though it is a constexpr and is removed from the arguments that are passed in to `interpret_args_grid`. This results in a mismatch between the number of parameters in the signature and the number of arguments, which leads to the error `NameError: name '_grid_2' is not defined`.

Fix:

Use the triton jit kernel `constexprs` for args to remove.  Not sure if this is a good fix so suggestions are welcome.

Test plan:

Added a parameter to an existing triton kernel to test for this edge case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161924
Approved by: https://github.com/davidberard98
2025-09-12 13:58:09 +00:00
6c334885d4 [RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)
Summary:
Original: D81957844 and D81957923

Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
2025-09-12 10:54:42 +00:00
a7bbc5fea7 [Inductor-FX] Support ScatterFallback (#162686)
# Problem
Inductor has a `ScatterFallback` op with custom Python and C++ wrapper codegen macros. This is used in certain situations where the default Triton codegen doesn't apply, and especially for reductions which need to be deterministic. Since this op used direct Python/C++ codegen, it wasn't compatible with the FX backend.

# Feature
This PR refactors the associated wrapper codegen to support `ScatterFallback`. This follows the same basic steps that were used for other fallback ops including `MultiOutput` and `ExternKernel`:

1. Create a new wrapper IR op called `ScatterFallbackLine`. Move the logic in `ScatterFallback.codegen` to `ScatterFallbackLine.codegen`, to prevent it from affecting the FX backend. This logic is unsafe for FX because it may generate Python or C++ strings with methods like `codegen_reference()`.
2. To eliminate the dependence on `V.graph`, move language-specific logic to the respective wrapper codegen subclasses. In this case, C++ codegen has some special logic, which is moved to `CppWrapperCpu`.
3. Create a new method in `FXWrapperCodegen` to handle `ScatterFallbackLine`.

# Test plan
Added a couple of CI tests for the FX backend with scatter fallbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162686
Approved by: https://github.com/jansel
2025-09-12 08:41:50 +00:00
98e9440f30 [1/N] Port 5 _composable/fsdp distributed test cases to Intel GPU (#159118)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU.
We enable Intel GPU with the following methods, trying our best to keep the original code style:

- Use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- Enable XPU for some test paths
- Skip some test cases which Intel GPU does not support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159118
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-12 08:36:20 +00:00
66c0f14ecc Support XPU in --nproc-per-node option to torchrun (#159474)
Support both --nproc-per-node=xpu and autodetection of XPU device in case of --nproc-per-node=auto

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159474
Approved by: https://github.com/tsocha, https://github.com/guangyey, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-09-12 08:32:04 +00:00
972e409829 [Reland] Use std::string_view in torchgen (#158625)
Reland of #157050, which is incidentally closed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158625
Approved by: https://github.com/albanD
2025-09-12 08:31:54 +00:00
52af91e4c1 [ROCm/Windows] Support load_inline on windows (#162577)
Supports `torch.utils.cpp_extension.load_inline` on Windows with ROCm.
Tested on Windows with gfx1201.

Note that it currently only works when CC and CXX are set to `clang-cl`. This is also needed when building extensions via `setuptools`, due to linker errors when using `cl` directly.
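
A minimal `load_inline` smoke test (plain C++ for brevity; the kernel and names are illustrative, and the PR's point is that this path now also works on Windows + ROCm when CC/CXX point to `clang-cl`):

```python
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = """
#include <torch/extension.h>
torch::Tensor add_one(torch::Tensor x) { return x + 1; }
"""

ext = load_inline(
    name="add_one_ext",
    cpp_sources=cpp_source,
    functions=["add_one"],  # auto-generates the pybind11 binding
)
print(ext.add_one(torch.ones(3)))  # tensor([2., 2., 2.])
```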

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162577
Approved by: https://github.com/ezyang
2025-09-12 08:10:07 +00:00
179f10621b port some distributed tensor test files for Intel GPU (#161703)
This is another PR to port distributed tensor tests to Intel GPU; the other PR is https://github.com/pytorch/pytorch/pull/161604
We enable Intel GPU with the following methods, trying our best to keep the original code style:

- Use torch.accelerator for general GPU support
- Skip the cases that have known issues when running on XPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161703
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-12 07:57:32 +00:00
195ac549d7 [DeviceMesh] Make CuTe layout as mesh layout to be ready for using in DeviceMesh (#162414)
We create a wrapper class acting as a layout for device mesh so that we can add new methods more specific to DeviceMesh and keep the core logic of CuTe manipulation inside the pycute module. This PR creates the main body of the code; the next PR will come with the actual implementation and unit tests for the device mesh layout. (The actual implementation can be found in https://github.com/pytorch/pytorch/pull/161016)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162414
Approved by: https://github.com/ezyang
ghstack dependencies: #162413, #162534
2025-09-12 07:32:56 +00:00
636a511084 [aoti] add config for libtorch free so (#162655)
Users can specify the following to get a libtorch_free `.so`.

"aot_inductor.use_libtorch": False,

The following config is only used for torchnative (see https://github.com/meta-pytorch/torchnative/pull/110). It's not intended to be used by executorch. The reason we need it for torchnative is that a lot of the symbol definitions in the torchnative repo are only in header files.

"aot_inductor.libtorch_free_header": "/data/users/shangdiy/torchnative/standalone,/data/users/shangdiy/torchnative/" (or their custom headers)

The main motivating use case is for executorch to produce a libtorch free `.so`.

TODO for follow-up PR: this flag should be consolidated with the `compile_standalone` flag.
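
For illustration, a sketch of how these configs might be passed through the AOTInductor options dict (the model and header paths are placeholders, and the exact packaging entry point may differ):

```python
import torch
from torch import nn

class Tiny(nn.Module):
    def forward(self, x):
        return x.relu()

ep = torch.export.export(Tiny(), (torch.randn(2, 8),))
pkg_path = torch._inductor.aoti_compile_and_package(
    ep,
    inductor_configs={
        "aot_inductor.use_libtorch": False,  # produce a libtorch-free .so
        # comma-separated header search paths (placeholder values)
        "aot_inductor.libtorch_free_header": "/path/to/standalone,/path/to/headers",
    },
)
```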

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162655
Approved by: https://github.com/angelayi
2025-09-12 07:31:04 +00:00
75de5b65b4 [Dynamo] Don't guard data ptrs by default with mark_static_address (#162208)
Fixes https://github.com/pytorch/pytorch/issues/156377

Since we now re-record cudagraphs, it's not necessary to guard by default anymore and induce a full recompile.
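
For illustration, a minimal sketch of where this applies (assuming a CUDA device):

```python
import torch

@torch.compile(mode="reduce-overhead")
def step(x, buf):
    return x + buf

buf = torch.zeros(8, device="cuda")
# Mark the buffer's storage address as stable for cudagraphs. With this change
# the data pointer is no longer guarded by default, so swapping the underlying
# storage later triggers a cudagraph re-record instead of a full recompile.
torch._dynamo.mark_static_address(buf)
step(torch.randn(8, device="cuda"), buf)
```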

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162208
Approved by: https://github.com/anijain2305
2025-09-12 07:15:10 +00:00
6b59a19242 Revert "[RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)"
This reverts commit 6e8f17c58029e5fa6bc222b2445ebbc0cbdc17c7.

Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3283985880))
2025-09-12 06:52:03 +00:00
5f66902ecf Fix operator benchmark issue#162708 (#162744)
This PR skips memory metric calculation for ops which don't take tensor input, fixing the operator_benchmark bug

Fixes https://github.com/pytorch/pytorch/issues/162708

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162744
Approved by: https://github.com/huydhn
2025-09-12 06:51:14 +00:00
00e9ba75cd Revert "[indexing] Prevent integer overflow from large step values in C++ (#161707)"
This reverts commit c140bf217f5ca5071ab9dbc1bcf9d4006242f44a.

Reverted https://github.com/pytorch/pytorch/pull/161707 on behalf of https://github.com/huydhn due to Look like there is a land race as lots of jobs are failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/161707#issuecomment-3283980465))
2025-09-12 06:49:36 +00:00
333e546c02 [CUDAGraph][UX] warn many times for rerecording from dynamic shapes (#162696)
Excessive re-recording of CUDAGraphs leads to bad performance. We previously warned once when this happens.

However, the limit (=50) is too high, and users may observe bad performance before actually seeing the warning message. Even worse, users may not see the warning message at all when there are many other logs. @anijain2305 reported that he never saw this warning message when using the transformers library, but he DOES observe slowdown due to cudagraph re-recording & needs to turn off cudagraphs.

#162663 attempts to hard error when re-recording too many times due to dynamic shapes. But it is a bc-breaking change. In fact, the hf-t5-generate model in torchbench failed due to 256 re-recordings.

This PR a) reduces the limit to a smaller value (=8); and b) makes the warning noisier, i.e., warns once for every distinct shape once the limit is reached.
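
For illustration, a small repro-style sketch of the behavior (assuming a CUDA device):

```python
import torch

@torch.compile(mode="reduce-overhead")
def f(x):
    return x.relu()

# Each distinct input shape can force a cudagraph re-record. Once the (now
# lower) limit of 8 is reached, a warning is emitted for every additional
# distinct shape rather than a single, easily-missed warning.
for n in (8, 16, 24, 32, 40, 48, 56, 64, 72, 80):
    f(torch.randn(n, device="cuda"))
```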

Fixes #162299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162696
Approved by: https://github.com/mlazos
2025-09-12 06:38:32 +00:00
f7e8321961 fix cpp extension distributed warning spew (#162764)
With the new change we only log the warning if we're running non-distributed code or if we're in rank 0. Unit testing that certain messages get printed only on certain ranks feels kind of janky, so the test plan is below instead.

Test plan

```python
# torchrun --nproc_per_node=2 demo_fix.py

import os
import logging

logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)

import torch
if 'RANK' in os.environ:
    torch.distributed.init_process_group('nccl')

from torch.utils.cpp_extension import _get_cuda_arch_flags
_get_cuda_arch_flags()

print(f"Rank {os.environ.get('RANK', '0')} done")
```

Logs showing how `TORCH_CUDA_ARCH_LIST` only shows up once if we explicitly set the logging level to `logging.DEBUG`. It also improves the debug message to explain what the actual behavior will be.

```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py

W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:30:16.594000 1315439 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
[rank0]:V0911 18:30:18.921000 1316753 pytorch/torch/utils/cpp_extension.py:2444] TORCH_CUDA_ARCH_LIST is not set, using TORCH_CUDA_ARCH_LIST='10.0+PTX' for visible GPU architectures. Set os.environ['TORCH_CUDA_ARCH_LIST'] to override.
Rank 0 done
Rank 1 done
```

But if we just use the default and comment out `logging.getLogger('torch.utils.cpp_extension').setLevel(logging.DEBUG)`

Then we get

```
(source) [marksaroufim@devgpu005]~% torchrun --nproc_per_node=2 demo_fix.py
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814]
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0911 18:14:33.926000 690759 /home/marksaroufim/pytorch/torch/distributed/run.py:814] *****************************************
Rank 0 done
Rank 1 done
(source) [marksaroufim@devgpu005]~%
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162764
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-12 06:12:46 +00:00
30e16d6389 [nativert] aoti (#162353)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162353
Approved by: https://github.com/yiming0416
2025-09-12 05:56:25 +00:00
28e8531032 Add api info for torch._C._nn.pyi (#162361)
Fix part of #148404

APis involved are as followed:

- im2col
- l1_loss
- mish
- mish_
- mse_loss
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162361
Approved by: https://github.com/ezyang
2025-09-12 05:56:22 +00:00
0babdfad63 [1/N] Port 6 fsdp distributed test cases to Intel GPU (#160158)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU.
We enable Intel GPU with the following methods and try our best to keep the original code style:

- instantiate_device_type_tests()
- Use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- Enable XPU for some test paths
- Add allow_xpu=True for supported test classes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160158
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-12 05:52:08 +00:00
561430edcd [CuTe] Add type for CuTe layout via claude (#162534)
This PR is mostly a cosmetic change using Claude to add types for the copied PyCute code.
We removed all linter suppressions and added type checking, type aliases, and mypy ignores (where needed) so that the pycute code is checked by the linters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162534
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #162413
2025-09-12 04:59:21 +00:00
79d2418b5a [inductor] Add FLOAT_IS_NAN and COMPLEX_IS_NAN guards (#162537)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162537
Approved by: https://github.com/anijain2305, https://github.com/mlazos
ghstack dependencies: #162528
2025-09-12 04:32:46 +00:00
5dd84559a5 [dynamo] Add DUAL_LEVEL_MATCH C++ guard (#162528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162528
Approved by: https://github.com/anijain2305
2025-09-12 04:32:46 +00:00
5dd14f0b65 [CuTe] Copy code from pycute for device mesh bookkeeping (#162413)
We copied the whole module and its unit test into the PyTorch codebase (https://github.com/NVIDIA/cutlass/blob/main/python%2Fpycute%2Flayout.py).

We changed the indentation of the code from 2 spaces to 4 spaces and added lint suppressors to make mypy happy.

We also needed to change the unit test to include ownership and use `run_tests`/`TestCase` so that the test gets picked up by CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162413
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-09-12 04:28:03 +00:00
95191522e0 [OpenReg] Implement device autoload mechanism (#158555)
# Implement OpenReg device autoload mechanism

## Overview
The **Autoload** mechanism in PyTorch simplifies the integration of third-party device backends by enabling automatic discovery and initialization at runtime. Traditionally, integrating a new backend required explicit imports or manual initialization, which could be cumbersome and error-prone. With Autoload, PyTorch dynamically detects and initializes device backends, providing a seamless user experience.

This mechanism leverages Python entry points (e.g., `torch.backends`) and dynamic module loading. When PyTorch starts, it scans for registered entry points and invokes their initialization hooks, ensuring that all available backends are ready for use without requiring explicit imports.
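
For illustration, a minimal sketch of how an out-of-tree backend can register itself for autoload (the package layout and init hook name are hypothetical):

```python
from setuptools import setup

setup(
    name="torch_openreg",
    packages=["torch_openreg"],
    entry_points={
        # PyTorch scans this entry-point group at startup and invokes the hook,
        # so users no longer need to `import torch_openreg` explicitly.
        "torch.backends": [
            "torch_openreg = torch_openreg:_autoload",
        ],
    },
)
```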

## Motivation

This PR aims to apply [device autoload mechanism](https://github.com/pytorch/pytorch/issues/122468) to the OpenReg module with some simple changes.

## Change
### Before
```python
import torch
import torch_openreg

x = torch.tensor([1, 2, 3], device="openreg")
print(x)
```
### After
```python
import torch

# No need to import torch_openreg manually!
x = torch.tensor([1, 2, 3], device="openreg")
print(x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158555
Approved by: https://github.com/FFFrog, https://github.com/albanD

Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
2025-09-12 04:24:11 +00:00
da954f10d6 Bump protobuf from 5.29.4 to 5.29.5 in /.github/requirements (#160844)
Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 5.29.4 to 5.29.5.
<details>
<summary>Commits</summary>
<ul>
<li><a href="f5de0a0495"><code>f5de0a0</code></a> Updating version.json and repo version numbers to: 29.5</li>
<li><a href="85637662f7"><code>8563766</code></a> Merge pull request <a href="https://redirect.github.com/protocolbuffers/protobuf/issues/21858">#21858</a> from shaod2/py-cp-29</li>
<li><a href="05ba1a8104"><code>05ba1a8</code></a> Add recursion depth limits to pure python</li>
<li><a href="1ef3f01c46"><code>1ef3f01</code></a> Internal pure python fixes</li>
<li><a href="69cca9b7f5"><code>69cca9b</code></a> Remove fast-path check for non-clang compilers in MessageCreator. (<a href="https://redirect.github.com/protocolbuffers/protobuf/issues/21612">#21612</a>)</li>
<li><a href="21fdb7acdb"><code>21fdb7a</code></a> fix: contains check segfaults on empty map (<a href="https://redirect.github.com/protocolbuffers/protobuf/issues/20446">#20446</a>) (<a href="https://redirect.github.com/protocolbuffers/protobuf/issues/20904">#20904</a>)</li>
<li><a href="03c50e3874"><code>03c50e3</code></a> Re-enable aarch64 tests. (<a href="https://redirect.github.com/protocolbuffers/protobuf/issues/20853">#20853</a>)</li>
<li><a href="128f0aafd9"><code>128f0aa</code></a> Add volatile to featuresResolved (<a href="https://redirect.github.com/protocolbuffers/protobuf/issues/20767">#20767</a>)</li>
<li><a href="bdd49bb141"><code>bdd49bb</code></a> Merge pull request <a href="https://redirect.github.com/protocolbuffers/protobuf/issues/20755">#20755</a> from protocolbuffers/29.x-202503192110</li>
<li><a href="c65946848f"><code>c659468</code></a> Updating version.json and repo version numbers to: 29.5-dev</li>
<li>See full diff in <a href="https://github.com/protocolbuffers/protobuf/compare/v5.29.4...v5.29.5">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=protobuf&package-manager=pip&previous-version=5.29.4&new-version=5.29.5)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160844
Approved by: https://github.com/msaroufim

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-09-12 04:23:03 +00:00
d959eb02cb [audio hash update] update the pinned audio hash (#162752)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162752
Approved by: https://github.com/pytorchbot
2025-09-12 04:18:54 +00:00
62f044e260 Bump setuptools from 72.1.0 to 78.1.1 in /.github/requirements (#162701)
Bumps [setuptools](https://github.com/pypa/setuptools) from 72.1.0 to 78.1.1.
- [Release notes](https://github.com/pypa/setuptools/releases)
- [Changelog](https://github.com/pypa/setuptools/blob/main/NEWS.rst)
- [Commits](https://github.com/pypa/setuptools/compare/v72.1.0...v78.1.1)

---
updated-dependencies:
- dependency-name: setuptools
  dependency-version: 78.1.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-09-11 21:03:27 -07:00
2335f90414 [ONNX] Support enable_gqa when dropout is non-zero (#162771)
Fixes #162258
Related to https://github.com/microsoft/onnxscript/pull/2558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162771
Approved by: https://github.com/justinchuby
2025-09-12 04:00:57 +00:00
6e8f17c580 [RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)
Summary:
Original: D81957844 and D81957923

Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
2025-09-12 03:56:18 +00:00
31345fb4f7 Make functorch notebook symlinks PEP 517 valid (#157813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157813
Approved by: https://github.com/zou3519, https://github.com/atalman
2025-09-12 03:52:08 +00:00
872ed60679 [mxfp8 torch._scaled_grouped_mm] fix meta registration for 3d tensor (#162765)
The meta registration check for torch._scaled_grouped_mm has a bug for 3d "B" tensors. Namely, the scale for such a tensor should be 2d with shape (G, blocked_K * blocked_N), but it currently enforces an expected 3d shape of (G, blocked_K, blocked_N).

See Blas.cpp for correct validation logic [here](8e217a9f6d/aten/src/ATen/native/cuda/Blas.cpp (L1622)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162765
Approved by: https://github.com/ngimel
2025-09-12 03:51:52 +00:00
e8eeb06034 Move inductor jobs 3.9->3.10 (#162323)
Related to: https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162323
Approved by: https://github.com/huydhn, https://github.com/Skylion007

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-12 03:43:06 +00:00
3cd734584d bring back the old vllm's use_existing_torch.py (#162747)
vLLM's PR will override our dependencies for torch.

Quick fix to add back use_existing_torch.py; syncing with vLLM now regarding the uv approach they have.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162747
Approved by: https://github.com/huydhn
2025-09-12 03:41:39 +00:00
222ec8d28e Revert "AMD CPU CI - Add freezing + fix label trigger (#162176)"
This reverts commit 9cac1b92595ec7836101d51dbe1415081042c7a0.

Reverted https://github.com/pytorch/pytorch/pull/162176 on behalf of https://github.com/huydhn due to Sorry for reverting this but hardcoding the input online 122 does not make sense ([comment](https://github.com/pytorch/pytorch/pull/162176#issuecomment-3283532452))
2025-09-12 03:39:13 +00:00
c140bf217f [indexing] Prevent integer overflow from large step values in C++ (#161707)
Fixes https://github.com/pytorch/pytorch/issues/160868
hmmm, I found an existing fix PR after I've finished this one. For reference, the old PR was
https://github.com/pytorch/pytorch/pull/147433/files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161707
Approved by: https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/mlazos
2025-09-12 03:16:23 +00:00
7eb92b076f [Inductor][FP8] Validate exhaustive autotuning for FP8 Inductor templates (#162678)
Summary: Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Before, exhaustive autotuning defaulted to a limited set of autotuning configs, as limitations for exhaustively autotuning on FP8 shapes had not been tested.

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_SEARCH_SPACE=DEFAULT buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --input-loader=/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/json_files/rowwise_ptma_0.json --output="/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0_bench.csv" --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0.log
```

Rollback Plan:

Differential Revision: D82174075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162678
Approved by: https://github.com/coconutruben
2025-09-12 02:12:33 +00:00
ccb450b190 [pre_compile] Add check for cuda and hardware version (#162438)
If we detect that the compiled model is using CUDA in a meaningful way, we should store information about the CUDA version + hardware.

 Example: `SystemInfo(python_version='3.12.9', torch_version='2.9.0a0+gite02b0e6', cuda_version='12.6', triton_version=(3, 4), gpu_name='NVIDIA PG509-210')`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162438
Approved by: https://github.com/zhxchen17
2025-09-12 01:42:07 +00:00
ae97eb86f7 Reland "Fix conv exhaustive autotuning and expand Exhaustive test coverage" (#161957)
reland https://github.com/pytorch/pytorch/pull/159387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161957
Approved by: https://github.com/coconutruben
2025-09-12 01:36:43 +00:00
7a9c4d794c [BUG]Fixed handle cannot be hit in the cache in the IPC ExpandableSegment (#161885)
Fixed the bug where the handle cannot be hit in the ipcMemHandle_to_devptr cache in the IPC scenario of ExpandableSegment.

Fixes #161884

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161885
Approved by: https://github.com/albanD
2025-09-12 01:09:17 +00:00
501e19137a fix var args for shape guards (#162633)
Summary: Fixes #162599

Test Plan:
added test based on repro

Rollback Plan:

Differential Revision: D82144520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162633
Approved by: https://github.com/tugsbayasgalan
2025-09-12 00:33:35 +00:00
4a757e1e17 [ROCm] Support torch.cuda._compile_kernel (#162510)
Supports `torch.cuda._compile_kernel` on ROCm. Related to https://github.com/pytorch/pytorch/pull/151484
Tested on Windows with gfx1201. Testing on Linux pending.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162510
Approved by: https://github.com/mycpuorg, https://github.com/msaroufim
2025-09-12 00:18:47 +00:00
563921619b Fix the regression issue caused by non-arrch64 platforms not hitting the MKLDNN path. (#162168)
This issue was introduced by the commit in issue #161065. Added an extra check to provide a proper path for other platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162168
Approved by: https://github.com/mingfeima, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-12 00:17:08 +00:00
84d8ec73f1 [CD] Build Mac wheels using setup-python action (#162136)
The biggest difference between the conda and homebrew CPython builds and the one from python.org is that the latter are universal binaries, and they always try to build universal extensions...

Work around the many universal binary build attempts by explicitly specifying both `_PYTHON_PLATFORM` and `--plat-name` as well as `ARCH_FLAGS`.

Suppressed the actionlint warning on use of the `freethreaded` flag, which is documented in https://github.com/actions/setup-python/tree/v5

TODO: Remove lots of temporary workarounds when `3.14` is out in October 2025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162136
Approved by: https://github.com/atalman, https://github.com/huydhn
ghstack dependencies: #162297, #162265
2025-09-12 00:16:31 +00:00
a956066b4e [ROCm] Define uint32 t when ROCM_VERSION >= 70000 (#160587)
This PR fixes the errors like below:
```
[rank3]: RuntimeError: The following operation failed in the TorchScript interpreter.
[rank3]: Traceback of TorchScript (most recent call last):
[rank3]: RuntimeError: /tmp/comgr-28f951/input/CompileSourceACC062:67:7: error: unknown type name 'uint32_t'; did you mean '__hip_internal::uint32_t'?
[rank3]:    67 |       uint32_t int32;
[rank3]:       |       ^~~~~~~~
[rank3]:       |       __hip_internal::uint32_t
```
Earlier, uint32_t was defined in the HIP headers in the std namespace. It has now been moved to the __hip_internal namespace in the HIP headers. This change was made in ROCm 7.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160587
Approved by: https://github.com/jeffdaily
2025-09-12 00:13:26 +00:00
ff6870d134 [BE][flex attention] compute RMSE in float64 (#162088)
I saw a failure where the reference error was 0.0, and the compiled error was 0.035. Although the failure still occurs with or without this change, it was confusing to see RMSE of 0.0.
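
For reference, a minimal sketch of computing the RMSE with float64 accumulation (not the exact test helper):

```python
import torch

def rmse(ref: torch.Tensor, actual: torch.Tensor) -> torch.Tensor:
    # Upcast before subtracting so a tiny-but-nonzero reference error is not
    # rounded away and reported as exactly 0.0.
    diff = ref.double() - actual.double()
    return diff.pow(2).mean().sqrt()
```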

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162088
Approved by: https://github.com/drisspg
2025-09-11 23:53:31 +00:00
92f9ed7ac3 Revert "[2/N]Port several test files under test/distributed to Intel GPU (#159473)"
This reverts commit fa1d409e83af93425a2672d62e134e8f20c5ccc0.

Reverted https://github.com/pytorch/pytorch/pull/159473 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break an distributed tests ([comment](https://github.com/pytorch/pytorch/pull/159473#issuecomment-3282999084))
2025-09-11 23:51:21 +00:00
8e217a9f6d [precompile] Fix issues with guard serialization on distributed types. (#162418)
Summary: Add more support for torch internal distributed data structures.

Test Plan:
test_guard_serialization.py

Rollback Plan:

Differential Revision: D81927732

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162418
Approved by: https://github.com/dolpm
2025-09-11 23:09:55 +00:00
429052f151 fix: raise value error on init ParametrizationList if original.device != new.device (#162717)
Raise a ValueError when initializing `ParametrizationList` if `original.device != new.device`.
Currently `_maybe_set` will throw the error below in such situations, which I think is not convenient to debug.

```
[rank1]: RuntimeError: Attempted to set the storage of a tensor on device "cuda:1" to a storage on different device "cpu".  This is no longer allowed; the devices must match.
```
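
For illustration, a sketch of the situation this guards against (assumes a second device is available; the parametrization class is hypothetical):

```python
import torch
from torch import nn
from torch.nn.utils import parametrize

class MoveDevice(nn.Module):
    # A parametrization whose output lives on a different device than the
    # original parameter.
    def forward(self, X):
        return X.to("cuda:1")

lin = nn.Linear(4, 4)  # weight lives on CPU
# With this change, registration fails early with a clear ValueError about the
# device mismatch instead of the opaque storage error quoted above.
parametrize.register_parametrization(lin, "weight", MoveDevice())
```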
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162717
Approved by: https://github.com/lezcano
2025-09-11 23:07:58 +00:00
a3f01f6418 [MTIA Runtime] Add foreach_div ops to native_functions.yaml (#162732)
Summary: Quick fix for runtime support on foreach_div, see D81274963. Fixed an issue that I created in that diff so that the CIs pass.

Test Plan:
CIs created in D81274963 and D81286593 pass.

Added some logs in [aten_mtia_ops.py](https://www.internalfb.com/code/fbsource/[c56272ba042c43c65517dcac254364cf732fcfa9]/fbcode/mtia/host_runtime/torch_mtia/aten_mtia_ops.cpp?lines=3676) to all the foreach_div ops. We can see that the correct MTIA kernels are being invoked in the tests.
https://www.internalfb.com/intern/testinfra/testrun/15481123829281588

Rollback Plan:

Differential Revision: D82161434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162732
Approved by: https://github.com/danielhou0515
2025-09-11 22:47:03 +00:00
62843c14bb [ROCm/Windows] Support aotriton for scaled_dot_product_attention on Windows. (#162330)
Enables flash attention and/or memory efficient attention on Windows with scaled_dot_product_attention via aotriton.
Already tested to be working on Windows with TheRock.

Steps to enable: simply set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162330
Approved by: https://github.com/xinyazhang, https://github.com/ScottTodd, https://github.com/jeffdaily

Co-authored-by: Scott Todd <scott.todd0@gmail.com>
2025-09-11 22:35:09 +00:00
082d3dd9d5 [Triton] [Inductor] Restrict subprocess autotuning to just Triton (#162688)
Summary: Restricts subprocess benchmarking to only `TritonTemplateCaller`, which is expected by the underlying `target` method. This triggered a bug with large K shapes because decompose-k is a `SubgraphChoiceCaller`.

Test Plan:
mm autotuning with a large k and `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1`

Rollback Plan:

Differential Revision: D82181924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162688
Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
2025-09-11 22:17:57 +00:00
468c1f9e9d Revert "[nn] Assert parsed iterable arguments are an appropriate length (#162340)"
This reverts commit b5e6e58050bd2a15f4173cfffa00c7e32e382b49.

Reverted https://github.com/pytorch/pytorch/pull/162340 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break an MPS tests on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/162340#issuecomment-3282676242))
2025-09-11 21:22:57 +00:00
9614c2eb14 [Triton] [Inductor] Pruned failed compilations from Autotuning candidates (#162673)
Summary:
When exhaustively autotuning a new template you may hit situations that lead to compilation failures. Such a template will still attempt to autotune because nothing was marking it as failed, and in my experiments this led to a crash/segfault if I didn't set `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1`.

To help eliminate this issue this PR marks any template that fails to compile as "failed" and then removes all of the failed templates from the choice candidates. In the case where it would have just failed to compile twice, this should at least reduce compilation time.

Test Plan:
Tested locally when experimenting with the new Blackwell templates and a Triton version that contains a bug related to `num_warps < 4`.

Rollback Plan:

Differential Revision: D82172207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162673
Approved by: https://github.com/PaulZhang12, https://github.com/mlazos
2025-09-11 21:22:36 +00:00
4c6a6c2db9 [Inductor][FP8] Add new scaled_mm and scaled_persistent_mm configs to Inductor FP8 Triton templates (#162699)
Summary:
Add new `scaled_mm` and `scaled_persistent_mm` configs to `template_heuristics.py` for Inductor FP8 Triton templates. These configs are a representative subset of the most performant configs generated from exhaustively autotuning FP8 Triton kernels with per-tensor and per-row scaling.

See this [spreadsheet](https://docs.google.com/spreadsheets/d/1Fal1vhFUJIUcLpM2kJect6IkgeUFvCY-nUr3RTupM_4/edit?gid=1732602731#gid=1732602731) for benchmarks and performance metrics.

Test Plan:
Verify that configs do not error, i.e.
```
CUDA_VISIBLE_DEVICES=0 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only pt2_fp8_gemm --metrics tflops,accuracy --input-loader={input_path} --output="{output_csv}" --atol=1e-2 --rtol=0.5 2>&1 | tee {log_file}
```

Rollback Plan:

Reviewed By: NikhilAPatel, PaulZhang12

Differential Revision: D81651226

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162699
Approved by: https://github.com/PaulZhang12
2025-09-11 21:21:06 +00:00
3ad3bfe11d added example for torch.is_storage (#162614)
Fixes #162613
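
For reference, a small example along the lines of what the PR adds to the docs (not the exact snippet):

```python
import torch

x = torch.randn(3)
print(torch.is_storage(x))                    # False: a Tensor is not a storage
print(torch.is_storage(x.untyped_storage()))  # True
```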

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162614
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-11 20:25:26 +00:00
1c6dfbe557 Revert "[inductor] FlexibleLayout for ExternKernelChoice for mms (#161351)"
This reverts commit f08487aa8692751c36e608e338204490b0955583.

Reverted https://github.com/pytorch/pytorch/pull/161351 on behalf of https://github.com/huydhn due to Check with @coconutruben and the internal failures look real ([comment](https://github.com/pytorch/pytorch/pull/161351#issuecomment-3282511692))
2025-09-11 20:24:15 +00:00
934f878883 Revert "[inductor] leverage template stacking in V.choices.get_mm_configs (#161350)"
This reverts commit 623e623c821f639559248e9acd6084311c8fd3d5.

Reverted https://github.com/pytorch/pytorch/pull/161350 on behalf of https://github.com/huydhn due to Check with @coconutruben and the internal failures look real ([comment](https://github.com/pytorch/pytorch/pull/161351#issuecomment-3282511692))
2025-09-11 20:24:15 +00:00
cef05b1202 Revert "[inductor][choices] rename get_mm_configs to get_template_configs (#162293)"
This reverts commit 30191fcf03ddd6a09381a490096c4bb721874316.

Reverted https://github.com/pytorch/pytorch/pull/162293 on behalf of https://github.com/huydhn due to Check with @coconutruben and the internal failures look real ([comment](https://github.com/pytorch/pytorch/pull/161351#issuecomment-3282511692))
2025-09-11 20:24:15 +00:00
b500c166ef [FlexAttention][Easy] turn off TMA when cannot use it (#162569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162569
Approved by: https://github.com/drisspg
2025-09-11 19:51:19 +00:00
d65ffdef3d [ROCm] fix miopen batchnorm changing output format (#162112)
It was found that the miopen batchnorm integration was causing the output to always be in the default contiguous memory format even when the input was channels-last. This also unskips a number of related unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162112
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-09-11 19:37:48 +00:00
ac72f81c12 [dynamic shapes] unbacked-safe should_swap (#160473)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160473
Approved by: https://github.com/laithsakka
2025-09-11 18:51:25 +00:00
9cac1b9259 AMD CPU CI - Add freezing + fix label trigger (#162176)
Added the following changes:

1. Added freezing by default for AMD CPU based CI
2. Fixed issue with label based CI triggers

Addresses code review comment in https://github.com/pytorch/pytorch/pull/161155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162176
Approved by: https://github.com/malfet, https://github.com/jeffdaily
2025-09-11 18:41:29 +00:00
9bc648235d [MPS] mps sparse mul op implementation (#162349)
Implements the MPS sparse mul operation and enables other operations such as:
1. copy_
2. div
3. sum
4. floor
5. power
6. sub
7. floor_divide

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162349
Approved by: https://github.com/pearu, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-11 18:36:24 +00:00
799471d92b [triton] Update 3.5 pin (AMD compilation fix + warp spec) (#162733)
Fixes #162390

Also adds warp spec (thanks @manman-ren!)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162733
Approved by: https://github.com/atalman
2025-09-11 18:19:16 +00:00
43d9b5ecaa [ONNX] Set fallback=False by default (#162726)
This change addresses confusing error messages users encounter when using the ONNX exporter with default settings. Previously, `fallback=True` was the default, which would attempt to fall back to the TorchScript exporter when the dynamo path failed, leading to mixed error messages that obscured the actual issues.

## Problem

When `fallback=True` by default:
- Users get confusing error messages mixing dynamo and TorchScript export failures
- Error messages tell users to provide the `f` argument unnecessarily
- Dynamo error messages get flushed with TorchScript errors when both paths fail
- Users expecting the dynamo path get unexpected fallback behavior

## Solution

Changed the default from `fallback=True` to `fallback=False` in both:
- `torch.onnx.export()` function
- `torch.onnx._internal.exporter._compat.export_compat()` function

## Impact

**Before:**
```python
# Would fallback to TorchScript on dynamo failure, causing mixed error messages
torch.onnx.export(model, args)
```

**After:**
```python
# Clean dynamo-only errors by default
torch.onnx.export(model, args)

# Advanced users can still opt-in to fallback behavior
torch.onnx.export(model, args, fallback=True)
```

Fixes #162697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162726
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2025-09-11 18:09:58 +00:00
463fbc8ca0 Support vmap + custom autograd function/improve DTensor constructor inefficiency (#162240)
This makes gemma3 exportable on transformers=4.55.4

In HF, there is a torch function mode called TransformGetItemToIndex which internally calls a custom autograd function. When this custom autograd function is called under vmap, it triggers CustomFunctionHigherOrderOP, which errored because there was no pre-dispatch proxy mode implementation.

Since there have been a number of requests lately to add various operators to the pre-dispatch IR, I introduce a decorator in export that works similarly to `allow_in_graph`. Basically:
1) We intercept custom_autograd_function.apply at pre-dispatch mode when this decorator is applied
2) We apply `flat_apply` HOP to hide the pytree spec for this autograd function. Note that this adds restriction that this custom autograd function needs to take in fx-able types.
3) subclass constructor decorator is implemented similarly, so we just refactor it to use similar implementation as this new decorator. eventually we should delete the subclass constructor decorator.
4) Move some code in subclass constructor decorator to exit early in non-export environment which should shave off some inefficiency (around 1% according to @swolchok 's benchmark)

Fixes: https://github.com/pytorch/pytorch/issues/161563#issuecomment-3246309758

Differential Revision: [D82141316](https://our.internmc.facebook.com/intern/diff/D82141316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162240
Approved by: https://github.com/ydwu4
2025-09-11 17:42:41 +00:00
2f53395943 [ez][CI] Fix docs push in nightly workflow (#162657)
HUD metrics page says docs push hasn't happened in 21 days
<img width="293" height="142" alt="image" src="https://github.com/user-attachments/assets/f930aab8-0503-4bf2-b962-8c375dec6b78" />

I guess main branch docs just haven't been updated?  Did anyone notice?  Do we care?

Either way I think this should fix it

Likely started after https://github.com/pytorch/pytorch/pull/161182
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162657
Approved by: https://github.com/huydhn
2025-09-11 16:45:41 +00:00
fccddf02b6 repro 161902 (#162416)
Summary:
Sometimes `ShapeEnv.create_symbol` can return a `sympy.Integer`. This messes up our phantom symbol infra for derived dims.

Fixes #161902

Test Plan:
added test based on repro

Rollback Plan:

Differential Revision: D81960709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162416
Approved by: https://github.com/tugsbayasgalan
2025-09-11 16:35:23 +00:00
8be8b94793 Update SECURITY.md with reporting guidelines (#162608)
Added clarification that all reports will be disclosed within 90 days

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162608
Approved by: https://github.com/seemethere, https://github.com/albanD
2025-09-11 16:30:29 +00:00
suo
fe8cc619b8 [torch][c10d] fix split_group in mixed backend case (#162424)
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options.

However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.

This leads to an assert on some user code, where ProcessGroupGloo::split is expecting gloo options but receives nccl options instead.

Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.

As a quick fix, since user-provided options really only exist for NCCL, just warn and fall back to default options for Gloo if non-Gloo options are detected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang
2025-09-11 16:29:32 +00:00
2f5a24c2a2 Smoke tests don't run nvshmem on Windows (#162646)
Only available for Linux x86 and aarch64:
https://pypi.org/project/nvidia-nvshmem-cu13/#files

nvshmem is available only on Linux:
```
"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | "
```
https://github.com/pytorch/pytorch/blob/main/.github/scripts/generate_binary_build_matrix.py#L57
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162646
Approved by: https://github.com/kwen2501
2025-09-11 16:09:20 +00:00
24492cbab2 [BE] Cleanup stale comments/copy from gemm (#162001)
Followup after https://github.com/pytorch/pytorch/pull/154012

Since the introduction of `gemm_no_downcast_stub` it's no longer necessary to allocate temporary array and then manually implement the `beta` logic in the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162001
Approved by: https://github.com/drisspg
ghstack dependencies: #161999
2025-09-11 15:48:43 +00:00
3f6d88f04c paths to exclude shape guards (#162684)
Summary: Easier to land than https://www.internalfb.com/diff/D82030581

Test Plan:
everything blamed by https://www.internalfb.com/diff/D80713603 (except some old exir tests)

Rollback Plan:

Differential Revision: D82180349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162684
Approved by: https://github.com/tugsbayasgalan
2025-09-11 15:34:06 +00:00
94db2ad51d Revert "Move prioritized text linker optimization code from setup.py to cmake (#160078)"
This reverts commit 26b3ae58908becbb03b28636f7384d2972a8c9a5.

Reverted https://github.com/pytorch/pytorch/pull/160078 on behalf of https://github.com/atalman due to Sorry reverting this broke linux aarch64 CUDA nightlies [pytorch/pytorch/actions/runs/17637486681/job/50146967503](https://github.com/pytorch/pytorch/actions/runs/17637486681/job/50146967503) ([comment](https://github.com/pytorch/pytorch/pull/160078#issuecomment-3281426631))
2025-09-11 15:29:29 +00:00
9f783e172d Revert "Build and Install Arm Compute Library in manylinux docker image (#159737)"
This reverts commit 582d278983b28a91ac0cedd035183f2495bb6887.

Reverted https://github.com/pytorch/pytorch/pull/159737 on behalf of https://github.com/atalman due to Sorry reverting this broke linux aarch64 CUDA nightlies [pytorch/pytorch/actions/runs/17637486681/job/50146967503](https://github.com/pytorch/pytorch/actions/runs/17637486681/job/50146967503) ([comment](https://github.com/pytorch/pytorch/pull/159737#issuecomment-3281398272))
2025-09-11 15:25:24 +00:00
a8432bcaad [dynamo][guards] Fail on an unknown framelocals to dict conversion (#162695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162695
Approved by: https://github.com/williamwen42
ghstack dependencies: #162694
2025-09-11 15:01:00 +00:00
a3a40cb741 [dynamo][guards] Do not consturct framelocals to dict on GlobalsGuardAccessor (#162694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162694
Approved by: https://github.com/williamwen42
2025-09-11 15:01:00 +00:00
c924c675d0 Fix persistent buffer bug (#162190)
For non-persistent buffers, we should properly register them.
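
For context, a minimal example of a non-persistent buffer, the kind of attribute this fix registers properly:

```python
import torch
from torch import nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        # Tracked as a buffer at runtime, but excluded from the state_dict.
        self.register_buffer("scale", torch.ones(1), persistent=False)

m = M()
assert "scale" in dict(m.named_buffers())
assert "scale" not in m.state_dict()
```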

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162190
Approved by: https://github.com/zhxchen17
2025-09-11 14:56:26 +00:00
c3f30eca9e Remove tests-to-include from rocm-mi300 workflow (#162721)
Accidentally introduced by https://github.com/pytorch/pytorch/pull/162288 (was meant to be a temporary change)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162721
Approved by: https://github.com/jeffdaily
2025-09-11 14:36:07 +00:00
1e710552c1 [ROCm][CI] benchmark must patch fbgemm_gpu with tbb dep (#162649)
fbgemm adds tbb as a dep only for ROCm to avoid missing tbb symbols at import. But the way it was done was to add the linker flag to CMAKE_CXX_FLAGS in setup.py, and it wasn't working for reasons unknown to me. What did work was to add tbb as a dep in the cmake file. [We have a PR against upstream fbgemm](https://github.com/pytorch/FBGEMM/pull/4859) for that. Meanwhile, a much smaller patch is applied here in this PR until the fbgemm ROCm CI commit hash is moved forward to include the tbb patch from upstream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162649
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-11 14:10:51 +00:00
7c39b2ecbe use torch.accelerator and device_module instead of cuda to make DataParallel more device agnostic. (#162573)
use torch.accelerator and `_get_device_module` instead of cuda to make DataParallel more device agnostic.

Fixes #162152

Recently, I've done some work to support my own privateuse1 backend in the DataParallel module, but I found that some CUDA-related APIs exist in the parallel_apply.py file, which makes me have to monkey-patch the DataParallel module to support DP on my own backend.

So I made some small changes to replace cuda.xxx with accelerator.xxx, and to acquire the device module via `_get_device_module`.
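
For illustration, a minimal sketch of the device-agnostic pattern (using the public torch.get_device_module here as a stand-in for the internal helper):

```python
import torch

acc = torch.accelerator.current_accelerator()           # e.g. device('cuda') or device('xpu')
if acc is not None:
    device_module = torch.get_device_module(acc.type)   # replaces hard-coded torch.cuda
    print(acc, device_module.device_count())
```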

This is my first time contributing to PyTorch; please let me know if there is any problem with the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162573
Approved by: https://github.com/ezyang, https://github.com/guangyey

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
2025-09-11 10:04:27 +00:00
afdd4247a2 [torchao][pt2e] Make prepare and convert faster by caching (#162550)
Summary: D79674759 tried to fix the expensive prepare and convert steps, as `assert_and_get_unique_device` was called multiple times. This change fixes that issue by using the `functools.cache` decorator.
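
For illustration, a sketch of the caching idea (not the actual torchao helper):

```python
import functools
from torch import nn

@functools.cache
def assert_and_get_unique_device(module: nn.Module):
    # nn.Module hashes by identity, so repeated calls during prepare/convert
    # reuse the first result instead of re-walking every parameter and buffer.
    devices = {p.device for p in module.parameters()} | {b.device for b in module.buffers()}
    assert len(devices) <= 1, f"expected at most one device, got {devices}"
    return next(iter(devices)) if devices else None
```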

Test Plan:
Verified on llm export to QNN.
LLM Quantization prepare time of ~20min reduced to ~3min.

Rollback Plan:

Differential Revision: D82073679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162550
Approved by: https://github.com/andrewor14
2025-09-11 07:59:22 +00:00
22df9332da [serialization] Add pte file to archive (#162520)
Summary:
Add _package_executorch_files to the archive APIs. This allows us to package a PTE file into the archive.

I don't think there's a use-case to have more than one PTE file at the moment, but left it as `EXECUTORCH_FILES` just in case.

Test Plan:
Tested in D81992612

Rollback Plan:

Differential Revision: D81977483

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162520
Approved by: https://github.com/angelayi
2025-09-11 07:59:11 +00:00
6b9b7ce6fe fix torch.sparse.log_softmax on CPU (#161959)
Fix https://github.com/pytorch/pytorch/issues/152293.

**Example:**
```
import torch
from torch.sparse import log_softmax as sparse_log_softmax

def test_bug():
    a = torch.rand(4, 3)
    b = a - 10000000.0
    b_sparse = b.to_sparse()

    cpu_out_sparse = sparse_log_softmax(b_sparse, dim=1).to_dense()
    print('cpu_out_sparse =', cpu_out_sparse)

    b_sparse_double = b.double().to_sparse()
    cpu_out_sparse_double = sparse_log_softmax(b_sparse_double, dim=1).to_dense()
    print('cpu_out_sparse_double =', cpu_out_sparse_double)

if __name__ == '__main__':
    test_bug()
```

**Output:**

- before
```
cpu_out_sparse = tensor([[-2., -1., -2.],
        [-1., -1., -1.],
        [-1., -2., -2.],
        [-1., -1., -2.]])
cpu_out_sparse_double = tensor([[-1.5514, -0.5514, -1.5514],
        [-1.0986, -1.0986, -1.0986],
        [-0.5514, -1.5514, -1.5514],
        [-0.8620, -0.8620, -1.8620]], dtype=torch.float64)
```

- after
```
cpu_out_sparse = tensor([[-0.8620, -1.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986],
        [-1.8620, -0.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986]])
cpu_out_sparse_double = tensor([[-0.8620, -1.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986],
        [-1.8620, -0.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986]], dtype=torch.float64)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161959
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/mingfeima
2025-09-11 07:52:05 +00:00
1274297e06 Remove __torch_dispatch__ check in THPVariable_make_dtensor (#162337)
We control DTensor, so we can just guarantee there isn't a programming error with __torch_dispatch__. (The guard is already less-than-perfect; see the note that the deleted comment refers to.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162337
Approved by: https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219, #162220, #162218, #161596
2025-09-11 06:58:35 +00:00
f68f76d8c7 Remove logger.debug statements in DTensor dispatch (#161596)
These seem to have been costing us 5-10 usec per detach (out of ~95 usec total). If they need to ship, let's talk about requirements and how we can make this more efficient, given that we would prefer an entire DTensor op to finish in 10 usec.

Differential Revision: [D81530106](https://our.internmc.facebook.com/intern/diff/D81530106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161596
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219, #162220, #162218
2025-09-11 06:58:35 +00:00
fa1d409e83 [2/N]Port several test files under test/distributed to Intel GPU (#159473)
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR works on some test files under test/distributed. We enable Intel GPU with the following methods and try our best to keep the original code style:

- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use requires_accelerator_dist_backend to allow both nccl and xccl test
- enable XPU for some test paths
- Change the hardcoded world_size according to device_count.
- Unify some common code under torch/testing/_internal for multiple backends, for example:
  Added xpu for Backend.backend_capability and dist.Backend.register_backend()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-09-11 06:44:26 +00:00
52d4660ae9 [AOTI] Fix Windows fail to zip opened file. (#162617)
Original issue:
<img width="1767" height="544" alt="Image" src="https://github.com/user-attachments/assets/9de90d50-217f-4049-8f19-77ff1660c8b0" />

reproducer:
```cmd
pytest test\inductor\test_aot_inductor.py -v -k test_weight_on_disk_legacy_cpu
```

Fixed list:
1. `WritableTempFile`'s `__exit__` function auto-unlinks the opened file; when the file is still open, this raises an error, so we ignore it on Windows.
2. When opening the zip file, if the file is already open, the operation fails. Switch to `_wfsopen` with the shared-access flag, which can open the file with shared access.

Local test passed:
<img width="1101" height="233" alt="image" src="https://github.com/user-attachments/assets/935cbf2e-52db-41f1-80fa-617569b92a96" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162617
Approved by: https://github.com/jansel
2025-09-11 06:22:21 +00:00
7345454e2e compile_kernel: Handle python floats as c double (#162626)
This was an open todo in the code and probably a footgun in waiting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162626
Approved by: https://github.com/malfet
2025-09-11 06:03:25 +00:00
23170dfebc Revert "Move inductor jobs 3.9->3.10 (#162323)"
This reverts commit 0663bdb12383b9717af49d58aed9d88de0dd0ecc.

Reverted https://github.com/pytorch/pytorch/pull/162323 on behalf of https://github.com/huydhn due to Not sure what had happened, but some inductor unit tests start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/162323#issuecomment-3278125192))
2025-09-11 05:57:13 +00:00
12e993f533 compile_kernel large shared memory fix (#162647)
Alternate solution to https://github.com/pytorch/pytorch/pull/162328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162647
Approved by: https://github.com/eqy
2025-09-11 05:52:46 +00:00
07d2531672 [vllm hash update] update the pinned vllm hash (#162551)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162551
Approved by: https://github.com/pytorchbot
2025-09-11 04:56:04 +00:00
6944d4b639 [ROCm] rocblas Aten GEMM overload for FP32 output from FP16/BF16 inputs (#162600)
Fix ROCm GEMM helper to set output type (C/D) based on C_Dtype template parameter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162600
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-09-11 03:34:07 +00:00
f654cff566 [inductor] Add shape to load_input in matmul templates (#162513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162513
Approved by: https://github.com/eellison
ghstack dependencies: #162426
2025-09-11 01:51:15 +00:00
f17c5e0789 [inductor] Add shape for store_output in matmul templates (#162426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162426
Approved by: https://github.com/eellison
2025-09-11 01:51:15 +00:00
435c18fb4a [DTensor] add op support for aten.unbind.int (#162560)
As titled.

It seems unbind returns views of the original tensor. E.g. see https://stackoverflow.com/questions/78910951/does-unbind-return-the-views-of-tensors-in-pytorch

So we error out when `shard_dim == unbind_dim`. This is similar to why we error out in view ops.
https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_view_ops.py#L544-L546
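
For illustration, a small local example of why unbind is view-producing:

```python
import torch

t = torch.arange(6).reshape(2, 3)
a, b = torch.unbind(t, dim=0)   # each result is a view into t, not a copy
a[0] = 99
print(t[0, 0])                  # tensor(99): mutating the view mutates the original
```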

This PR also refactors some other tensor ops code by creating two util functions, `shift_shard_dims_after_insert` and `shift_shard_dims_after_remove`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162560
Approved by: https://github.com/zpcore
2025-09-11 00:58:23 +00:00
612cdc8f48 -ldl for nativert tests (#162643)
Fixes #162640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162643
Approved by: https://github.com/yiming0416, https://github.com/robert-hardwick
2025-09-11 00:35:57 +00:00
da5069f289 Don't include cuh header when USE_NVSHMEM is off (#162635)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162635
Approved by: https://github.com/kwen2501
2025-09-11 00:24:50 +00:00
4fd2a2b273 Add cuda headers automatically for compile_kernel (#162634)
Issue was pointed out before by @ngimel and more recently by https://gau-nernst.github.io/nvrtc-matmul/#missing-cuda-and-c-headers- by @gau-nernst

Benefit is now we can add

`#include <cuda_fp16.h>` without crapping out
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162634
Approved by: https://github.com/ngimel
2025-09-11 00:20:33 +00:00
bb1d53bc47 [CD] CUDA 13 specific followup changes (#162455)
Follow up for CUDA 13 bring up https://github.com/pytorch/pytorch/issues/159779
sm50-70 should not be added to the sbsa build arch list, as those older archs had no support for Arm.
Remove platform_machine from PYTORCH_EXTRA_INSTALL_REQUIREMENTS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162455
Approved by: https://github.com/atalman
2025-09-11 00:03:47 +00:00
36338fc7f2 Relax fences for intrusive ptr's refcnt (#162072)
Summary: Relax fences for intrusive ptr's refcnt dec op for performance testing.

lock needs acquire ordering when the op succeeds and relaxed ordering when it does not. In addition, the expire call and the following refcnt reads were merged to remove one extra read.

incref does not need any fences because the caller should already have a valid reference. use_count follows the same reasoning.

decref only needs a release fence to make sure every write op prior to it has finished. When the refcnt goes to zero, there should be an acquire fence to make sure no read op reads stale data before the object is destructed. However, microbenchmarks showed that the optimal fence for decref does not perform noticeably better than the current decref with acq-rel, so we keep decref as-is.

This change should have no material impact on x86, but for Arm64 (and other CPUs with weak memory models), it should boost performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162072
Approved by: https://github.com/swolchok, https://github.com/yfeldblum
2025-09-10 23:17:01 +00:00
e0c910149c Build fbgemm_gpu for TORCH_CUDA_ARCH_LIST=10.0 and CUDA 12.8 and 12.9 (#162544)
## Summary
- pytorch is not built for *a variants of SM architectures, due to non-portability. However, we need fbgemm_gpu kernels built for sm100a (see #162209)

## Changes
- **Setting USE_FBGEMM_GENAI for CUDA builds**: fbgemm_gpu builds for sm100a if using CUDA 12.8 or 12.9 ([source](2033a0a08f/.github/scripts/nova_dir.bash (L29-L32))), so I follow the same rule here.
- **Extra nvcc flags**: if USE_FBGEMM_GENAI and USE_CUDA are set, we add extra nvcc flags for sm100a

## Test plan

Test build:
```
echo $CUDA_HOME
/usr/local/cuda-12.9

export TORCH_CUDA_ARCH_LIST=10.0
python -m pip install --no-build-isolation -v -e .
```

Check build logs:
```
  CMake Warning at CMakeLists.txt:901 (message):
    Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a
```

Run unit tests:
- `pytest test/test_matmul_cuda.py  -k test_mxfp8_scaled_grouped_mm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162544
Approved by: https://github.com/drisspg
2025-09-10 22:59:41 +00:00
f4aeceaa9d Use upper bound for persistent rblock (#162441)
Previously, we were using 128 and increasing to the upper bound. We should instead set it at the upper bound and round up to the next power of 2.

Differential Revision: [D81984103](https://our.internmc.facebook.com/intern/diff/D81984103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162441
Approved by: https://github.com/PaulZhang12
2025-09-10 22:29:02 +00:00
d8e6b2fddc [Cutlass] Add exp and sigmoid activations (#162536)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162536
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
ghstack dependencies: #162535
2025-09-10 21:44:26 +00:00
31c25c7d01 [Cutlass] Add tanh activation and test case for activations (#162535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162535
Approved by: https://github.com/henrylhtsang
2025-09-10 21:44:26 +00:00
eqy
5dbee5691c [cuDNN][Convolution][TF32][64bit] Add tf32_on_and_off decorator to conv3d 64bit test (#161004)
cuDNN has new generated kernels that can use TF32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161004
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2025-09-10 21:39:35 +00:00
864ffe12d7 Fix some edge cases (#162295)
```
Summary
🔝 Top 5 Performance Differences (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬─────────────────────────────┬───────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)       ┆ TFlops BWD (base) ┆ TFlops BWD (no_peel) ┆ no_peel_speedup_over_base ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                         ┆ ---               ┆ ---                  ┆ ---                       ┆ ---       │
│ str            ┆ str            ┆ str                         ┆ f64               ┆ f64                  ┆ f64                       ┆ f64       │
╞════════════════╪════════════════╪═════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════╪═══════════╡
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 56.937931         ┆ 58.960459            ┆ 1.035522                  ┆ 3.552163  │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 128) ┆ 89.221306         ┆ 86.295642            ┆ 0.967209                  ┆ -3.27911  │
│ causal         ┆ torch.bfloat16 ┆ (2, 16, 4096, 4, 4096, 128) ┆ 111.552594        ┆ 114.380841           ┆ 1.025353                  ┆ 2.535349  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, 1024, 64) ┆ 74.830149         ┆ 76.685445            ┆ 1.024793                  ┆ 2.479344  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 55.279932         ┆ 56.369312            ┆ 1.019707                  ┆ 1.97066   │
└────────────────┴────────────────┴─────────────────────────────┴───────────────────┴──────────────────────┴───────────────────────────┴───────────┘

🔺 Top 5 Cases Where no_peel (change) is Faster than base (baseline):
shape: (5, 7)
┌────────────────┬────────────────┬─────────────────────────────┬───────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)       ┆ TFlops BWD (base) ┆ TFlops BWD (no_peel) ┆ no_peel_speedup_over_base ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                         ┆ ---               ┆ ---                  ┆ ---                       ┆ ---       │
│ str            ┆ str            ┆ str                         ┆ f64               ┆ f64                  ┆ f64                       ┆ f64       │
╞════════════════╪════════════════╪═════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════╪═══════════╡
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 56.937931         ┆ 58.960459            ┆ 1.035522                  ┆ 3.552163  │
│ causal         ┆ torch.bfloat16 ┆ (2, 16, 4096, 4, 4096, 128) ┆ 111.552594        ┆ 114.380841           ┆ 1.025353                  ┆ 2.535349  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, 1024, 64) ┆ 74.830149         ┆ 76.685445            ┆ 1.024793                  ┆ 2.479344  │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 64)  ┆ 55.279932         ┆ 56.369312            ┆ 1.019707                  ┆ 1.97066   │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 4096, 4, 4096, 64)  ┆ 111.08814         ┆ 112.447047           ┆ 1.012233                  ┆ 1.22327   │
└────────────────┴────────────────┴─────────────────────────────┴───────────────────┴──────────────────────┴───────────────────────────┴───────────┘

🔻 Top 5 Cases Where no_peel (change) is Slower than base (baseline):
shape: (5, 7)
┌────────────────┬────────────────┬─────────────────────────────┬───────────────────┬──────────────────────┬───────────────────────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)       ┆ TFlops BWD (base) ┆ TFlops BWD (no_peel) ┆ no_peel_speedup_over_base ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                         ┆ ---               ┆ ---                  ┆ ---                       ┆ ---       │
│ str            ┆ str            ┆ str                         ┆ f64               ┆ f64                  ┆ f64                       ┆ f64       │
╞════════════════╪════════════════╪═════════════════════════════╪═══════════════════╪══════════════════════╪═══════════════════════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, 1024, 128) ┆ 89.221306         ┆ 86.295642            ┆ 0.967209                  ┆ -3.27911  │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 4, 1024, 64)  ┆ 78.23082          ┆ 76.693169            ┆ 0.980345                  ┆ -1.965531 │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 4, 2048, 128) ┆ 96.95663          ┆ 95.573333            ┆ 0.985733                  ┆ -1.426717 │
│ alibi          ┆ torch.bfloat16 ┆ (4, 16, 2048, 4, 2048, 64)  ┆ 93.373473         ┆ 92.294147            ┆ 0.988441                  ┆ -1.155924 │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 2048, 4, 2048, 128) ┆ 96.95147          ┆ 96.105389            ┆ 0.991273                  ┆ -0.872685 │
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162295
Approved by: https://github.com/mlazos, https://github.com/v0i0
2025-09-10 21:33:45 +00:00
4e35594674 [Lowering] Fix the edge case of empty subgraph split due to dataclass node (#161716)
Summary: Fix the edge case by allowing `call_function` nodes with no deps as graph entry (starter_nodes) in the splitter.

Test Plan:
The test shall pass in the current diff (after fix), and fail in the parent diff (before fix)

```
buck test mode/opt //glow/fb/fx/lowering:split_tests -- test_dataclass_as_graph_entry
```

Rollback Plan:

Differential Revision: D81232435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161716
Approved by: https://github.com/ezyang
2025-09-10 21:23:42 +00:00
35d7b32159 Improve device info with new flops and bandwidth formula based on hardware libraries (#162245)
Previously, DeviceInfo provided theoretical hardware information based on a hardcoded list manually created from various datasheets.

This update:
- Attempts to gather the information from a hardware library like `pynvml`, improving accuracy and expanding support to devices that don't have entries in the datasheet list (a rough sketch of the idea follows below).
- Adjusts flops and bw calculation based on these hardware values. For example, if the memory or SMs are underclocked, it adjusts the theoretical max flops/bw accordingly.
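As a rough sketch of the idea (not the actual DeviceInfo code; the NVML calls and the DDR-style bandwidth formula below are assumptions for illustration):

```
# Hypothetical sketch: derive an approximate peak memory bandwidth from pynvml.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

mem_clock_mhz = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_MEM)
bus_width_bits = pynvml.nvmlDeviceGetMemoryBusWidth(handle)

# Assume two transfers per clock (DDR-style); result in GB/s.
peak_bw_gbs = 2 * mem_clock_mhz * 1e6 * (bus_width_bits / 8) / 1e9
print(f"approx. peak memory bandwidth: {peak_bw_gbs:.0f} GB/s")

pynvml.nvmlShutdown()
```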

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162245
Approved by: https://github.com/v0i0, https://github.com/shunting314
2025-09-10 21:19:13 +00:00
0663bdb123 Move inductor jobs 3.9->3.10 (#162323)
Related to: https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162323
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-09-10 20:58:41 +00:00
40ea6e418a Revert "Fix decorators skipping NCCL tests (#158846)"
This reverts commit c2388201fc85b0748173212de5a17514c7a71f21.

Reverted https://github.com/pytorch/pytorch/pull/158846 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some inductor tests ([comment](https://github.com/pytorch/pytorch/pull/158846#issuecomment-3276471387))
2025-09-10 20:51:31 +00:00
348303ebd2 [ez] add docstring/typing for codegen_kernel_benchmark (#162609)
```
lintrunner init && lintrunner -m origin/main
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162609
Approved by: https://github.com/coconutruben
ghstack dependencies: #162442
2025-09-10 20:49:38 +00:00
94755e81c4 [inductor] Enable combo kernels with unbacked inputs (#162442)
An internal user tried enabling combo kernels but ran into "Cannot convert symbols to int". This PR enables combo kernels on inputs with data-dependent shapes.

### Example exception

```
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 4997, in benchmark_combo_kernel
    kernel_code_list = self.generate_combo_kernel_code(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/simd.py", line 1849, in generate_combo_kernel_code
    src_code = kernel.codegen_kernel()
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 802, in codegen_kernel
    code.splice(self.codegen_kernel_benchmark(num_gb=0))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 852, in codegen_kernel_benchmark
    var_names.extend(self.kernel_benchmark_extra_args())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton_combo_kernel.py", line 733, in kernel_benchmark_extra_args
    extra_args.append(str(V.graph.sizevars.size_hint(tree.numel)))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 584, in size_hint
    return int(out)
           ^^^^^^^^
  File "/home/colinpeppler/.conda/envs/pytorch/lib/python3.12/site-packages/sympy/core/expr.py", line 307, in __int__
    raise TypeError("Cannot convert symbols to int")
torch._inductor.exc.InductorError: TypeError: Cannot convert symbols to int
```

Differential Revision: [D82042230](https://our.internmc.facebook.com/intern/diff/D82042230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162442
Approved by: https://github.com/jansel
2025-09-10 20:49:38 +00:00
6d65737aee testing infra and some fixes (#162183)
This PR is quite large in that it covers most of the rough edges in the new strict export flow:

1. Handle nn_module_stack correctly now that we are tracing wrapper module
2. module_call_spec needs to get queried from source directly because we are not running the bytecode anymore.
3. Correct input and output handling.

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183
Approved by: https://github.com/zhxchen17
2025-09-10 20:48:12 +00:00
053251b98d Revert "Make functorch notebook symlinks PEP 517 valid (#157813)"
This reverts commit b494547f0bd6cb1ce5d8d104cb419802434c9c08.

Reverted https://github.com/pytorch/pytorch/pull/157813 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but this surfaces a weird discrepancy between GitHub and Mecurial used internally ([comment](https://github.com/pytorch/pytorch/pull/157813#issuecomment-3276442242))
2025-09-10 20:45:48 +00:00
7e2e83cdbe [ONNX] Update export docstring (#162622)
Update export docstring to reflect the latest configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162622
Approved by: https://github.com/titaiwangms
2025-09-10 20:29:46 +00:00
d033d11d26 Revert "[torch][c10d] fix split_group in mixed backend case (#162424)"
This reverts commit 2dc26131801a430e030a773c4fbfe874e263259d.

Reverted https://github.com/pytorch/pytorch/pull/162424 on behalf of https://github.com/clee2000 due to failure seems related, maybe a hang/timeout distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_model_diff_shape_across_ranks log classifier is pointing at the wrong line ([comment](https://github.com/pytorch/pytorch/pull/162424#issuecomment-3276360494))
2025-09-10 20:13:44 +00:00
80d4da893c Revert "Put torchao (0.13.0) back to benchmark workflow (#162227)"
This reverts commit 00985970e312c3c5e674e8e14d39fe77c226600e.

Reverted https://github.com/pytorch/pytorch/pull/162227 on behalf of https://github.com/huydhn due to Crashing some inductor jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/162227#issuecomment-3276355034))
2025-09-10 20:11:37 +00:00
bf7f481144 Update misleading torch.sparse_coo_tensor error check (#161900)
Fixes #160622

### Summary
Updated the misleading torch.sparse_coo_tensor error check to provide clear context.
Earlier:
`RuntimeError: number of dimensions must be sparse_dim (3) + dense_dim (0), but got 1`

Updated:
`RuntimeError: 'len(size) == sparse_dim + dense_dim' is not satisfied: len(size) = 1, sparse_dim = 3, dense_dim = 0`
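A minimal repro of the shape mismatch that triggers this check (the values are made up for illustration):

```
import torch

indices = torch.tensor([[0], [1], [2]])  # shape (3, nnz) -> sparse_dim = 3
values = torch.tensor([1.0])             # shape (nnz,)   -> dense_dim = 0

try:
    torch.sparse_coo_tensor(indices, values, size=(4,))  # len(size) = 1
except RuntimeError as e:
    print(e)  # now spells out len(size), sparse_dim and dense_dim
```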

**Impacts:**

- Comprehensive error message that will improve developer experience.
- module: sparse

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161900
Approved by: https://github.com/nikitaved, https://github.com/pearu
2025-09-10 19:57:11 +00:00
ab0694f1c6 [ROCm][Inductor][CK backend] Install rocm-composable-kernel python package on ROCm Linux CI docker images (#162288)
Reopened from #158747 which got reverted since without setuptools-scm in pytorch index URL the wheel cannot be built

We reconsider the original PR idea of introducing CK as a pytorch dependency on ROCm Linux and install the CK python package in CI only -- since (1) rocm-composable-kernel depends on setuptools-scm which depends on tomli and the existing index URLs need to be modified to host the new packages and (2) there also is a packaging [bug](https://github.com/pypa/setuptools/issues/3269#issuecomment-1254507377) in Ubuntu 22.04 which prevents correct dynamic version calculation with default system pip.

Extras:

- This PR reconsiders how the TORCHINDUCTOR_CK_DIR env variable is used; previously, this var pointed to the rocm-composable-kernel package installation path on the filesystem; now, the path is inferred by trying to import ck4inductor.
- The tests are updated to reflect this change.
- Since in CI clang points to a bash script which invokes sccache, we cannot patch PATH to not contain sccache; this logic is removed from the testing code.
- The scaled_mm test crashes during benchmarking when the benchmarking happens in the main process, and times out when it happens in a subprocess, on gfx942, so it is disabled.

TBD: roll back rocm-mi300 workflow before merging

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162288
Approved by: https://github.com/jeffdaily
2025-09-10 19:33:40 +00:00
5f630d28d7 [dynamo][guards] Do not construct entire framelocals dict for LAMBDA_GUARD (#162525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162525
Approved by: https://github.com/williamwen42
ghstack dependencies: #162509
2025-09-10 18:52:15 +00:00
a67e798cb7 [dynamo][guards] Prevent framelocals to dict conversion for not required LAMBDA_GUARD (#162509)
This is a smaller PR to reduce framelocals to dict conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162509
Approved by: https://github.com/williamwen42
2025-09-10 18:52:15 +00:00
30191fcf03 [inductor][choices] rename get_mm_configs to get_template_configs (#162293)
# why

- eventually we want all templates to go through this
- we're exposing this through diode as a sort of interface/API
- avoid later renaming

# what

- rename get_mm_configs to get_template_configs
- rename _finalize_mm_configs to _finalize_template_configs

# testing

- lintrunner
- ci

Differential Revision: [D81820641](https://our.internmc.facebook.com/intern/diff/D81820641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162293
Approved by: https://github.com/eellison
ghstack dependencies: #161351, #161350
2025-09-10 18:47:44 +00:00
623e623c82 [inductor] leverage template stacking in V.choices.get_mm_configs (#161350)
# why

- now everything is in place to just gather templates and run
  the V.choices.get_mm_configs once per op
- enables any overrides inside V.choices.get_mm_configs to
  have a full view of the options for an op, not just for
  one template

# what

- replace multiple calls to V.choices.get_mm_configs with
  calls to gather the active templates, and then using those
  in a single call

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520571](https://our.internmc.facebook.com/intern/diff/D81520571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161350
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161351
2025-09-10 18:47:44 +00:00
f08487aa86 [inductor] FlexibleLayout for ExternKernelChoice for mms (#161351)
# why

- if we only use ExternKernelChoice we're not doing any codegen
- if we're not doing any codegen, we can use a FlexibleLayout
  here, and provide deeper passes more chances to change it

# what

- if all the kernel template choices (KTC) are with a ExternKernelChoice
  template, we switch to a FlexibleLayout before generating the choice
- add a test to make sure that works as intended (FlexibleLayout for
  only extern, and FixedLayout if Triton is involved)

- caveats:
    - because CPP, CUTLASS, and CK are not using
       V.choices.get_mm_configs yet, we turn off the optimization
       if either of those backends are in use. This will be relaxed
       once they support this too
    - because Triton templates are still using their own calls
       (not a single call) to get_mm_configs, it's also turned
       off there. The next diff unifies Triton + ATEN to a single
       call to get_mm_configs and that in turn allows the optimization
       there too

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520584](https://our.internmc.facebook.com/intern/diff/D81520584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161351
Approved by: https://github.com/eellison, https://github.com/jansel
2025-09-10 18:47:34 +00:00
1051c7dbc2 Don't unconditionally import torch._dynamo, it's slow (#162595)
A trivial test on OS X.

Before:

```
real	0m6.550s
user	0m2.532s
sys	0m3.359s
```

After:

```
real	0m2.607s
user	0m1.898s
sys	0m3.344s
```

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162595
Approved by: https://github.com/albanD
2025-09-10 17:21:03 +00:00
suo
2dc2613180 [torch][c10d] fix split_group in mixed backend case (#162424)
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options.

However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.

This leads to an assert on some user code, where ProcessGroupGloo::split is expecting gloo options but receives nccl options instead.

Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.

As a quick fix, since user-provided options really only exist for NCCL, just warn and fall back to default options for Gloo if non-Gloo options are detected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang
2025-09-10 16:59:18 +00:00
582d278983 Build and Install Arm Compute Library in manylinux docker image (#159737)
----

This PR is part of a series of PRs that aim to remove the `.ci/aarch64_linux` folder entirely, so that the Aarch64 manylinux build happens as part of `.ci/manywheel/build.sh`, the same as on other platforms.

In this PR:

- We prebuild + install the Arm Compute Library in the manylinux docker image (at /acl), instead of building it at build time for every pytorch build. Also updated the jammy install path to be /acl too.
- We can therefore remove the build_ArmComputeLibrary functions from the ci build scripts.
- There is also some refactoring of install_openblas.sh and install_acl.sh to align them together (similar formatting, similar variable names, same place for version number update).
- We had 2 places to define the openblas version; this has been reduced to 1 now (install_openblas.sh).
- ACL_VERSION and OPENBLAS_VERSION can now be overridden at the build.sh level for developers, but there is only 1 version of each hardcoded for ci.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159737
Approved by: https://github.com/seemethere
ghstack dependencies: #160078
2025-09-10 15:39:38 +00:00
b5e6e58050 [nn] Assert parsed iterable arguments are an appropriate length (#162340)
Fixes #162327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162340
Approved by: https://github.com/Skylion007
2025-09-10 15:15:49 +00:00
fefc406a3d fix typo: summit -> submit (#162587)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162587
Approved by: https://github.com/justinchuby
2025-09-10 14:43:53 +00:00
3d32bb114b [CD] Aarch64 Fix packaging `libarm_compute.so` and other libraries to the aarch64 CUDA wheels (#162566)
Fixes aarch64 Linux packaging, which previously hit the following error:
https://github.com/pytorch/vision/actions/runs/17612462583/job/50037380487#step:15:62
```
Traceback (most recent call last):
  File "/__w/vision/vision/pytorch/vision/setup.py", line 13, in <module>
    import torch
  File "/__w/_temp/conda_environment_17612462583/lib/python3.11/site-packages/torch/__init__.py", line 415, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libarm_compute.so: cannot open shared object file: No such file or directory
```
Due to missing dependencies.

Current behavior (error):
1. torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
2. The file is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
3. torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is then renamed to torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl (overwriting the repackaged wheel)

Hence the repackaging does not take effect.

This PR does the following:
1. torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is extracted
2. torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl is deleted
3. The file is repackaged as torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl

It looks like after migrating from zipping the wheel to wheel pack, renaming the wheel is no longer necessary, so the renaming step is removed and the old file is deleted.
```
2025-09-10T10:10:05.9652454Z Using nvidia libs from pypi - skipping CUDA library bundling
2025-09-10T10:10:05.9656595Z Copying to /pytorch/dist/tmp/torch/lib/libgomp.so.1
2025-09-10T10:10:05.9873843Z Copying to /pytorch/dist/tmp/torch/lib/libgfortran.so.5
2025-09-10T10:10:06.0410041Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute.so
2025-09-10T10:10:06.2869242Z Copying to /pytorch/dist/tmp/torch/lib/libarm_compute_graph.so
2025-09-10T10:10:06.4385740Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_lp64_gomp.so.0
2025-09-10T10:10:06.5461372Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_lp64_gomp.so.0
2025-09-10T10:10:06.5728970Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_lapack_core.so.0
2025-09-10T10:10:06.6231872Z Copying to /pytorch/dist/tmp/torch/lib/libnvpl_blas_core.so.0
2025-09-10T10:10:14.1503110Z Updated tag from Tag: cp310-cp310-linux_aarch64
2025-09-10T10:10:14.1503482Z  to Tag: cp310-cp310-manylinux_2_28_aarch64
2025-09-10T10:10:14.1503682Z
2025-09-10T10:10:41.6498892Z Repacking wheel as /pytorch/dist/torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
2025-09-10T10:10:41.9394460Z Renaming torch-2.10.0.dev20250910+cu130-cp310-cp310-linux_aarch64.whl wheel to torch-2.10.0.dev20250910+cu130-cp310-cp310-manylinux_2_28_aarch64.whl
```

Test Plan, Executed on local file:
```
  inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/WHEEL
  inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/entry_points.txt
  inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/top_level.txt
  inflating: ubuntu/dist/tmp/torch-2.9.0.dev20250909+cu130.dist-info/RECORD
Bundling CUDA libraries with wheel
Updated tag from Tag: cp310-cp310-manylinux_2_28_aarch64
 to Tag: cp310-cp310-manylinux_2_28_aarch64

Repacking wheel as ubuntu/dist/torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl...OK
Copying torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl to artifacts
Build Complete. Created torch-2.9.0.dev20250909+cu130-cp310-cp310-manylinux_2_28_aarch64.whl..
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162566
Approved by: https://github.com/jeanschmidt, https://github.com/NicolasHug
2025-09-10 14:22:41 +00:00
de05dbc39c Replace export_for_training with export (#162396)
Summary: replace export_for_training with export

Test Plan:
CI

Rollback Plan:

Differential Revision: D81935792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162396
Approved by: https://github.com/angelayi, https://github.com/jerryzh168
2025-09-10 14:19:34 +00:00
fc1b09a52a Revert "Fix DCE eliminating in-place operations by improving Node.is_impure() (#162267)"
This reverts commit b9a7d0e13b4a34be83c778734dbad437c7c5117b.

Reverted https://github.com/pytorch/pytorch/pull/162267 on behalf of https://github.com/malfet due to Not sure how it happened, but looks like it broke everything, see c2388201fc/1 ([comment](https://github.com/pytorch/pytorch/pull/162267#issuecomment-3275164109))
2025-09-10 14:12:22 +00:00
c2388201fc Fix decorators skipping NCCL tests (#158846)
Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip`

In particular it will not try to start the test (causing forks into subprocess) just to stop them (killing the subprocess) which is done in the test setup

Using `unittest.skip` decorators avoids starting the test in the first place.
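A minimal sketch of the two patterns (the test names and the availability check are illustrative, not the actual decorators used in the suite):

```
import sys
import unittest

import torch.distributed as dist

class NcclSmokeTest(unittest.TestCase):
    # Preferred: the test is never started, so no worker subprocesses are forked.
    @unittest.skipUnless(dist.is_nccl_available(), "NCCL not available")
    def test_allreduce(self):
        ...

    # Problematic: the test (and its setup) starts, then bails out via sys.exit
    # instead of being reported as skipped.
    def test_allreduce_exit_style(self):
        if not dist.is_nccl_available():
            sys.exit(0)
        ...

if __name__ == "__main__":
    unittest.main()
```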

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846
Approved by: https://github.com/Skylion007
2025-09-10 12:25:42 +00:00
a6f9e0e62a [c10d][nvshmem] fix override function modifier (#162515)
Summary: Fix a compilation error in fbsource caused by a missing override modifier

Differential Revision: D82038876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162515
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-09-10 11:35:49 +00:00
337fe1079d [nativert] AOTI delegate with flat inputs and outputs (#162538)
Summary: `executorch_call_delegate` should have flattened inputs and outputs, so that it can be correctly serialized and the input/output specs are consistent with the runtime.

Test Plan:
CI

Rollback Plan:

Differential Revision: D82064354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162538
Approved by: https://github.com/dolpm
2025-09-10 11:35:44 +00:00
b494547f0b Make functorch notebook symlinks PEP 517 valid (#157813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157813
Approved by: https://github.com/zou3519, https://github.com/atalman
2025-09-10 10:13:24 +00:00
d9832d8425 [triton][export] serialization in internal path + unit tests (#162200)
Summary: will package triton artifacts to be runnable in nativert if wrappers exist.

Test Plan:
unit tests

Rollback Plan:

Differential Revision: D81368559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162200
Approved by: https://github.com/angelayi
2025-09-10 09:49:10 +00:00
f0ae3a57f6 [Optimus] Add batch dropout pattern (#162443)
Summary: We observe a dropout pattern in AFOC, so we add a new pattern to Optimus

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion -- test_batch_dropout_pre_grad_fusion
```

Buck UI: https://www.internalfb.com/buck2/2c899fb5-6e8b-43eb-8fb3-b53abfbfa6d9
Test UI: https://www.internalfb.com/intern/testinfra/testrun/15762598805248688
Network: Up: 0B  Down: 0B  (reSessionID-bfbb9e6a-7e2a-425a-a027-b44282cef419)
Executing actions. Remaining     0/3                                                                                                     1.3s exec time total
Command: test.     Finished 2 local
Time elapsed: 1:22.3s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

### E2E

baseline
f791163796

proposal
f793225207

Rollback Plan:

Differential Revision: D81981264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162443
Approved by: https://github.com/Yuzhen11, https://github.com/mlazos
2025-09-10 09:49:01 +00:00
26b3ae5890 Move prioritized text linker optimization code from setup.py to cmake (#160078)
Note. This is a replica PR of #155901 which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.

### Summary

🚀 This PR moves the prioritized text linker optimization from setup.py to cmake ( and enables by default on Linux aarch64 systems )

This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.

### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.

Note:

Due to ninja/cmake graph generation issues we cannot apply the linker file globally to all targets, so the targets must be defined manually. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, and torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.

Co-authored-by: Usamah Zaheer <usamah.zaheer@arm.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160078
Approved by: https://github.com/seemethere
2025-09-10 09:21:53 +00:00
be8095b07f [DeviceMesh] Clarifying flatten use case (#161311)
Since we are in the middle of a big refactor to simplify the bookkeeping for DeviceMesh, we found an interesting bug inside the DeviceMesh flatten implementation. Here are the findings:
1. In the unit test, we assume users can call `dp_cp_mesh._flatten()` many times but no backend will be created (aka it is cached).
2. From the implementation of slicing, we actually throw an exception when doing `_flatten` more than once. There is a bug which was partially fixed in https://github.com/pytorch/pytorch/pull/160709, but that fix does not cover the check for the case when we call `_flatten` twice.

The more important question to ask is: what behavior do we want for `_flatten`? Do we allow calling `_flatten` multiple times (with the same mesh_name)? I think we should. Why?
1. We allow slicing for the same mesh_name or name_list multiple times, and we cache the PGs behind it. Although we return a new device mesh object every time, when we compare them they are all the same (according to __eq__).
2. We already cache the flattened mesh today inside `root_to_flatten_mapping` and do an early return, but that line will never be reached if we error out before it.

Also, flattening a 1D mesh into its own mesh_dim_name should be allowed as a no-op; I added a unit test for it.
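A small sketch of the intended behavior, assuming a fake process group can back the meshes in a single process (`_flatten` is a private API, so the exact semantics here are per this PR):

```
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.testing._internal.distributed.fake_pg import FakeStore

dist.init_process_group("fake", store=FakeStore(), rank=0, world_size=4)

mesh = init_device_mesh("cpu", (2, 2), mesh_dim_names=("dp", "cp"))

# Calling _flatten twice with the same name should be allowed; the flattened
# mesh is cached, so the second call returns an equal mesh without creating
# a new backend.
dp_cp_1 = mesh._flatten("dp_cp")
dp_cp_2 = mesh._flatten("dp_cp")
assert dp_cp_1 == dp_cp_2

# Flattening a 1D mesh into its own mesh_dim_name is a no-op.
dp_only = init_device_mesh("cpu", (4,), mesh_dim_names=("dp",))
assert dp_only._flatten("dp") == dp_only

dist.destroy_process_group()
```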

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161311
Approved by: https://github.com/fegin
2025-09-10 07:46:51 +00:00
b2d8f6a6af [OpenReg] Update the docs about Accelerator Integration (#162046)
Fix the issue described by this [comment](https://github.com/pytorch/pytorch/pull/161845#discussion_r2317299390)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162046
Approved by: https://github.com/albanD
2025-09-10 07:45:07 +00:00
98e22c8a69 Skip test_ind_worker_queue on Windows and macOS (flaky) (#162555)
Fixes https://github.com/pytorch/pytorch/issues/68643

It was closed by the bot yesterday and the issue was still there https://github.com/pytorch/pytorch/actions/runs/17595694816/job/49989589647.  It's better to just skip it directly in the code as this test has been disabled on Windows and MacOS since 2021 O_o
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162555
Approved by: https://github.com/clee2000
2025-09-10 07:05:14 +00:00
e1f0a69943 Revert "test fixing benchmarks (#162503)"
This reverts commit 484c4093a87a3e6767e55ed553f95db8fc137442.

Reverted https://github.com/pytorch/pytorch/pull/162503 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it regresses CPU perf smoke test ([comment](https://github.com/pytorch/pytorch/pull/162503#issuecomment-3273554680))
2025-09-10 06:55:35 +00:00
833997a6fd [Inductor][UT] Fix flex attention related inductor cases (#162450)
## Motivation
Fixes #162435, Fixes #162436

UT failures:
* https://github.com/pytorch/pytorch/actions/runs/17523991468/job/49772651636
* https://github.com/pytorch/pytorch/actions/runs/17523991468/job/49772651637

To fix flex attention related cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162450
Approved by: https://github.com/drisspg
2025-09-10 06:48:00 +00:00
b9a7d0e13b Fix DCE eliminating in-place operations by improving Node.is_impure() (#162267)
Change is_impure to check in-place operations on Node to prevent eliminate_dead_code from eliminating in-place operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162267
Approved by: https://github.com/ezyang
2025-09-10 06:02:15 +00:00
1c16c18a53 [nativert][triton] improve hardware registration (#162499)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D82031814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162499
Approved by: https://github.com/angelayi
2025-09-10 04:52:57 +00:00
96ef26f71a Revert "[ROCm] Integrate AITER Fav3 fwd kernels (#160105)"
This reverts commit d2393c2d7da03a1523a12e6f80edb6bd7b464ec5.

Reverted https://github.com/pytorch/pytorch/pull/160105 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing internal ROCm build ([comment](https://github.com/pytorch/pytorch/pull/160105#issuecomment-3273297183))
2025-09-10 04:42:28 +00:00
5ac112b569 [dynamo] Graph break on on user-defined class in compiled region (#161670)
Currently, user-defined classes inside of a compiled frame will cause the whole
frame to be skipped by dynamo.  This change defers the Unsupported exception
until the __build_class__ builtin is actually called, which allows a graph break
to be inserted.
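A small sketch of what this enables (the backend and shapes are arbitrary):

```
import torch

@torch.compile(backend="eager")
def f(x):
    # Previously, defining a class inside the compiled frame made dynamo skip
    # the whole frame; now a graph break is inserted at __build_class__ and
    # the surrounding code is still compiled.
    class Scale:
        factor = 2.0

    return x * Scale.factor

print(f(torch.ones(3)))
```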

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161670
Approved by: https://github.com/williamwen42, https://github.com/guilhermeleobas
2025-09-10 04:39:20 +00:00
dda071587f Revert "Make distributed modules importable even when backend not built (#159889)" (#162568)
This reverts commit a0d026688cd69583d5a4e0c6f3e5fda141a7f4a9.

Revert "Always build USE_DISTRIBUTED. (#160449)"

This reverts commit d80297a6846f1f2c36fd4f19e22919f2abe8fcea.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162568
Approved by: https://github.com/huydhn
2025-09-10 04:29:42 +00:00
11acfed3ce [audio hash update] update the pinned audio hash (#162552)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162552
Approved by: https://github.com/pytorchbot
2025-09-10 04:24:39 +00:00
5f40a8a9a3 [BE] Fix '_WIN32' is not defined warning (#162516)
Summary: As it is indeed not defined on either Linux or macOS platforms

Test Plan:
CI

Rollback Plan:

Differential Revision: D82044853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162516
Approved by: https://github.com/Skylion007
2025-09-10 04:21:38 +00:00
e64965300a Repackage vLLM nightlies (#162371)
I suspected that I would need to repack vLLM wheels from https://github.com/pytorch/pytorch/pull/162000 because I renamed the wheel, and it turns out to be true.  The error is as follows:

```
$ uv pip install --pre xformers --index-url https://download.pytorch.org/whl/nightly/cu129
Using Python 3.12.11+meta environment at: venv/py3.12
Resolved 28 packages in 759ms
error: Failed to install: xformers-0.0.33.dev20250901+cu129-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (xformers==0.0.33.dev20250901+cu129)
  Caused by: Wheel version does not match filename: 0.0.33+5d4b92a5.d20250907 != 0.0.33.dev20250901+cu129
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162371
Approved by: https://github.com/atalman
2025-09-10 04:02:34 +00:00
00985970e3 Put torchao (0.13.0) back to benchmark workflow (#162227)
0.13.0 was released on Sep 3rd https://pypi.org/project/torchao/#history, which should have fixed the crashing issue on transformers now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162227
Approved by: https://github.com/malfet
2025-09-10 03:56:25 +00:00
484c4093a8 test fixing benchmarks (#162503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162503
Approved by: https://github.com/huydhn
ghstack dependencies: #160741
2025-09-10 03:15:49 +00:00
760c478a14 [FlexAttn][Minor] Update FlexConfig doc (#162533)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162533
Approved by: https://github.com/drisspg
2025-09-10 02:03:48 +00:00
dc4f97e9c1 [triton] enable int64 indexing in convolution and mm template (#162506)
Summary: We were hitting an illegal memory access issue when compiling conv and addmm kernels with the change in https://github.com/pytorch/pytorch/pull/157767

Differential Revision: D81995664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162506
Approved by: https://github.com/iseeyuan
2025-09-10 01:53:26 +00:00
c66e58b7d0 [ONNX] Expose the testing module (#162495)
* Created a new module `torch/onnx/testing.py` that exposes the `assert_onnx_program` function for testing exported ONNX models.
* Updated the ONNX documentation (`docs/source/onnx.md`) to include `onnx_testing` in the list of relevant modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162495
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2025-09-10 01:40:24 +00:00
878f59ef75 DeviceMesh: support _rank for use with non-global PGs (#162439)
Summary: This adds a `_rank` field to DeviceMesh init that allows for instantiating a DeviceMesh without depending on `dist.get_rank()` which requires a global PG to be instantiated.

Test Plan:
```
buck2 test mode/opt -c fbcode.enable_gpu_sections=true  //caffe2/test/distributed:device_mesh -- init_backend
```

Rollback Plan:

Differential Revision: D81981777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162439
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2025-09-10 01:18:28 +00:00
e60ad4f628 [DTensor] fix copy_ strategy to support linearity (#162460)
Fixing issue introduced in https://github.com/pytorch/pytorch/pull/158538
where `aten.copy_.default` is registered as a pointwise op, but without linearity.

In particular, when both `src` and `dst` tensors have the same `Partial` placements, the copy should happen directly without a redistribute, instead of redistributing both to `Replicate` before making the copy.

This was discovered from silent incorrect results e.g. on `torch.einsum` backward.
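A minimal sketch of the scenario, using a fake process group so it runs in a single process (placements per the description above; shapes are arbitrary):

```
import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor, Partial
from torch.testing._internal.distributed.fake_pg import FakeStore

dist.init_process_group("fake", store=FakeStore(), rank=0, world_size=2)
mesh = torch.distributed.device_mesh.init_device_mesh("cpu", (2,))

src = DTensor.from_local(torch.randn(4, 4), mesh, [Partial()], run_check=False)
dst = DTensor.from_local(torch.zeros(4, 4), mesh, [Partial()], run_check=False)

# With matching Partial placements, the copy should happen on the local shards
# directly instead of redistributing both operands to Replicate first.
dst.copy_(src)
print(dst.placements)

dist.destroy_process_group()
```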

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162460
Approved by: https://github.com/zpcore
2025-09-10 00:47:14 +00:00
2281d009e5 Revert "[ROCm] Add specific compile options for CK SDPA (#161759)"
This reverts commit d22d916719eb7daff8455a01d216d65f81899a9e.

Reverted https://github.com/pytorch/pytorch/pull/161759 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to break internal ROCm jobs ([comment](https://github.com/pytorch/pytorch/pull/161759#issuecomment-3272807726))
2025-09-10 00:44:30 +00:00
33589374b6 [DCP] Avoid multiple storage writer resets in async save (#159448)
Summary: Avoid multiple storage writer resets in async save. Currently the reset gets called by the async_save method and then again in the save method. In the async path, async_save should only do the staging and the reset should only happen in the synchronous save path.

Test Plan:
```
buck test 'fbcode//mode/opt' //aiplatform/modelstore/experimental/DCP/tests:checkpoint_dist_client_test
```
https://www.internalfb.com/intern/testinfra/testrun/15199648841705052

Rollback Plan:

Differential Revision: D79230339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159448
Approved by: https://github.com/meetv18
2025-09-10 00:43:03 +00:00
5539916fe1 [dynamo][refactor] Move get_framelocals_idx to a helper (#162519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162519
Approved by: https://github.com/williamwen42
2025-09-10 00:35:09 +00:00
e4174b1fd7 remove gso from collapse_view_helper (#162212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162212
Approved by: https://github.com/aorenste

Co-authored-by: Aaron Orenstein <aorenste@fb.com>
2025-09-10 00:17:15 +00:00
0e7ccc09db [easy] Don't force copy result of getAllOperatorsFor in init.cpp (#162218)
It returns a const reference to a vector.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162218
Approved by: https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219, #162220
2025-09-10 00:08:15 +00:00
87cc126457 [associative_scan] partial gradient support (#162388)
This PR tests the partial gradient support of the `associative_scan` operation. It replaces https://github.com/bohnstingl/pytorch/pull/6

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162388
Approved by: https://github.com/ydwu4
2025-09-09 23:52:29 +00:00
a3e26d1727 Revert "[dynamo] Graph break on on user-defined class in compiled region (#161670)"
This reverts commit e2545487de3dbbe663e3f0adb699547a14da0f6a.

Reverted https://github.com/pytorch/pytorch/pull/161670 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing a trunk test ([comment](https://github.com/pytorch/pytorch/pull/161670#issuecomment-3272626391))
2025-09-09 23:40:26 +00:00
d2393c2d7d [ROCm] Integrate AITER Fav3 fwd kernels (#160105)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160105
Approved by: https://github.com/jeffdaily
2025-09-09 22:30:12 +00:00
b498299953 154849 Add support to handle SIGUSR1 and SIGUSR2 in multiprocessing (#160690)
Fixes #154849

This change addresses the request to add support for SIGUSR1 and SIGUSR2 signals in torchrun for SLURM environments. The change supports these signals through the configurable `TORCHELASTIC_SIGNALS_TO_HANDLE` environment variable and the signals_to_handle parameter of the launcher API.

Tests:
For validation purposes:
- test_signal_handling.py
- simple_test_api_signal_handling.py

Unit tests:
- for launcher changes: launcher/test_api.py
- for api changes: multiprocessing/test_api.py

E2E: test_run.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160690
Approved by: https://github.com/fduwjj
2025-09-09 22:23:06 +00:00
4d66a3b894 fix Dtensor doc link (#162494)
Small fix for https://docs.pytorch.org/docs/main/distributed.tensor.parallel.html
<img width="890" height="274" alt="image" src="https://github.com/user-attachments/assets/6ee7fc7c-e0fe-4f5e-ab7e-a895bb3fa79f" />

now it is:

<img width="909" height="320" alt="image" src="https://github.com/user-attachments/assets/8b2c41ef-1684-4597-8dae-144b49723796" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162494
Approved by: https://github.com/XilunWu
2025-09-09 22:10:37 +00:00
e2545487de [dynamo] Graph break on on user-defined class in compiled region (#161670)
Currently, user-defined classes inside of a compiled frame will cause the whole
frame to be skipped by dynamo.  This change defers the Unsupported exception
until the __build_class__ builtin is actually called, which allows a graph break
to be inserted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161670
Approved by: https://github.com/williamwen42, https://github.com/guilhermeleobas
2025-09-09 21:07:49 +00:00
8922bbcaab Use same NVSHMEM version across CUDA builds (#162206)
#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20.
This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162206
Approved by: https://github.com/tinglvv, https://github.com/Skylion007
2025-09-09 20:59:50 +00:00
14744e1ab2 [Release 2.9] Add compatibility matrix, Version Bump (#162526)
Release 2.9
1. Add release compatibility matrix
2. Add version bump for 2.10
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162526
Approved by: https://github.com/malfet
2025-09-09 20:38:15 +00:00
b477fb106f [ROCm] enable grouped gemm fallback (#162419)
Enables bf16 group gemm alternative path as described in #161366
Fast path will be enabled in future through CK integration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162419
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-09 20:04:56 +00:00
d22d916719 [ROCm] Add specific compile options for CK SDPA (#161759)
Updates CK version and adds CK specific compilation options

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161759
Approved by: https://github.com/jeffdaily
2025-09-09 20:04:19 +00:00
86d34a43f5 NamedTuple: Allow side effects for dynamic attributes (#161645)
I confirmed that the tracing was correct i.e. NamedTupleVariable had the correct dynamic attribute added to it.

The problem was that NamedTupleVariable was always marked as immutable. This does not reflect the behavior of namedtuple.

Subclasses of namedtuple may be mutable, so when a NamedTupleVariable is derived from a subclass that is mutable, I made NamedTupleVariable mutable as well. Then side_effects correctly updates the returned object.
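A minimal sketch of the kind of code this enables (the class and attribute names are made up):

```
import collections
import torch

class Point(collections.namedtuple("PointBase", ["x", "y"])):
    # A namedtuple *subclass*: instances can take dynamic attributes,
    # so dynamo now treats the variable as mutable.
    pass

@torch.compile(backend="eager")
def f(p):
    p.total = p.x + p.y  # side effect on a dynamic attribute
    return p.total * 2

p = Point(torch.ones(2), torch.ones(2))
out = f(p)
print(p.total)  # the mutation made inside the compiled region is visible here
```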

Fixes #161610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161645
Approved by: https://github.com/anijain2305, https://github.com/StrongerXi
2025-09-09 19:42:02 +00:00
8508651477 Fix flaky AOTFxirTestCase (#162472)
Fixes https://github.com/pytorch/pytorch/issues/162357
Fixes https://github.com/pytorch/pytorch/issues/160970
Fixes https://github.com/pytorch/pytorch/issues/161038
Fixes https://github.com/pytorch/pytorch/issues/160951
Fixes https://github.com/pytorch/pytorch/issues/161698

These tests were introduced in https://github.com/pytorch/pytorch/pull/160765 and they are all flaky when `torch._inductor.aot_compile` uses multiple threads (the default option).  The issue could be reproduced by running them locally multiple times.  For example,

```
pytest --flake-runs 10 --flake-finder -v inductor/test_fxir_backend.py -k test_aoti_fx_add
(output logs at P1938386961)
...
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)]
graph_break []
================================================================================================================================================= short test summary info ==================================================================================================================================================
FAILED [0.4834s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__'
FAILED [0.4576s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__'
FAILED [0.4613s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__'
=============================================================================================================================================== 3 failed, 7 passed in 12.89s ===============================================================================================================================================
```

Setting `compile_threads` to 1 will get rid of the test flakiness, but there might be underlying issues from https://github.com/pytorch/pytorch/pull/160765.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162472
Approved by: https://github.com/angelayi, https://github.com/Skylion007
2025-09-09 19:39:24 +00:00
723c27ed78 [standalone_compile] binary format write should be atomic (#162432)
We update it to call write_atomic instead of file.write
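For context, the write-then-rename pattern this relies on looks roughly like the generic sketch below (not the actual inductor helper):

```
import os
import tempfile

def write_atomic(path: str, data: bytes) -> None:
    # Write to a temporary file in the same directory, then atomically replace
    # the destination, so readers never observe a partially written file.
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```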

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162432
Approved by: https://github.com/oulgen
2025-09-09 18:43:13 +00:00
bdbe931d58 [build] Add LeakSanitizer option to CMake (#158686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158686
Approved by: https://github.com/eellison
2025-09-09 18:41:20 +00:00
af60398c3a Update the operator benchmarking, to benchmark using torch.compile (#161394)
This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to the existing Eager and JIT modes. It also adds peak memory measurement (fwd/bwd pass), improves the JSON output format to be consumed by the dashboard for reporting, and introduces some more CLI options. The new CLI flags introduced are:

- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name

Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser
2025-09-09 18:17:37 +00:00
82f1eb9b03 Revert "[MPS] mps sparse mul op implementation (#162349)"
This reverts commit 3ea686804925f1291de57ffdb3394da0b46deb54.

Reverted https://github.com/pytorch/pytorch/pull/162349 on behalf of https://github.com/malfet due to Fails trunk tests, with uint8 sum ([comment](https://github.com/pytorch/pytorch/pull/162349#issuecomment-3271783442))
2025-09-09 18:14:16 +00:00
4b2d297eec python fastpath for DTensor detach(), confirm that aliasing DTensorSpec is ok (#160580)
My goal right now is to try to make the "vanilla" AccumulateGrad path for DTensor (that just calls detach) fast. I'm doing this in two steps:

(1) [this PR]: hardcode aten.detach in DTensor to re-use the input tensor's DTensorSpec, instead of running "real" sharding prop.

(2) [assuming success of 1]: move the detach() call into C++, try adding a DTensor dispatch key, and avoid dispatching back to python entirely (except for some code that probably needs to allocate a pyobject for the output DTensor, from C++)

I'm pushing this PR first to confirm that I don't break anything with my detach fastpath. I did some manual local testing to confirm that for normal usages of detach, the input and output DTensor have equal DTensorSpec objects. Technically, we previously would allocate a fresh DTensorSpec, and with this change we are just re-using the input tensor's DTensorSpec. So I'm mostly hoping that DTensorSpecs don't generally get mutated

This by itself does seem to speed up `alias` by quite a bit (roughly 2.5x speedup, from ~336us -> 133us):

**aten.detach(plain_tensor)**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f8da2921790>
_ = x.detach()
  4.80 us
  1 measurement, 100000 runs , 1 thread
```

**aten.detach(DTensor) [before this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f47cd68e750>
_ = x_dt.detach()
  336.40 us
  1 measurement, 1000 runs , 1 thread
```

**aten.detach(DTensor) [after this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f0a34c05520>
_ = x_dt.detach()
  Median: 133.45 us
  2 measurements, 1000 runs per measurement, 1 thread
```

benchmark script:
```
import torch
import torch.distributed as dist
from torch.distributed.tensor import DeviceMesh, DTensor, Partial, Replicate, Shard
from torch.testing._internal.distributed.fake_pg import FakeStore
import torch.utils.benchmark as benchmark

fake_store = FakeStore()
dist.init_process_group("fake", store=fake_store, rank=0, world_size=2)

mesh = torch.distributed.device_mesh.init_device_mesh('cuda', (2,))
x = torch.randn(4, 4, requires_grad=True)
x_dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)

t0 = benchmark.Timer(
    stmt='_ = x_dt.detach()',
    globals={'x_dt': x_dt},
)
print(t0.blocked_autorange())

dist.destroy_process_group()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160580
Approved by: https://github.com/ezyang
2025-09-09 18:04:56 +00:00
0ec723acd0 Update docs for quantile to be clearer for nearest (#162423)
Correct the rounding scheme for nearest in quantile.
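For illustration, a small runnable example of the `interpolation='nearest'` behavior the doc change describes (values chosen so the rounding is unambiguous):

```python
import torch

x = torch.tensor([0., 1., 2., 3.])
# The fractional index is q * (n - 1) = 0.4 * 3 = 1.2; with 'nearest' it is
# rounded to index 1, so the result is 1.0. The precise rounding scheme is
# what the doc update clarifies.
print(torch.quantile(x, 0.4, interpolation='nearest'))  # tensor(1.)
```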

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162423
Approved by: https://github.com/soulitzer
2025-09-09 18:04:12 +00:00
e1be887870 [PP] Add spacing to visualizer (#160474)
When visualizing the schedules using `_PipelineScheduleExecution`, we don't provide any spacing between dependencies, so when visualizing `DualPipeV` it looks like this:

<img width="3168" height="486" alt="image" src="https://github.com/user-attachments/assets/d2c881ad-4ee0-46b6-ac03-13e5600b5a55" />

While it has the correct order of operations, it does not show the dependencies correctly. As shown in the original implementation, it should look something like this:

<img width="3542" height="384" alt="image" src="https://github.com/user-attachments/assets/c930fa98-848e-4951-a58b-c81f41092d14" />

This PR adds an option to add spacing to the visualizer, so it is easier to see dependencies. After the change:

<img width="3633" height="486" alt="image" src="https://github.com/user-attachments/assets/7708367e-bdb4-46e8-a7c4-f19e18047f59" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160474
Approved by: https://github.com/fegin
2025-09-09 17:52:52 +00:00
d91eecc9a5 [inductor][template heuristics] don't take layout to generate choices (#162238)
# why

- unnecessary as we only ever need to know the dtype and maybe the
  device
- we already take in the kernel inputs which have the device
- enable us to specify the layout after finding all the configs
  but before generating the ChoiceCallers

# what

- replace all calls in template_heuristics that used to take Layout
  with now just taking out_dtype

# testing

ci

Differential Revision: [D81820115](https://our.internmc.facebook.com/intern/diff/D81820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162238
Approved by: https://github.com/eellison
ghstack dependencies: #161347, #161348, #161349
2025-09-09 17:17:04 +00:00
24a4dae85b [inductor] V.choices.get_mm_configs override point (#161349)
# why

- enable us to override the default configs, or fall back to them
  through subclassing InductorChoices

# what

- override (private) function
- default implementation takes the kernel template choice (ktc)
  generator for every template and just executes the generator
- future overrides can decide to replace those generators, or filter
  out choices

- the 2nd expensive step (maybe_append_choices, choice_or_none) is
  handled outside this function, in the main V.choices.get_mm_configs
  this means that any overriding benefits from not generating expensive
  templates that aren't going to be used

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520570](https://our.internmc.facebook.com/intern/diff/D81520570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161349
Approved by: https://github.com/eellison
ghstack dependencies: #161347, #161348
2025-09-09 17:17:04 +00:00
d3c4cf838e [inductor][ez] V.choices.get_mm_configs returns list of ChoiceCallers (#161348)
# why

- every callsite just executes the generator on the spot
- previous pr adds the ability to add an override before expensive
  generators are executed, so we don't need this generator anymore

# what

- rather than yielding the ChoiceCaller, just return the list of all
  valid ChoiceCallers

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520574](https://our.internmc.facebook.com/intern/diff/D81520574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161348
Approved by: https://github.com/eellison
ghstack dependencies: #161347
2025-09-09 17:16:57 +00:00
b1e99c8c7a [inductor] add kernel template choice (ktc) (#161347)
# why

- gather everything up to make choices, without running
  potentially expensive generators
- enables overrides where we toss the entire list of configs
  from inductor, without having to enumerate it (expensive)

# what

- add a holding class that just gets all the components necessary
  to generate a ChoiceCaller
- use that class to generate ChoiceCallers
- this does not (yet) add the override function, but just prepares
  the scene

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161347
Approved by: https://github.com/eellison
2025-09-09 17:16:50 +00:00
5eb35d2ab8 [CUDA][float8][TF32] Disable tf32 for vs. emulated rowwise comparison (#162387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162387
Approved by: https://github.com/Skylion007
2025-09-09 17:04:06 +00:00
f03d635dc6 [ROCm][CI] skip test_max_autotune until resolved (#162496)
many tests taking >30 min and causing timeouts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162496
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-09 16:34:01 +00:00
1f0b01d4b6 [ROCm] OffsetCalc Unroll Optimization (#161700)
Our compiler is generating inefficient code for the offsetCalc in certain situations.
The root-cause for this needs to be identified. For now specialized unrolling based on 'dims' notably helps perf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161700
Approved by: https://github.com/jeffdaily
2025-09-09 16:11:48 +00:00
c0142f5c06 [ROCm] Enabling several UTs (#161715)
All these UTs are working as is, just removing the skip
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@fb.com>
2025-09-09 15:49:21 +00:00
3ea6868049 [MPS] mps sparse mul op implementation (#162349)
Implements mps sparse mul operation as well as enables other operations such as:
1. copy_
2. div
3. sum
4. floor
5. power
6. sub
7. floor_divide

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162349
Approved by: https://github.com/pearu, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-09 15:45:37 +00:00
be3b8d2ec9 [ROCm][CI] update fbgemm nightly benchmark hash (#162385)
fbgemm_gpu was failing to clone due to missing submodule commit.
```
+ pushd fbgemm/fbgemm_gpu
~/pytorch/fbgemm/fbgemm_gpu ~/pytorch
+ git checkout 7f1de94a4c2d14f59ad4ca84538c36084ea6b2c8 --recurse-submodules
fatal: failed to unpack tree object b1281b8b08d973a7064f864f47eeb30f3e2596e9
error: Submodule 'external/composable_kernel' could not be updated.
error: Cannot update submodule:
	external/composable_kernel
```
Log File
[inductor-periodic · pytorch/pytorch@5babb4d](https://github.com/pytorch/pytorch/actions/runs/17536630806/job/49802458834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162385
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-09 15:44:39 +00:00
5ccf3ca3ec Revert "Use same NVSHMEM version across CUDA builds (#162206)"
This reverts commit 0d9c95cd7ee299e2e8c09df26d395be8775b506b.

Reverted https://github.com/pytorch/pytorch/pull/162206 on behalf of https://github.com/malfet due to Broke lint, see 4dd73e659a/1 ([comment](https://github.com/pytorch/pytorch/pull/162206#issuecomment-3271040521))
2025-09-09 14:40:45 +00:00
e38e953432 CUDA 13.0 Windows Nvidia Driver Update to 580.88 (#162425)
Related to https://github.com/pytorch/pytorch/issues/162333
https://github.com/pytorch/pytorch/issues/159779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162425
Approved by: https://github.com/tinglvv, https://github.com/malfet
2025-09-09 14:40:34 +00:00
4dd73e659a Revert "fix torch.sparse.log_softmax on CPU (#161959)"
This reverts commit 002e59440afe8711019e68df500f5e18b9a43f3c.

Reverted https://github.com/pytorch/pytorch/pull/161959 on behalf of https://github.com/davidberard98 due to test failure: test_sparse.py::TestSparseMPS::test_log_softmax_float_mps_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/17573794461/job/49915138287) [HUD commit link](002e59440a) ([comment](https://github.com/pytorch/pytorch/pull/161959#issuecomment-3270509418))
2025-09-09 12:33:25 +00:00
0d9c95cd7e Use same NVSHMEM version across CUDA builds (#162206)
#161321 bumped NVSHMEM version to 3.3.24 for CUDA 13, leaving CUDA 12 with 3.3.20.
This PR bumps the NVSHMEM version to 3.3.24 for CUDA 12 as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162206
Approved by: https://github.com/tinglvv, https://github.com/Skylion007
2025-09-09 08:52:27 +00:00
dcc42e95f4 Fix missing moves in initJITBindings (#162428)
Per @Skylion007 on #162219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162428
Approved by: https://github.com/Skylion007
2025-09-09 08:47:33 +00:00
002e59440a fix torch.sparse.log_softmax on CPU (#161959)
Fix https://github.com/pytorch/pytorch/issues/152293.

**Example:**
```
import torch
from torch.sparse import log_softmax as sparse_log_softmax

def test_bug():
    a = torch.rand(4, 3)
    b = a - 10000000.0
    b_sparse = b.to_sparse()

    cpu_out_sparse = sparse_log_softmax(b_sparse, dim=1).to_dense()
    print('cpu_out_sparse =', cpu_out_sparse)

    b_sparse_double = b.double().to_sparse()
    cpu_out_sparse_double = sparse_log_softmax(b_sparse_double, dim=1).to_dense()
    print('cpu_out_sparse_double =', cpu_out_sparse_double)

if __name__ == '__main__':
    test_bug()
```

**Output:**

- before
```
cpu_out_sparse = tensor([[-2., -1., -2.],
        [-1., -1., -1.],
        [-1., -2., -2.],
        [-1., -1., -2.]])
cpu_out_sparse_double = tensor([[-1.5514, -0.5514, -1.5514],
        [-1.0986, -1.0986, -1.0986],
        [-0.5514, -1.5514, -1.5514],
        [-0.8620, -0.8620, -1.8620]], dtype=torch.float64)
```

- after
```
cpu_out_sparse = tensor([[-0.8620, -1.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986],
        [-1.8620, -0.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986]])
cpu_out_sparse_double = tensor([[-0.8620, -1.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986],
        [-1.8620, -0.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986]], dtype=torch.float64)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161959
Approved by: https://github.com/Skylion007
2025-09-09 06:25:16 +00:00
4840a1a591 Run vLLM tests on all trunk commits before 2.9 branch cut (#161797)
This makes it easier to bisect issue now given that we don't have lots of time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161797
Approved by: https://github.com/yangw-dev
2025-09-09 05:56:41 +00:00
d49205fe1f Add more tests for vllm and clean out the old vllm test (#162292)
Test failure coverage from pytorch 2.8 release issues
[internal access only](https://docs.google.com/document/d/1zvK1eUAHubHGGHg9jKxd-QlP89fzgfqOBvE2m9mUs90/edit?tab=t.0)

See coverage mapping
| Given test / pattern | Suite ID (from config) |
|---|---|
| pytest -v -s basic_correctness/test_cumem.py | vllm_basic_correctness_test |
| pytest -v -s entrypoints/openai/test_sleep.py | vllm_entrypoints_test |
| pytest -v -s entrypoints/openai/test_translation_validation.py::test_long_audio_request | vllm_entrypoints_test |
| pytest -v -s lora/test_quant_model.py | vllm_lora_28_failure_test |
| pytest -v -s -x tests/lora/test_llama_tp.py | vllm_lora_tp_test_distributed |
| pytest -v -s distributed/test_sequence_parallel.py -k test_tp_sp_generation |vllm_distributed_test_28_failure_test |
| pytest -v -s distributed/test_sequence_parallel.py::test_tp_sp_generation[...] | vllm_distributed_test_28_failure_test |
| pytest models/language/generation/test_mistral.py::test_models[...] | vllm_languagde_model_test_extended_generation_28_failure_test |
| pytest models/multimodal/pooling/test_jinavl_reranker.py::test_model_text_image[...] | vllm_multi_model_test_28_failure_test |
| tests/lora/test_qwen2vl.py::test_qwen2vl_lora | vllm_lora_test |
| tests/lora/test_qwen2vl.py::test_qwen25vl_lora | vllm_lora_test |
| tests/lora/test_qwen2vl.py::test_qwen2vl_lora_beam_search | vllm_lora_test |
| tests/lora/test_phi.py::test_phi2_lora | DIDN'T FIND IT IT IN VLLM |
| models/multimodal/generation/test_voxtral.py::test_models_with_multiple_audios[5-128-half] | vllm_multi_model_test_28_failure_test |
| models/test_initialization.py::test_can_initialize[VoxtralForConditionalGeneration] | vllm_basic_models_test |
| pytest -v -s -x lora/test_chatglm3_tp.py -k test_chatglm3_lora_tp4_fully_sharded_loras | vllm_lora_tp_test_distributed |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162292
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-09-09 05:53:46 +00:00
d85392a88e Add BundledAOTAutogradSerializableCallable (#162170)
This PR hooks up the python wrapper inductor backend to aot_compile. This is *not* the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now.

In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162170
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162169
2025-09-09 05:42:19 +00:00
7feb8fc589 [SymmMEM] Allow to import _SymmetricMemory when NVSHMEM is not available (#162142)
Summary:
As we have multiple backends, _SymmetricMemory should not be imported together with NVSHMEM related modules

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162142
Approved by: https://github.com/dcci, https://github.com/kwen2501
2025-09-09 05:37:43 +00:00
60d009267e Revert "testing infra and some fixes (#162183)"
This reverts commit d8b6622bb6a3879d3832ab6cdc26ff4188ea4a2d.

Reverted https://github.com/pytorch/pytorch/pull/162183 on behalf of https://github.com/huydhn due to Failing a test on macos ([comment](https://github.com/pytorch/pytorch/pull/162183#issuecomment-3268922096))
2025-09-09 05:26:32 +00:00
4590438329 [fx] fix qualified name for methods of torch.Tensor (#162407)
This fixes an error in the previous PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162407
Approved by: https://github.com/ezyang, https://github.com/XuehaiPan
2025-09-09 05:14:43 +00:00
8494afb837 Add missing fstream include to fix std::ofstream compilation error (#162421)
## Summary
This PR adds a missing `#include <fstream>` to fix a compilation error that occurred with the clang compiler on the standard *Google internal compile setup* (built with bazel).

## Details
The `std::ofstream` type was implicitly instantiated, which can cause compilation to fail with certain compilers. In this case, the clang compiler within the Google internal compile setup failed with an implicit instantiation error of `std::basic_ofstream<char>`. By explicitly including the `<fstream>` header, this PR resolves the error and ensures proper compilation in a wider range of setups and compilers.

## Error message:
```
torch/csrc/distributed/c10d/FlightRecorder.cpp:8:17: error: implicit instantiation of undefined template 'std::basic_ofstream<char>'
8 | std::ofstream file(filename_, std::ios::binary);
| ^
libcxx/include/__fwd/fstream.h:26:7: note: template is declared here
26 | class basic_ofstream;
| ^
1 error generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162421
Approved by: https://github.com/ezyang
2025-09-09 05:14:32 +00:00
7ad40de60e [audio hash update] update the pinned audio hash (#162437)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162437
Approved by: https://github.com/pytorchbot
2025-09-09 04:41:34 +00:00
607327beae [vllm hash update] update the pinned vllm hash (#162356)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162356
Approved by: https://github.com/pytorchbot
2025-09-09 04:40:25 +00:00
f216d64bfe [SymmMem] Better tuning of A2AV based on accurate node boundary (#162003)
Use `world_within_direct_access()` to distinguish intra- vs inter- node.
Previously we assumed a fixed node size of 8, which is not true for NVL72.

Also added env var `TORCH_SYMMMEM_NBLOCKS` for control.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162003
Approved by: https://github.com/ngimel, https://github.com/fduwjj
2025-09-09 04:18:17 +00:00
847d7f21af [CUDA-13] Implement workaround for cudaErrorNotSupported (#162412)
See https://github.com/pytorch/pytorch/issues/162333#issuecomment-3267929585
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162412
Approved by: https://github.com/eqy, https://github.com/atalman
2025-09-09 04:12:10 +00:00
065c446193 [SymmMem] Use global pe for put and get (#162394)
NVSHMEM put/get APIs take a global PE instead of the local counterpart, so we need to do a translation within the kernel.

Also added a sub-group test for dispatch and combine mimic'ing the Expert Parallel cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162394
Approved by: https://github.com/ngimel, https://github.com/fegin
ghstack dependencies: #162320
2025-09-09 03:58:48 +00:00
98ecc0f374 [SymmMem] Add team pool to hold duplicated teams for the same rank group (#162320)
When multiple threadblocks call device-side collectives concurrently, NVSHMEM requires each call being made on a separate team struct, see [Collective operations scopes and active sets](https://docs.nvidia.com/nvshmem/api/gen/api/collectives.html?highlight=nvshmem_barrier_all#collective-operations-scopes-and-active-sets).

This PR adds a util `get_n_teams` for creating duplicated nvshmem teams for the same rank group, i.e. a team pool, so that we can use them on the device side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162320
Approved by: https://github.com/ngimel
2025-09-09 03:58:48 +00:00
4c45090cf7 [DTensor] Check if tracing for sharding propagation to handle unhashable keys (#160798)
Fixes #159590

This is similar to the reverted commit #156868, except it resolves an issue with two caches becoming misaligned, leading to incorrect objects for stateful placements (i.e. `_MaskPartial`) as in issue #159601. This adds little to no overhead in eager ([see past benchmarks](https://github.com/pytorch/pytorch/pull/156868#issuecomment-3047831149)).

This also handles cases such as #159590 where dynamo is disabled during tracing, by entering the Python Dispatcher ahead of the sharding propagation during compile. Tests are added/modified to handle these cases, and the list/tuple inputs with the cat op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160798
Approved by: https://github.com/bdhirsh
2025-09-09 03:52:05 +00:00
1641606aa4 Revert "Add BundledAOTAutogradSerializableCallable (#162170)"
This reverts commit 5babb4d5c04b1ff7ed5f96f7aea1898cd4faef5a.

Reverted https://github.com/pytorch/pytorch/pull/162170 on behalf of https://github.com/huydhn due to This PR has a merge conflict with D81793200 on aot_compile.py where PRs and diffs are landed in reverted order ([comment](https://github.com/pytorch/pytorch/pull/162170#issuecomment-3268735428))
2025-09-09 03:33:36 +00:00
7b8a64557d [inductor] fix 3d tiled online softmax (#162341)
The online_softmax_reduce runtime helper previously assumed the input tl.Tensors were 2D. But with tiled reduction, they can be 3D (y, x, r).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162341
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162311
2025-09-09 02:59:52 +00:00
d8b6622bb6 testing infra and some fixes (#162183)
This PR is quite large in that it covers most of the rough edges in the new strict export flow:

1. Handle nn_module_stack correctly now that we are tracing wrapper module
2. module_call_spec needs to get queried from source directly because we are not running the bytecode anymore.
3. Correct input and output handling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162183
Approved by: https://github.com/zhxchen17
ghstack dependencies: #162167
2025-09-09 02:42:11 +00:00
a965f09793 [export] Update PT2 archive docs (#162308)
Summary: Minor updates based on the recent refactoring for weight saving and loading

Test Plan:
doc change only

Rollback Plan:

Differential Revision: D81821994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162308
Approved by: https://github.com/angelayi
2025-09-09 02:08:13 +00:00
583bbf7761 [MPS] Add native_dropout and native_dropout_backward (#162108)
Fixes #162002
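A minimal usage sketch; calling the aten op directly here is only for illustration (it returns both the output and the dropout mask):

```python
import torch

# native_dropout returns (output, mask); this PR adds a native MPS kernel,
# but the op itself is device-generic, so fall back to CPU when MPS is absent.
device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(4, 4, device=device)
out, mask = torch.ops.aten.native_dropout(x, 0.5, True)
print(out.shape, mask.dtype)
```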
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162108
Approved by: https://github.com/malfet
2025-09-09 01:44:06 +00:00
e025c0f459 Dynamo: set_eval_frame microoptimization (#162220)
Optimize for common case and remove a pair of refcount operations (see new comments.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162220
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #161591, #161595, #161633, #161634, #161692, #162219
2025-09-09 01:10:06 +00:00
a8a187b2cf Overload _get_operation_for_overload_or_packet & friends to accept ArrayRef (#162219)
Avoids requiring vector allocation to call this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162219
Approved by: https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634, #161692
2025-09-09 01:10:06 +00:00
12db2a7889 Call checkLong in is_int_or_symint, completing TODO (#161692)
Calling this first minimizes overhead for plain old ints, making cheap things cheap.

Differential Revision: [D81530098](https://our.internmc.facebook.com/intern/diff/D81530098)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161692
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #161591, #161595, #161633, #161634
2025-09-09 01:10:06 +00:00
eab2afeff7 fastpath type Tensor in THPVariable_NewWithVar (#161634)
It is cheap to do an exact check against Tensor and much faster when it works (PyType_IsSubtype does not have this fastpath, I checked [source](9ee0214b5d/Objects/typeobject.c (L2889))). Spot-checked in perf on detach-DTensor-in-a-loop benchmark; small win but clear.

Differential Revision: [D81530101](https://our.internmc.facebook.com/intern/diff/D81530101)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161634
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #161591, #161595, #161633
2025-09-09 01:10:06 +00:00
a951f435fd Avoid redundant PyTuple_GetSize call in _maybe_handle_torch_function (#161633)
py::args::size() calls PyTuple_GetSize. Compiler can't know the two calls will always return the same result, so we have to consolidate them ourselves.

Differential Revision: [D81530096](https://our.internmc.facebook.com/intern/diff/D81530096)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161633
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #161591, #161595
2025-09-09 01:10:06 +00:00
6eb14ac60f [Inductor] Fix cross-device scalar lowering - cpu scalar with cuda tensor fails in torch.compile (#161447)
This PR fixes bug in TorchInductor where cross-device scalar indexing fails during compilation, causing discrepancies from eager mode behavior.

Fixes: #140457
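A hedged repro sketch of the class of bug described above; the exact failing pattern is in the linked issue, this just mixes a CPU scalar tensor with a CUDA tensor under torch.compile (assumes a CUDA device is available):

```python
import torch

@torch.compile
def f(x, s):
    return x[s]  # index a CUDA tensor with a 0-dim CPU tensor

x = torch.arange(8, device="cuda")
s = torch.tensor(3)  # CPU scalar tensor
print(f(x, s))  # matches eager behavior after the fix
```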

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161447
Approved by: https://github.com/mlazos
2025-09-09 01:07:02 +00:00
ed77e23b68 Revert "[dynamo] Constant fold torch.autograd._profiler_enabled (#158482)"
This reverts commit d7e1b8b11d7430c7633dcad6f6596b5df8fa02f7.

Reverted https://github.com/pytorch/pytorch/pull/158482 on behalf of https://github.com/borgstrom due to NCCL hangs in S560336 ([comment](https://github.com/pytorch/pytorch/pull/158482#issuecomment-3268426781))
2025-09-09 00:21:05 +00:00
897c4e70a7 Move to small wheel approach for CUDA SBSA wheel (#160720)
https://github.com/pytorch/pytorch/issues/160673

Use download.pytorch.org's dependencies like x86 build instead of bundling libs into the wheel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160720
Approved by: https://github.com/atalman
2025-09-09 00:18:43 +00:00
8485aac873 [precompile] Fix inlined source tracking with generators. (#162389)
Summary:
When compiled code has a generator, code.co_firstlineno will be inconsistent with the result from inspect.getsource, which returns the toplevel enclosing code source rather than the inner code location.

In this case, it seems simpler to just use the toplevel enclosing code location rather than the co_firstlineno field.

Test Plan:
test_package.py -k test_code_with_generator

Rollback Plan:

Differential Revision: D81929751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162389
Approved by: https://github.com/dolpm, https://github.com/hrithick-codes
2025-09-09 00:13:54 +00:00
c0fc86b511 Fix aarch64 wheel pack (#159481)
PR that introduced the change: https://github.com/pytorch/builder/pull/1775
Use wheel pack instead of zip to repack the wheel.
It should regenerate the RECORD file and update all the hashes correctly.

TODO:
Apply wheel pack instead of zip to the rest of the builds
Add validation test to make sure wheel contents matches RECORD file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159481
Approved by: https://github.com/malfet
2025-09-08 23:36:50 +00:00
07f07309c6 [associative_scan] Autograd separated (#139939)
This PR implements the Autograd feature of the associative_scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139939
Approved by: https://github.com/huydhn
2025-09-08 23:30:11 +00:00
189a054cfb Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. [attempt2] (#160869)
[relanding again after fixing internal build]
Summary:
This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous(),
but we want to find those call sites and handle this properly by explicitly calling is_contiguous_or_false() rather than is_contiguous() when appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context:

we defined in this https://github.com/pytorch/pytorch/pull/157472 sym_is_contiguous to be the function computing contiguity for dynamic shapes in c++. It returns a symbolic expression that represents contiguity and guaranteed not to throw a DDE.

when people call is_contiguous we do sym_is_contiguous().guard_bool()
when people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false()

one issue not handled well was this path
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }

  return sym_is_contiguous_default(memory_format);
}
```
namely, if we call sym_is_contiguous_custom and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, then we used to call is_contiguous(this, memory_format);

This used to go through load_pyobj_interpreter and end up calling the Python is_contiguous, which used implicit size-oblivious reasoning.
Once we removed that implicit size-oblivious reasoning, the right thing to do is to call
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get a DDE even if the caller is doing sym_is_contiguous.

so I had to define it for pyinterpreter, and then I had to override it for nested tensors.

Approved by: https://github.com/ezyang

Test Plan:
contbuild & OSS CI, see e444cd24d4

Rollback Plan:

Differential Revision: D80435179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869
Approved by: https://github.com/ezyang
2025-09-08 22:59:13 +00:00
5fd6b6a2db [refactor] add helper sizevars function, is_size_one, for size==1 checks (#162189)
## Summary
- document guard behavior in `SizeVarAllocator.is_size_one`
- use `is_size_one` for broadcast/expand checks.
- This diff is a no-op since we'd use `shape_env.evaluate_expr(... fallback_value=False)`

a4f9132a17/torch/_inductor/sizevars.py (L450-L453)

------
https://chatgpt.com/codex/tasks/task_e_68b8d0d1f2c48328b2d38c00e738bc8c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162189
Approved by: https://github.com/laithsakka
2025-09-08 22:48:16 +00:00
ac9ccd0dc2 Add return-max-scores to flex-attention (#161667)
# Summary

### Update

API

```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False

class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None

  out_only = flex_attention(query, key, value, score_mod)
  out_max, aux_max = flex_attention(
      query,
      key,
      value,
      score_mod,
      return_aux=FlexAttentionAuxRequest(max_scores=True),
  )
  out_both, aux_both = flex_attention(
      query,
      key,
      value,
      score_mod,
      return_aux=FlexAttentionAuxRequest(lse=True, max_scores=True),
        )
```

Returns the max post mod scores from flex attention.

Not being able to break BC is kind of annoying here, since we end up with a combinatorial problem: if we need to add any more return vals, we need new kwargs that gate whether they get returned by the function, and we need to support the 2**N possible return groups from the additional args.

Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. For now I made the new argument keyword-only.

Maybe we make a `ExtraReturns` type kwarg that can grow and we don't need to keep adding new top level args.

We could also return a Struct that holds all the extra tensors and start deprecation cycle for logsumexp eventually returning just 1 `ExtraReturns` like struct with the tensors.

### Req Grad
I currently don't return a max_scores that supports backpropagating grads. I think this might be feasible, but since max is essentially one-hot on the inputs and a reduction, we would either need to save another `max_location` from the forward, or find the max_score but only apply the gradient to the first occurrence if there are multiple equivalent scores (need to check whether that's what we define for the vanilla max op in torch).

For now no grad, we can re-visit if needed.

## Perf
I am going to disable this for flex_decode, since at least initially the motivation is training. It is also harder than it should be to have ops return Nones or optional tensors; if return_max is false, we should probably just create a tensor of size zero so that we don't slow down the hot path.

```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,     ┆ 249.514658    ┆ 243.078974   ┆ 6.435684  ┆ 2.647569  │
│                ┆                ┆ 2048, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 57.971274     ┆ 56.633641    ┆ 1.337633  ┆ 2.361905  │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 280.71254     ┆ 275.686991   ┆ 5.025549  ┆ 1.822918  │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,    ┆ 152.970031    ┆ 150.489109   ┆ 2.480923  ┆ 1.648573  │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘

🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)  ┆ TFlops (base) ┆ TFlops (max) ┆ delta    ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                    ┆ ---           ┆ ---          ┆ ---      ┆ ---       │
│ str            ┆ str            ┆ str                    ┆ f64           ┆ f64          ┆ f64      ┆ f64       │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,      ┆ 249.514658    ┆ 243.078974   ┆ 6.435684 ┆ 2.647569  │
│                ┆                ┆ 2048, 64)              ┆               ┆              ┆          ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 57.971274     ┆ 56.633641    ┆ 1.337633 ┆ 2.361905  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 280.71254     ┆ 275.686991   ┆ 5.025549 ┆ 1.822918  │
│                ┆                ┆ 1024, 128)             ┆               ┆              ┆          ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,     ┆ 152.970031    ┆ 150.489109   ┆ 2.480923 ┆ 1.648573  │
│                ┆                ┆ 16384, 64)             ┆               ┆              ┆          ┆           │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,      ┆ 161.031318    ┆ 158.597808   ┆ 2.43351  ┆ 1.534391  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘

🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4,      ┆ 175.546923    ┆ 177.81205    ┆ -2.265127 ┆ -1.273888 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4,     ┆ 156.282597    ┆ 158.209134   ┆ -1.926537 ┆ -1.217715 │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16,     ┆ 232.542929    ┆ 235.140136   ┆ -2.597207 ┆ -1.104536 │
│                ┆                ┆ 2048, 128)            ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 169.652791    ┆ 171.475986   ┆ -1.823195 ┆ -1.063236 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2025-09-08 22:44:48 +00:00
711c8c821e shape guards (#161178)
Summary: This PR introduces shape guards to export. Previously only value ranges,  equalities, and specializations would be tracked for symbolic expressions, and we had a forward hook to check them. Instead now we create a function to check shape guards and call it in the exported program.
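A minimal sketch of the user-visible effect, assuming the standard torch.export API; the guard-checking function itself is generated internally by export:

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# Dim carries a value range; per this change the range is enforced by a
# generated shape-guard function inside the exported program rather than a
# forward hook.
ep = export(M(), (torch.randn(8, 4),), dynamic_shapes={"x": {0: Dim("b", min=2, max=1024)}})
ep.module()(torch.randn(16, 4))   # ok: 2 <= 16 <= 1024
# ep.module()(torch.randn(1, 4))  # would fail the shape guard on dim 0
```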

Test Plan:
updated several tests

Rollback Plan:

Differential Revision: D80713603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161178
Approved by: https://github.com/tugsbayasgalan
2025-09-08 22:44:09 +00:00
2c538c9acf rewrite __maybe_broadcast should_expand check for unbacked (#162109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162109
Approved by: https://github.com/aorenste
ghstack dependencies: #162084, #162099
2025-09-08 22:41:18 +00:00
85fe94e933 make should_swap more dde friendly (#162099)
Unblock customers for common cases with DDE, until @pianpwk lands the change to should_swap: https://github.com/pytorch/pytorch/pull/160473.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162099
Approved by: https://github.com/aorenste
ghstack dependencies: #162084
2025-09-08 22:41:18 +00:00
fecd9686f5 Graph split event tracker (#159795)
Summary:
A tool to track events in graph split, specifically how nodes end up in acc or cpu subgraphs.

Usage: use env var to specify a mode and necessary arguments.

FX_NET_ACC_SPLITTER_TRACKER_MODE: Tracker mode.
```
Different modes of the event tracker:
"0": Tracker not enabled (by default)
"1": Tracker enabled but no dumps. Information available by setting breakpoints and visually inspect in pdb.
"2": Tracker enabled and dumps all events to DUMP_PREFIX_all.txt
"3": In addition to events dump, track nodes specified by ENV_FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES recusrively and dump to DUMP_PREFIX_nodex.txt
"4:: In addition to events dump, track all nodes with more than 1 event recusrively and dump to DUMP_PREFIX_nodex.txt
```
FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH: overriding dump path. Leave empty for `~`.
FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES: Nodes to track for mode "3".
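For illustration, a sketch of setting these env vars before running the splitter; only the variable names and mode values come from the description above, the node-list format and dump path are assumptions:

```python
import os

os.environ["FX_NET_ACC_SPLITTER_TRACKER_MODE"] = "2"          # dump all events
os.environ["FX_NET_ACC_SPLITTER_TRACKER_DUMP_PATH"] = "/tmp/split_tracker"  # else defaults to ~
os.environ["FX_NET_ACC_SPLITTER_TRACKER_TRACKED_NODES"] = "node_a,node_b"   # only read in mode "3"
```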

Test Plan: New unit test

Reviewed By: georgiaphillips

Differential Revision: D79203595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159795
Approved by: https://github.com/ezyang
2025-09-08 21:30:17 +00:00
dd44faa9d9 Revert "Modify ROCm MI2xx-based workflows to run on cron schedule (#162103)"
This reverts commit 0af70e2353e1dcda83175fd4834ecb7b63e009e0.

Reverted https://github.com/pytorch/pytorch/pull/162103 on behalf of https://github.com/jithunnair-amd due to Cirrascale network outage resolved. Reverting back to running per commit to aid in triage and CI health ([comment](https://github.com/pytorch/pytorch/pull/162103#issuecomment-3267977825))
2025-09-08 20:53:05 +00:00
5d819f3faf Revert "[associative_scan] Autograd separated (#139939)"
This reverts commit 103f725afa8dbf0204a1be6a042ab93aa16d85d8.

Reverted https://github.com/pytorch/pytorch/pull/139939 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing a weird failure after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/139939#issuecomment-3267945657))
2025-09-08 20:42:47 +00:00
015423bef8 Add fp16-overflow regression test (#162401)
Discovered while debugging https://github.com/pytorch/pytorch/issues/160841, where sdpa returned NaNs because during the computation intermediate values were cast back to fp16 before normalization; this was fixed by https://github.com/pytorch/pytorch/pull/161999.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162401
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-09-08 20:33:23 +00:00
26a1b9cce2 [dynamo] fix resume_execution.py KeyError in Python 3.11+ (#162318)
Fixes https://github.com/pytorch/pytorch/issues/162313

Differential Revision: [D81938289](https://our.internmc.facebook.com/intern/diff/D81938289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162318
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/anijain2305
2025-09-08 20:26:24 +00:00
8f114650eb Add std::any_of to ConvParams struct (#162334)
Removes some for-loops that didn't short-circuit in favor of std::any_of.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162334
Approved by: https://github.com/Skylion007
2025-09-08 20:12:20 +00:00
ec2c1371af [BE]: Update cudnn frontend submodule to 1.14.1 (#162347)
Fixes a few bugs introduced in cudnn frontend 1.11 which affect all our CUDA 13 builds. Also adds support for new CUDNN features whenever we choose to update. @eqy pretty sure this addresses the concern you had over the previous upgrade since that bugfix is now merged. This is a simple header-only update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162347
Approved by: https://github.com/eqy, https://github.com/atalman
2025-09-08 20:03:23 +00:00
8ec01f34e9 [bucketing] custom_ops mode to hide inductor copies overhead (#161499)
Adding "_custom_ops" bucketing to temporary fallback to eager execution of for_each,
to workaround too many generated kernels on inductor side.

This PR also reverts parts of bucketing changes for cycles detection that resulted in accuracy problems.

Differential Revision: [D81152293](https://our.internmc.facebook.com/intern/diff/D81152293)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161499
Approved by: https://github.com/eellison
2025-09-08 20:03:08 +00:00
9c991b63ff [CD] [aarch64] Add CUDA 12.6 and 12.8 to build matrix, remove 12.9 build (#162364)
https://github.com/pytorch/pytorch/issues/159779

Add the full CUDA support matrix to sbsa build (12.6, 12.8)
Same arch support as x86 build
Remove 12.9 sbsa build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162364
Approved by: https://github.com/atalman
2025-09-08 20:00:25 +00:00
4e50651c5f [DTensor] fix F.one_hot (#162307)
F.one_hot(dtensor) used to run into a mixed DTensor-Tensor operation due
to an arange call creating a new Tensor (not DTensor). This PR fixes it
by allowing implicit replication of Tensors for the arange call and the
one consumer of the arange call (the at::eq call).
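A minimal sketch of the now-working pattern, reusing the FakeStore setup from the detach benchmark earlier in this log; num_classes is passed explicitly since num_classes=-1 is noted as broken:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Shard
from torch.testing._internal.distributed.fake_pg import FakeStore

dist.init_process_group("fake", store=FakeStore(), rank=0, world_size=2)
mesh = init_device_mesh("cpu", (2,))
labels = DTensor.from_local(torch.tensor([0, 2, 1, 3]), mesh, [Shard(0)], run_check=False)
out = F.one_hot(labels, num_classes=4)  # previously hit a mixed DTensor/Tensor op
dist.destroy_process_group()
```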

Test Plan:
- new test. Also, F.one_hot(num_classes=-1) is broken so we skip that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162307
Approved by: https://github.com/ezyang
ghstack dependencies: #162117
2025-09-08 19:37:08 +00:00
a0d026688c Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.
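A sketch of the resulting behavior, using the usual runtime availability helpers:

```python
import torch.distributed as dist

# The import above now succeeds even on builds where a given backend is not
# compiled in; backend availability is still queried at runtime.
print(dist.is_available(), dist.is_gloo_available(), dist.is_nccl_available())
```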

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-08 19:10:36 +00:00
d80297a684 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-08 19:10:36 +00:00
fbcabb4fbd Handle f([]) vs. f() in fake tensor caching (#162284)
Fixes https://github.com/pytorch/pytorch/issues/162279
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162284
Approved by: https://github.com/manuelcandales, https://github.com/aorenste
2025-09-08 18:28:05 +00:00
314d47a210 [audio hash update] update the pinned audio hash (#162315)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162315
Approved by: https://github.com/pytorchbot
2025-09-08 18:26:33 +00:00
bc4176c92a CD Windows CUDA 13.0 build - fix packaging of cuda dlls (#162383)
Trying to fix https://github.com/pytorch/pytorch/issues/162333

The CUDA 13.0 file structure changed. Instead of keeping most of the DLLs in the bin folder, they are now in ``bin\x64``, except for the cudnn DLL. See attached picture:
<img width="511" height="361" alt="Screenshot 2025-09-08 at 9 46 26 AM" src="https://github.com/user-attachments/assets/d2e630ee-930f-4da6-9b81-f9ef48fde7ce" />
<img width="490" height="333" alt="Screenshot 2025-09-08 at 9 46 34 AM" src="https://github.com/user-attachments/assets/194cbf43-b6ef-4218-b516-db37b91302be" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162383
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/malfet
2025-09-08 17:57:22 +00:00
de5dc1f038 [cuDNN][SDPA][Nested Tensor] add forward/backward caching support for cuDNN SDPA Nested tensor/varlen (#161434)
Don't recompile every time

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161434
Approved by: https://github.com/drisspg
2025-09-08 17:51:13 +00:00
72e6717d00 Avoid crash with release_available_cached_blocks (#162269)
updated release behavior for cached blocks
Fixes #159567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162269
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-09-08 17:46:43 +00:00
ebd29a13fe [inductor] fuse for scalar shared data (#162311)
LOAF previously could skip these fusion opportunities and cause some tests to fail.

Test:
- TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162311
Approved by: https://github.com/jansel
2025-09-08 17:20:46 +00:00
5793dd7875 [Intel GPU] Integrate OneDNN SDPA training forward and backward (#161058)
This PR is the first split-out PR of https://github.com/pytorch/pytorch/pull/156272 and only contains the OneDNN code. Please help review.

Pending on OneDNN v3.9 commit update. Don't merge.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161058
Approved by: https://github.com/guangyey, https://github.com/EikanWang
2025-09-08 17:07:31 +00:00
985 changed files with 20086 additions and 5950 deletions

View File

@@ -3,12 +3,13 @@ set -eux -o pipefail
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
if [[ "$GPU_ARCH_VERSION" == *"12.9"* ]]; then
# Set CUDA architecture lists to match x86 build_cuda.sh
if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;12.0"
fi
if [[ "$GPU_ARCH_VERSION" == *"13.0"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;11.0;12.0"
elif [[ "$GPU_ARCH_VERSION" == *"13.0"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;11.0;12.0+PTX"
fi
# Compress the fatbin with -compress-mode=size for CUDA 13
@@ -27,7 +28,7 @@ cd /
# on the mounted pytorch repo
git config --global --add safe.directory /pytorch
pip install -r /pytorch/requirements.txt
pip install auditwheel==6.2.0
pip install auditwheel==6.2.0 wheel
if [ "$DESIRED_CUDA" = "cpu" ]; then
echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
@@ -35,6 +36,16 @@ if [ "$DESIRED_CUDA" = "cpu" ]; then
else
echo "BASE_CUDA_VERSION is set to: $DESIRED_CUDA"
export USE_SYSTEM_NCCL=1
# Check if we should use NVIDIA libs from PyPI (similar to x86 build_cuda.sh logic)
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling CUDA libraries with wheel for aarch64."
else
echo "Using nvidia libs from pypi for aarch64."
echo "Updated PYTORCH_EXTRA_INSTALL_REQUIREMENTS for aarch64: $PYTORCH_EXTRA_INSTALL_REQUIREMENTS"
export USE_NVIDIA_PYPI_LIBS=1
fi
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
fi

View File

@@ -69,83 +69,186 @@ def replace_tag(filename) -> None:
f.writelines(lines)
def patch_library_rpath(
folder: str,
lib_name: str,
use_nvidia_pypi_libs: bool = False,
desired_cuda: str = "",
) -> None:
"""Apply patchelf to set RPATH for a library in torch/lib"""
lib_path = f"{folder}/tmp/torch/lib/{lib_name}"
if use_nvidia_pypi_libs:
# For PyPI NVIDIA libraries, construct CUDA RPATH
cuda_rpaths = [
"$ORIGIN/../../nvidia/cudnn/lib",
"$ORIGIN/../../nvidia/nvshmem/lib",
"$ORIGIN/../../nvidia/nccl/lib",
"$ORIGIN/../../nvidia/cusparselt/lib",
]
if "130" in desired_cuda:
cuda_rpaths.append("$ORIGIN/../../nvidia/cu13/lib")
else:
cuda_rpaths.extend(
[
"$ORIGIN/../../nvidia/cublas/lib",
"$ORIGIN/../../nvidia/cuda_cupti/lib",
"$ORIGIN/../../nvidia/cuda_nvrtc/lib",
"$ORIGIN/../../nvidia/cuda_runtime/lib",
"$ORIGIN/../../nvidia/cufft/lib",
"$ORIGIN/../../nvidia/curand/lib",
"$ORIGIN/../../nvidia/cusolver/lib",
"$ORIGIN/../../nvidia/cusparse/lib",
"$ORIGIN/../../nvidia/nvtx/lib",
"$ORIGIN/../../nvidia/cufile/lib",
]
)
# Add $ORIGIN for local torch libs
rpath = ":".join(cuda_rpaths) + ":$ORIGIN"
else:
# For bundled libraries, just use $ORIGIN
rpath = "$ORIGIN"
if os.path.exists(lib_path):
os.system(
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '{rpath}' --force-rpath {lib_name}"
)
def copy_and_patch_library(
src_path: str,
folder: str,
use_nvidia_pypi_libs: bool = False,
desired_cuda: str = "",
) -> None:
"""Copy a library to torch/lib and patch its RPATH"""
if os.path.exists(src_path):
lib_name = os.path.basename(src_path)
shutil.copy2(src_path, f"{folder}/tmp/torch/lib/{lib_name}")
patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)
def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"""
Package the cuda wheel libraries
"""
folder = os.path.dirname(wheel_path)
wheelname = os.path.basename(wheel_path)
os.mkdir(f"{folder}/tmp")
os.system(f"unzip {wheel_path} -d {folder}/tmp")
# Common libraries for all CUDA versions
common_libs = [
# Non-NVIDIA system libraries
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
# Common CUDA libraries (same for all versions)
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnccl.so.2",
"/usr/local/cuda/lib64/libnvshmem_host.so.3",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
"/usr/local/cuda/lib64/libcudnn_ops.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
"/usr/local/cuda/lib64/libcusparse.so.12",
]
# Delete original wheel since it will be repackaged
os.system(f"rm {wheel_path}")
# CUDA version-specific libraries
if "130" in desired_cuda:
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.13",
"/usr/local/cuda/lib64/libcublas.so.13",
"/usr/local/cuda/lib64/libcublasLt.so.13",
"/usr/local/cuda/lib64/libcudart.so.13",
"/usr/local/cuda/lib64/libcufft.so.12",
"/usr/local/cuda/lib64/libcusolver.so.12",
"/usr/local/cuda/lib64/libnvJitLink.so.13",
"/usr/local/cuda/lib64/libnvrtc.so.13",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.13.0",
]
elif "12" in desired_cuda:
# Get the last character for libnvrtc-builtins version (e.g., "129" -> "9")
minor_version = desired_cuda[-1]
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
"/usr/local/cuda/lib64/libcudart.so.12",
"/usr/local/cuda/lib64/libcufft.so.11",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
f"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.{minor_version}",
# Check if we should use PyPI NVIDIA libraries or bundle system libraries
use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"
if use_nvidia_pypi_libs:
print("Using nvidia libs from pypi - skipping CUDA library bundling")
# For PyPI approach, we don't bundle CUDA libraries - they come from PyPI packages
# We only need to bundle non-NVIDIA libraries
minimal_libs_to_copy = [
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
# Combine all libraries
libs_to_copy = common_libs + version_specific_libs
# Copy minimal libraries to unzipped_folder/torch/lib
for lib_path in minimal_libs_to_copy:
copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)
# Copy libraries to unzipped_folder/a/lib
for lib_path in libs_to_copy:
lib_name = os.path.basename(lib_path)
shutil.copy2(lib_path, f"{folder}/tmp/torch/lib/{lib_name}")
os.system(
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '$ORIGIN' --force-rpath {folder}/tmp/torch/lib/{lib_name}"
)
# Patch torch libraries used for searching libraries
torch_libs_to_patch = [
"libtorch.so",
"libtorch_cpu.so",
"libtorch_cuda.so",
"libtorch_cuda_linalg.so",
"libtorch_global_deps.so",
"libtorch_python.so",
"libtorch_nvshmem.so",
"libc10.so",
"libc10_cuda.so",
"libcaffe2_nvrtc.so",
"libshm.so",
]
for lib_name in torch_libs_to_patch:
patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)
else:
print("Bundling CUDA libraries with wheel")
# Original logic for bundling system CUDA libraries
# Common libraries for all CUDA versions
common_libs = [
# Non-NVIDIA system libraries
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
# Common CUDA libraries (same for all versions)
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnccl.so.2",
"/usr/local/cuda/lib64/libnvshmem_host.so.3",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
"/usr/local/cuda/lib64/libcudnn_ops.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
"/usr/local/cuda/lib64/libcusparse.so.12",
]
# CUDA version-specific libraries
if "13" in desired_cuda:
minor_version = desired_cuda[-1]
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.13",
"/usr/local/cuda/lib64/libcublas.so.13",
"/usr/local/cuda/lib64/libcublasLt.so.13",
"/usr/local/cuda/lib64/libcudart.so.13",
"/usr/local/cuda/lib64/libcufft.so.12",
"/usr/local/cuda/lib64/libcusolver.so.12",
"/usr/local/cuda/lib64/libnvJitLink.so.13",
"/usr/local/cuda/lib64/libnvrtc.so.13",
f"/usr/local/cuda/lib64/libnvrtc-builtins.so.13.{minor_version}",
]
elif "12" in desired_cuda:
# Get the last character for libnvrtc-builtins version (e.g., "129" -> "9")
minor_version = desired_cuda[-1]
version_specific_libs = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
"/usr/local/cuda/lib64/libcudart.so.12",
"/usr/local/cuda/lib64/libcufft.so.11",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
f"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.{minor_version}",
]
else:
raise ValueError(f"Unsupported CUDA version: {desired_cuda}.")
# Combine all libraries
libs_to_copy = common_libs + version_specific_libs
# Copy libraries to unzipped_folder/torch/lib
for lib_path in libs_to_copy:
copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)
# Make sure the wheel is tagged with manylinux_2_28
for f in os.scandir(f"{folder}/tmp/"):
@ -153,14 +256,8 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:
replace_tag(f"{f.path}/WHEEL")
break
os.mkdir(f"{folder}/cuda_wheel")
os.system(f"cd {folder}/tmp/; zip -r {folder}/cuda_wheel/{wheelname} *")
shutil.move(
f"{folder}/cuda_wheel/{wheelname}",
f"{folder}/{wheelname}",
copy_function=shutil.copy2,
)
os.system(f"rm -rf {folder}/tmp/ {folder}/cuda_wheel/")
os.system(f"wheel pack {folder}/tmp/ -d {folder}")
os.system(f"rm -rf {folder}/tmp/")
def complete_wheel(folder: str) -> str:
@ -183,14 +280,7 @@ def complete_wheel(folder: str) -> str:
f"/{folder}/dist/{repaired_wheel_name}",
)
else:
repaired_wheel_name = wheel_name.replace(
"linux_aarch64", "manylinux_2_28_aarch64"
)
print(f"Renaming {wheel_name} wheel to {repaired_wheel_name}")
os.rename(
f"/{folder}/dist/{wheel_name}",
f"/{folder}/dist/{repaired_wheel_name}",
)
repaired_wheel_name = list_dir(f"/{folder}/dist")[0]
print(f"Copying {repaired_wheel_name} to artifacts")
shutil.copy2(
@ -232,6 +322,16 @@ if __name__ == "__main__":
if enable_cuda:
build_vars += "MAX_JOBS=5 "
# Handle PyPI NVIDIA libraries vs bundled libraries
use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"
if use_nvidia_pypi_libs:
print("Configuring build for PyPI NVIDIA libraries")
# Configure for dynamic linking (matching x86 logic)
build_vars += "ATEN_STATIC_CUDA=0 USE_CUDA_STATIC_LINK=0 USE_CUPTI_SO=1 "
else:
print("Configuring build for bundled NVIDIA libraries")
# Keep existing static linking approach - already configured above
override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
desired_cuda = os.getenv("DESIRED_CUDA")
if override_package_version is not None:

View File

@ -214,8 +214,7 @@ case "$tag" in
TRITON=yes
;;
pytorch-linux-jammy-py3-gcc11-inductor-benchmarks)
# TODO (huydhn): Upgrade this to Python >= 3.10
ANACONDA_PYTHON_VERSION=3.9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
KATEX=yes

View File

@ -56,9 +56,13 @@ ENV INSTALLED_VISION ${VISION}
# Install rocm
ARG ROCM_VERSION
RUN mkdir ci_commit_pins
COPY ./common/common_utils.sh common_utils.sh
COPY ./ci_commit_pins/rocm-composable-kernel.txt ci_commit_pins/rocm-composable-kernel.txt
COPY ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh
RUN rm install_rocm.sh
RUN rm install_rocm.sh common_utils.sh
RUN rm -r ci_commit_pins
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION}
RUN rm install_rocm_magma.sh

View File

@ -0,0 +1 @@
7fe50dc3da2069d6645d9deb8c017a876472a977

View File

@ -1 +1 @@
fccfc522864cf8bc172abe0cd58ae5581e2d44b9
5ae38bdb0dc066c5823e34dc9797afb9de42c866

View File

@ -2,6 +2,11 @@
set -ex
# for pip_install function
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
ROCM_COMPOSABLE_KERNEL_VERSION="$(cat $(dirname $0)/../ci_commit_pins/rocm-composable-kernel.txt)"
ver() {
printf "%3d%03d%03d%03d" $(echo "$1" | tr '.' ' ');
}
@ -113,6 +118,8 @@ EOF
rm -rf HIP clr
fi
pip_install "git+https://github.com/rocm/composable_kernel@$ROCM_COMPOSABLE_KERNEL_VERSION"
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
@ -176,6 +183,8 @@ install_centos() {
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
pip_install "git+https://github.com/rocm/composable_kernel@$ROCM_COMPOSABLE_KERNEL_VERSION"
# Cleanup
yum clean all
rm -rf /var/cache/yum

View File

@ -52,9 +52,13 @@ ENV INSTALLED_VISION ${VISION}
# Install rocm
ARG ROCM_VERSION
RUN mkdir ci_commit_pins
COPY ./common/common_utils.sh common_utils.sh
COPY ./ci_commit_pins/rocm-composable-kernel.txt ci_commit_pins/rocm-composable-kernel.txt
COPY ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh
RUN rm install_rocm.sh
RUN rm install_rocm.sh common_utils.sh
RUN rm -r ci_commit_pins
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh ${ROCM_VERSION}
RUN rm install_rocm_magma.sh

View File

@ -7,4 +7,4 @@ set -ex
SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
USE_NVSHMEM=0 USE_CUSPARSELT=0 BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.9" ${SCRIPTPATH}/../manywheel/build.sh
USE_NVSHMEM=0 USE_CUSPARSELT=0 BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.10" ${SCRIPTPATH}/../manywheel/build.sh

View File

@ -96,14 +96,24 @@ def sample_vllm_test_library():
"num_gpus": 4,
"steps": [
"pytest -v -s -x lora/test_chatglm3_tp.py",
"echo $VLLM_WORKER_MULTIPROC_METHOD",
"pytest -v -s -x lora/test_llama_tp.py",
"pytest -v -s -x lora/test_multi_loras_with_tp.py",
"pytest -v -s -x lora/test_llm_with_multi_loras.py",
],
},
"vllm_lora_280_failure_test": {
"title": "LoRA 280 failure test",
"id": "vllm_lora_280_failure_test",
"vllm_distributed_test_28_failure_test": {
"title": "Distributed Tests (2 GPUs) pytorch 2.8 release failure",
"id": "vllm_distributed_test_28_failure_test",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"num_gpus": 4,
"steps": [
"pytest -v -s distributed/test_sequence_parallel.py",
],
},
"vllm_lora_28_failure_test": {
"title": "LoRA pytorch 2.8 failure test",
"id": "vllm_lora_28_failure_test",
"steps": ["pytest -v lora/test_quant_model.py"],
},
"vllm_multi_model_processor_test": {
@ -114,6 +124,15 @@ def sample_vllm_test_library():
"pytest -v -s models/multimodal/processing --ignore models/multimodal/processing/test_tensor_schema.py",
],
},
"vllm_multi_model_test_28_failure_test": {
"title": "Multi-Model Test (Failed 2.8 release)",
"id": "vllm_multi_model_test_28_failure_test",
"package_install": ["git+https://github.com/TIGER-AI-Lab/Mantis.git"],
"steps": [
"pytest -v -s models/multimodal/generation/test_voxtral.py",
"pytest -v -s models/multimodal/pooling",
],
},
"vllm_pytorch_compilation_unit_tests": {
"title": "PyTorch Compilation Unit Tests",
"id": "vllm_pytorch_compilation_unit_tests",
@ -128,6 +147,28 @@ def sample_vllm_test_library():
"pytest -v -s compile/test_decorator.py",
],
},
"vllm_languagde_model_test_extended_generation_28_failure_test": {
"title": "Language Models Test (Extended Generation) 2.8 release failure",
"id": "vllm_languagde_model_test_extended_generation_28_failure_test",
"package_install": [
"--no-build-isolation",
"git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8",
],
"steps": [
"pytest -v -s models/language/generation/test_mistral.py",
],
},
"vllm_distributed_test_2_gpu_28_failure_test": {
"title": "Distributed Tests (2 GPUs) pytorch 2.8 release failure",
"id": "vllm_distributed_test_2_gpu_28_failure_test",
"env_vars": {
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",
},
"num_gpus": 4,
"steps": [
"pytest -v -s distributed/test_sequence_parallel.py",
],
},
# TODO(elainewy):need to add g6 with 4 gpus to run this test
"vllm_lora_test": {
"title": "LoRA Test %N",

View File

@ -66,6 +66,11 @@ class VllmBuildParameters:
"DOCKERFILE_PATH", ".github/ci_configs/vllm/Dockerfile.tmp_vllm"
)
# the cleaning script to remove torch dependencies from pip
cleaning_script: Path = env_path_field(
"cleaning_script", ".github/ci_configs/vllm/use_existing_torch.py"
)
# OUTPUT_DIR: where docker buildx (local exporter) will write artifacts
output_dir: Path = env_path_field("OUTPUT_DIR", "external/vllm")
@ -160,6 +165,7 @@ class VllmBuildRunner(BaseRunner):
logger.info("Running vllm build with inputs: %s", inputs)
vllm_commit = clone_vllm()
self.cp_torch_cleaning_script(inputs)
self.cp_dockerfile_if_exist(inputs)
# cp torch wheels from root direct to vllm workspace if exist
self.cp_torch_whls_if_exist(inputs)
@ -205,6 +211,11 @@ class VllmBuildRunner(BaseRunner):
copy(inputs.torch_whls_path, tmp_dir)
return tmp_dir
def cp_torch_cleaning_script(self, inputs: VllmBuildParameters):
script = get_path(inputs.cleaning_script, resolve=True)
vllm_script = Path(f"./{self.work_directory}/use_existing_torch.py")
copy(script, vllm_script)
def cp_dockerfile_if_exist(self, inputs: VllmBuildParameters):
if not inputs.use_local_dockerfile:
logger.info("using vllm default dockerfile.torch_nightly for build")

View File

@ -11,7 +11,7 @@ from typing import Any
from cli.lib.common.cli_helper import BaseRunner
from cli.lib.common.envs_helper import env_path_field, env_str_field, get_env
from cli.lib.common.path_helper import copy, remove_dir
from cli.lib.common.path_helper import copy, get_path, remove_dir
from cli.lib.common.pip_helper import (
pip_install_first_match,
pip_install_packages,
@ -43,6 +43,10 @@ class VllmTestParameters:
torch_cuda_arch_list: str = env_str_field("TORCH_CUDA_ARCH_LIST", "8.9")
cleaning_script: Path = env_path_field(
"cleaning_script", ".github/ci_configs/vllm/use_existing_torch.py"
)
def __post_init__(self):
if not self.torch_whls_path.exists():
raise ValueError("missing torch_whls_path")
@ -92,11 +96,13 @@ class VllmTestRunner(BaseRunner):
self._set_envs(params)
clone_vllm(dst=self.work_directory)
self.cp_torch_cleaning_script(params)
with working_directory(self.work_directory):
remove_dir(Path("vllm"))
self._install_wheels(params)
self._install_dependencies()
# verify the torches are not overridden by test dependencies
check_versions()
def run(self):
@ -104,20 +110,31 @@ class VllmTestRunner(BaseRunner):
main function to run vllm test
"""
self.prepare()
with working_directory(self.work_directory):
if self.test_type == TestInpuType.TEST_PLAN:
if self.num_shards > 1:
run_test_plan(
self.test_plan,
"vllm",
sample_vllm_test_library(),
self.shard_id,
self.num_shards,
)
try:
with working_directory(self.work_directory):
if self.test_type == TestInpuType.TEST_PLAN:
if self.num_shards > 1:
run_test_plan(
self.test_plan,
"vllm",
sample_vllm_test_library(),
self.shard_id,
self.num_shards,
)
else:
run_test_plan(
self.test_plan, "vllm", sample_vllm_test_library()
)
else:
run_test_plan(self.test_plan, "vllm", sample_vllm_test_library())
else:
raise ValueError(f"Unknown test type {self.test_type}")
raise ValueError(f"Unknown test type {self.test_type}")
finally:
# double check the torches are not overridden by other packages
check_versions()
def cp_torch_cleaning_script(self, params: VllmTestParameters):
script = get_path(params.cleaning_script, resolve=True)
vllm_script = Path(f"./{self.work_directory}/use_existing_torch.py")
copy(script, vllm_script)
def _install_wheels(self, params: VllmTestParameters):
logger.info("Running vllm test with inputs: %s", params)

View File

@ -258,11 +258,19 @@ function install_torchrec_and_fbgemm() {
git clone --recursive https://github.com/pytorch/fbgemm
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}" --recurse-submodules
python setup.py bdist_wheel \
--build-variant=rocm \
-DHIP_ROOT_DIR="${ROCM_PATH}" \
-DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
-DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
# until the fbgemm_commit includes the tbb patch
patch <<'EOF'
--- a/FbgemmGpu.cmake
+++ b/FbgemmGpu.cmake
@@ -184,5 +184,6 @@ gpu_cpp_library(
fbgemm_gpu_tbe_cache
fbgemm_gpu_tbe_optimizers
fbgemm_gpu_tbe_utils
+ tbb
DESTINATION
fbgemm_gpu)
EOF
python setup.py bdist_wheel --build-variant=rocm
popd
# Save the wheel before cleaning up

View File

@ -35,11 +35,10 @@ fi
print_cmake_info
if [[ ${BUILD_ENVIRONMENT} == *"distributed"* ]]; then
# Needed for inductor benchmarks, as lots of HF networks make `torch.distributed` calls
USE_DISTRIBUTED=1 USE_OPENMP=1 WERROR=1 python setup.py bdist_wheel
USE_OPENMP=1 WERROR=1 python setup.py bdist_wheel
else
# Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
# NB: we always build with distributed; USE_DISTRIBUTED turns off all
# backends (specifically the gloo backend), so test that this case works too
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel --plat-name macosx_11_0_arm64
fi
if which sccache > /dev/null; then

View File

@ -13,9 +13,13 @@ if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available(
fi
popd
python -mpip install -r requirements.txt
# enable debug asserts in serialization
export TORCH_SERIALIZATION_DEBUG=1
python -mpip install --no-input -r requirements.txt
setup_test_python() {
# The CircleCI worker hostname doesn't resolve to an address.
# This environment variable makes ProcessGroupGloo default to
@ -55,7 +59,7 @@ test_python_shard() {
setup_test_python
time python test/run_test.py --verbose --exclude-jit-executor --exclude-distributed-tests --shard "$1" "$NUM_TEST_SHARDS"
time python test/run_test.py --verbose --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests --shard "$1" "$NUM_TEST_SHARDS"
assert_git_not_dirty
}

View File

@ -386,8 +386,8 @@ def smoke_test_compile(device: str = "cpu") -> None:
def smoke_test_nvshmem() -> None:
if not torch.cuda.is_available():
print("CUDA is not available, skipping NVSHMEM test")
if not torch.cuda.is_available() or target_os == "windows":
print("Windows platform or CUDA is not available, skipping NVSHMEM test")
return
# Check if NVSHMEM is compiled in current build
@ -396,7 +396,9 @@ def smoke_test_nvshmem() -> None:
except ImportError:
# Not built with NVSHMEM support.
# torch is not compiled with NVSHMEM prior to 2.9
if torch.__version__ < "2.9":
from torch.torch_version import TorchVersion
if TorchVersion(torch.__version__) < (2, 9):
return
else:
# After 2.9: NVSHMEM is expected to be compiled in current build
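The move away from a plain string comparison matters because version strings do not sort numerically; a small self-contained illustration (the versions are made-up examples, not taken from this change):
from torch.torch_version import TorchVersion
# Lexicographically "2.10.0" sorts before "2.9.0", so a string comparison would
# treat a newer release as older than 2.9.
assert "2.10.0" < "2.9.0"
# TorchVersion compares numeric components, so 2.10 is correctly newer than 2.9.
assert not (TorchVersion("2.10.0") < (2, 9))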

View File

@ -312,14 +312,14 @@ test_python_shard() {
# modify LD_LIBRARY_PATH to ensure it has the conda env.
# This set of tests has been shown to be buggy without it for the split-build
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_python() {
# shellcheck disable=SC2086
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --verbose $PYTHON_TEST_EXTRA_OPTION
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests $INCLUDE_CLAUSE --verbose $PYTHON_TEST_EXTRA_OPTION
assert_git_not_dirty
}
@ -374,6 +374,7 @@ test_dynamo_wrapped_shard() {
--exclude-distributed-tests \
--exclude-torch-export-tests \
--exclude-aot-dispatch-tests \
--exclude-quantization-tests \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose \
--upload-artifacts-while-running
@ -1146,6 +1147,12 @@ test_distributed() {
fi
}
test_quantization() {
echo "Testing quantization"
python test/test_quantization.py
}
test_rpc() {
echo "Testing RPC C++ tests"
# NB: the ending test_rpc must match the current function name for the current
@ -1646,6 +1653,8 @@ elif [[ "${TEST_CONFIG}" == *executorch* ]]; then
test_executorch
elif [[ "$TEST_CONFIG" == 'jit_legacy' ]]; then
test_python_legacy_jit
elif [[ "$TEST_CONFIG" == 'quantization' ]]; then
test_quantization
elif [[ "${BUILD_ENVIRONMENT}" == *libtorch* ]]; then
# TODO: run some C++ tests
echo "no-op at the moment"
@ -1721,11 +1730,6 @@ elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then
elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
test_inductor_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
if [[ "${BUILD_ENVIRONMENT}" != linux-jammy-py3.9-gcc11-build ]]; then
test_inductor_distributed
fi
fi
elif [[ "${TEST_CONFIG}" == *einops* ]]; then
test_einops
elif [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then

View File

@ -25,7 +25,7 @@ echo Copying over test times file
robocopy /E "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.additional_ci_files" "%PROJECT_DIR_WIN%\.additional_ci_files"
echo Run nn tests
python run_test.py --exclude-jit-executor --exclude-distributed-tests --shard "%SHARD_NUMBER%" "%NUM_TEST_SHARDS%" --verbose
python run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests --shard "%SHARD_NUMBER%" "%NUM_TEST_SHARDS%" --verbose
if ERRORLEVEL 1 goto fail
popd

View File

@ -1,12 +1,20 @@
copy "%CUDA_PATH%\bin\cusparse*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cublas*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cudart*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\curand*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cufft*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cusolver*64_*.dll*" pytorch\torch\lib
if %CUDA_VERSION% geq 130 (
set "dll_path=bin\x64"
) else (
set "dll_path=bin"
)
copy "%CUDA_PATH%\%dll_path%\cusparse*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\cublas*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\cudart*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\curand*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\cufft*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\cusolver*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\nvrtc*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\%dll_path%\nvJitLink_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cudnn*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\nvrtc*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\extras\CUPTI\lib64\cupti64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\extras\CUPTI\lib64\nvperf_host*.dll*" pytorch\torch\lib
@ -20,8 +28,3 @@ copy "%libuv_ROOT%\bin\uv.dll" pytorch\torch\lib
if exist "C:\Windows\System32\zlibwapi.dll" (
copy "C:\Windows\System32\zlibwapi.dll" pytorch\torch\lib
)
::copy nvJitLink dll is requires for cuda 12+
if exist "%CUDA_PATH%\bin\nvJitLink_*.dll*" (
copy "%CUDA_PATH%\bin\nvJitLink_*.dll*" pytorch\torch\lib
)

View File

@ -1,9 +1,9 @@
set WIN_DRIVER_VN=528.89
set "DRIVER_DOWNLOAD_LINK=https://ossci-windows.s3.amazonaws.com/%WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe" & REM @lint-ignore
curl --retry 3 -kL %DRIVER_DOWNLOAD_LINK% --output %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe
set WIN_DRIVER_VN=580.88
set "DRIVER_DOWNLOAD_LINK=https://ossci-windows.s3.amazonaws.com/%WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe" & REM @lint-ignore
curl --retry 3 -kL %DRIVER_DOWNLOAD_LINK% --output %WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe
if errorlevel 1 exit /b 1
start /wait %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe -s -noreboot
start /wait %WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe -s -noreboot
if errorlevel 1 exit /b 1
del %WIN_DRIVER_VN%-data-center-tesla-desktop-winserver-2016-2019-2022-dch-international.exe || ver > NUL
del %WIN_DRIVER_VN%-data-center-tesla-desktop-win10-win11-64bit-dch-international.exe || ver > NUL

View File

@ -85,7 +85,7 @@ mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
# Create an isolated directory to store this build's pytorch checkout and conda
# installation
if [[ -z "$MAC_PACKAGE_WORK_DIR" ]]; then
MAC_PACKAGE_WORK_DIR="$(pwd)/tmp_wheel_conda_${DESIRED_PYTHON}_$(date +%H%M%S)"
MAC_PACKAGE_WORK_DIR="$(pwd)/tmp_wheel_${DESIRED_PYTHON}_$(date +%H%M%S)"
fi
mkdir -p "$MAC_PACKAGE_WORK_DIR" || true
if [[ -n ${GITHUB_ACTIONS} ]]; then
@ -96,11 +96,11 @@ fi
whl_tmp_dir="${MAC_PACKAGE_WORK_DIR}/dist"
mkdir -p "$whl_tmp_dir"
mac_version='macosx_11_0_arm64'
mac_version='macosx-11_0-arm64'
libtorch_arch='arm64'
# Create a consistent wheel package name to rename the wheel to
wheel_filename_new="${TORCH_PACKAGE_NAME}-${build_version}${build_number_prefix}-cp${python_nodot}-none-${mac_version}.whl"
wheel_filename_new="${TORCH_PACKAGE_NAME}-${build_version}${build_number_prefix}-cp${python_nodot}-none-${mac_version//[-,]/_}.whl"
###########################################################
@ -125,7 +125,6 @@ popd
export TH_BINARY_BUILD=1
export INSTALL_TEST=0 # dont install test binaries into site-packages
export MACOSX_DEPLOYMENT_TARGET=11.0
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
@ -133,25 +132,19 @@ RENAME_WHEEL=true
case $desired_python in
3.14t)
echo "Using 3.14 deps"
mac_version='macosx-11.0-arm64'
NUMPY_PINNED_VERSION="==2.1.0"
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
RENAME_WHEEL=false
;;
3.14)
echo "Using 3.14t deps"
mac_version='macosx-11.0-arm64'
NUMPY_PINNED_VERSION="==2.1.0"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
RENAME_WHEEL=false
;;
3.13t)
echo "Using 3.13 deps"
NUMPY_PINNED_VERSION="==2.1.0"
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"
desired_python="3.13"
RENAME_WHEEL=false
;;
3.13)
@ -176,20 +169,16 @@ case $desired_python in
;;
esac
# Install into a fresh env
tmp_env_name="wheel_py$python_nodot"
conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "$tmp_env_name" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS}
source activate "$tmp_env_name"
PINNED_PACKAGES=(
"numpy${NUMPY_PINNED_VERSION}"
)
retry pip install "${PINNED_PACKAGES[@]}" -r "${pytorch_rootdir}/requirements-build.txt"
pip install requests ninja typing-extensions
retry pip install -r "${pytorch_rootdir}/requirements.txt" || true
python -mvenv ~/${desired_python}-build
source ~/${desired_python}-build/bin/activate
retry pip install "${PINNED_PACKAGES[@]}" -r "${pytorch_rootdir}/requirements.txt"
retry brew install libomp
# For USE_DISTRIBUTED=1 on macOS, need libuv, which is built as part of tensorpipe submodule
# For USE_DISTRIBUTED=1 on macOS, this enables gloo, which needs libuv, which
# is built as part of tensorpipe submodule
export USE_DISTRIBUTED=1
export USE_MKLDNN=OFF
@ -199,7 +188,7 @@ export BUILD_TEST=OFF
pushd "$pytorch_rootdir"
echo "Calling setup.py bdist_wheel at $(date)"
python setup.py bdist_wheel -d "$whl_tmp_dir" --plat-name ${mac_version}
_PYTHON_HOST_PLATFORM=${mac_version} ARCHFLAGS="-arch arm64" python setup.py bdist_wheel -d "$whl_tmp_dir" --plat-name "${mac_version//[-.]/_}"
echo "Finished setup.py bdist_wheel at $(date)"

View File

@ -73,7 +73,7 @@ exclude =
./docs/src,
./functorch/docs,
./functorch/examples,
./functorch/notebooks,
./functorch/docs/source/tutorials,
./scripts,
./test/generated_type_hints_smoketest.py,
./third_party,

View File

@ -21,6 +21,7 @@ self-hosted-runner:
- linux.arm64.2xlarge.ephemeral
- linux.arm64.m7g.4xlarge
- linux.arm64.m7g.4xlarge.ephemeral
- linux.arm64.r7g.12xlarge.memory
- linux.4xlarge.nvidia.gpu
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu

View File

@ -1 +1 @@
2e300559e4e123928a22187b8f59a5b56f57ddc8
87ff22e49ed0e92576c4935ccb8c143daac4a3cd

View File

@ -1 +1 @@
7f1de94a4c2d14f59ad4ca84538c36084ea6b2c8
08ae0af1395c8d8471f4025deb6af9aef90b342f

View File

@ -1 +1 @@
4172235ab78b09989fb56edaf734dbee283dda3e
973c9d01da863cac9c51e8a5c0d390fc84b84fbc

View File

@ -1 +1 @@
6c5478ff7c3d50dd1e3047d72ec5909bea474073
c77852e117bdf056c8e9a087e51d6f65cf6ba53d

View File

@ -82,16 +82,10 @@ RUN if command -v apt-get >/dev/null; then \
apt-get update -y \
&& apt-get install -y ccache software-properties-common git curl wget sudo vim; \
else \
dnf install -y git curl wget sudo vim; \
dnf install -y git curl wget sudo; \
fi \
&& python3 --version && python3 -m pip --version
# Workaround for https://github.com/openai/triton/issues/2507 and
# https://github.com/pytorch/pytorch/issues/107960 -- hopefully
# this won't be needed for future versions of this docker image
# or future versions of triton.
RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/
# Install uv for faster pip installs if not existed
RUN --mount=type=cache,target=/root/.cache/uv \
if ! python3 -m uv --version >/dev/null 2>&1; then \
@ -220,11 +214,16 @@ ARG SCCACHE_S3_NO_CREDENTIALS=0
RUN --mount=type=cache,target=/root/.cache/uv \
--mount=type=bind,source=.git,target=.git \
if [ "$USE_SCCACHE" = "1" ]; then \
echo "Installing sccache..." \
&& curl -L -o sccache.tar.gz https://github.com/mozilla/sccache/releases/download/v0.8.1/sccache-v0.8.1-x86_64-unknown-linux-musl.tar.gz \
echo "Installing sccache..."; \
if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
SCCACHE_ARCHIVE="sccache-v0.8.1-aarch64-unknown-linux-musl"; \
else \
SCCACHE_ARCHIVE="sccache-v0.8.1-x86_64-unknown-linux-musl"; \
fi; \
curl -L -o sccache.tar.gz "https://github.com/mozilla/sccache/releases/download/v0.8.1/${SCCACHE_ARCHIVE}.tar.gz" \
&& tar -xzf sccache.tar.gz \
&& sudo mv sccache-v0.8.1-x86_64-unknown-linux-musl/sccache /usr/bin/sccache \
&& rm -rf sccache.tar.gz sccache-v0.8.1-x86_64-unknown-linux-musl \
&& sudo mv "${SCCACHE_ARCHIVE}"/sccache /usr/bin/sccache \
&& rm -rf sccache.tar.gz "${SCCACHE_ARCHIVE}" \
&& export SCCACHE_BUCKET=${SCCACHE_BUCKET_NAME} \
&& export SCCACHE_REGION=${SCCACHE_REGION_NAME} \
&& export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \
@ -285,7 +284,7 @@ RUN if command -v apt-get >/dev/null; then \
&& ln -sf /usr/bin/python${PYTHON_VERSION}-config /usr/bin/python3-config \
&& curl -sS ${GET_PIP_URL} | python${PYTHON_VERSION}; \
else \
dnf install -y git curl wget sudo vim; \
dnf install -y git curl wget sudo; \
fi \
&& python3 --version && python3 -m pip --version
@ -298,12 +297,6 @@ RUN echo "[INFO] Listing current directory before torch install step:" && \
echo "[INFO] Showing torch_build_versions.txt content:" && \
cat torch_build_versions.txt
# Workaround for https://github.com/openai/triton/issues/2507 and
# https://github.com/pytorch/pytorch/issues/107960 -- hopefully
# this won't be needed for future versions of this docker image
# or future versions of triton.
RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/
# Install uv for faster pip installs if not existed
RUN --mount=type=cache,target=/root/.cache/uv \
if ! python3 -m uv --version > /dev/null 2>&1; then \

View File

@ -0,0 +1,17 @@
import glob
requires_files = glob.glob("requirements/*.txt")
requires_files += ["pyproject.toml"]
for file in requires_files:
print(f">>> cleaning {file}")
with open(file) as f:
lines = f.readlines()
if "torch" in "".join(lines).lower():
print("removed:")
with open(file, "w") as f:
for line in lines:
if "torch" not in line.lower():
f.write(line)
print(f"<<< done cleaning {file}")
print()
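A minimal illustration of the filter the script applies, using made-up requirement lines (this snippet is not part of the script itself):
lines = ["setuptools>=77.0.3\n", "torch==2.8.0\n", "torchvision\n", "ninja\n"]
# Any line mentioning torch (including torchvision/torchaudio) is dropped so the
# previously installed torch wheels are not overridden when vLLM requirements install.
kept = [line for line in lines if "torch" not in line.lower()]
print("".join(kept))  # setuptools>=77.0.3 and ninja survive; the torch* lines are removed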

View File

@ -19,6 +19,7 @@ ciflow_push_tags:
- ciflow/nightly
- ciflow/periodic
- ciflow/periodic-rocm-mi300
- ciflow/quantization-periodic
- ciflow/rocm
- ciflow/rocm-mi300
- ciflow/s390

View File

@ -15,7 +15,7 @@ optree==0.13.0
packaging==23.1
parameterized==0.8.1
pillow==10.3.0
protobuf==5.29.4
protobuf==5.29.5
psutil==5.9.8
pygments==2.15.0
pytest-cpp==2.3.0
@ -26,7 +26,7 @@ pytest-xdist==3.3.1
pytest==7.3.2
pyyaml==6.0.2
scipy==1.12.0
setuptools==72.1.0
setuptools==78.1.1
sympy==1.13.3
tlparse==0.4.0
tensorboard==2.13.0

View File

@ -39,7 +39,9 @@ def main() -> None:
pull_request_label_names = [label.name for label in pull_request_labels]
issue_label_names = [label.name for label in issue_labels]
labels_to_add = [
label for label in issue_label_names if label not in pull_request_label_names
label
for label in issue_label_names
if label not in pull_request_label_names and label != "actionable"
]
if not labels_to_add:
print("The pull request already has the same labels.")

View File

@ -38,60 +38,60 @@ CPU_AARCH64_ARCH = ["cpu-aarch64"]
CPU_S390X_ARCH = ["cpu-s390x"]
CUDA_AARCH64_ARCHES = ["13.0-aarch64"]
CUDA_AARCH64_ARCHES = ["12.6-aarch64", "12.8-aarch64", "13.0-aarch64"]
PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"12.6": (
"nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'"
"nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | "
"nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | "
"nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | "
"nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | "
"nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | "
"nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | "
"nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | "
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | "
"nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | "
"nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'"
),
"12.8": (
"nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'"
"nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | "
"nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | "
"nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | "
"nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | "
"nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | "
"nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | "
"nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | "
"nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | "
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | "
"nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'"
),
"13.0": (
"nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'"
"nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | "
"nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | "
"nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | "
"nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | "
"nvidia-cublas==13.0.0.19; platform_system == 'Linux' | "
"nvidia-cufft==12.0.0.15; platform_system == 'Linux' | "
"nvidia-curand==10.4.0.35; platform_system == 'Linux' | "
"nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | "
"nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | "
"nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | "
"nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | "
"nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx==13.0.39; platform_system == 'Linux' | "
"nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | "
"nvidia-cufile==1.15.0.42; platform_system == 'Linux'"
),
"xpu": (
"intel-cmplr-lib-rt==2025.2.1 | "

.github/scripts/prepare_vllm_wheels.sh vendored Executable file
View File

@ -0,0 +1,94 @@
#!/usr/bin/env bash
set -eux
torch_version=$(unzip -p torch-* '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
nightly=$(echo ${torch_version} | cut -d'.' -f4)
# Copied from .ci/manywheel/build_common.sh
make_wheel_record() {
fpath=$1
if echo $fpath | grep RECORD >/dev/null 2>&1; then
echo "$fpath,,"
else
fhash=$(openssl dgst -sha256 -binary $fpath | openssl base64 | sed -e 's/+/-/g' | sed -e 's/\//_/g' | sed -e 's/=//g')
fsize=$(ls -nl $fpath | awk '{print $5}')
echo "$fpath,sha256=$fhash,$fsize"
fi
}
change_wheel_version() {
local package=$1
local wheel=$2
local f_version=$3
local t_version=$4
# Extract the wheel
${PYTHON_EXECUTABLE} -mwheel unpack $wheel
mv "${package}-${f_version}" "${package}-${t_version}"
# Change the version from f_version to t_version in the dist-info dir
pushd "${package}-${t_version}"
mv "${package}-${f_version}.dist-info" "${package}-${t_version}.dist-info"
pushd "${package}-${t_version}.dist-info"
sed -i "s/${package}-${f_version}.dist-info/${package}-${t_version}.dist-info/g" RECORD
# Update the version in METADATA and its SHA256 hash
sed -i "s/Version: ${f_version}/Version: ${t_version}/g" METADATA
# then add the PyTorch nightly dependency for vLLM and xformers
if [[ "${package}" == vllm ]] || [[ "${package}" == xformers ]]; then
sed -i "/License-File/a\Requires-Dist: torch==${torch_version}" METADATA
fi
sed -i '/METADATA,sha256/d' RECORD
popd
make_wheel_record "${package}-${t_version}.dist-info/METADATA" >> "${package}-${t_version}.dist-info/RECORD"
popd
# Repack the wheel
${PYTHON_EXECUTABLE} -mwheel pack "${package}-${t_version}"
# Clean up
rm -rf "${package}-${t_version}"
}
repackage_wheel() {
local package=$1
pushd $package
local orig_wheel=$(find . -name *${package//-/_}*)
local orig_version=$(unzip -p $orig_wheel '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
local version=""
if [[ "${package}" == vllm ]]; then
# Copied from vllm/.buildkite/scripts/upload-wheels.sh
version=1.0.0
else
version=$(echo $orig_version | tr '.+' '.' | cut -d'.' -f1-3)
fi
local nightly_version=$version.$nightly
# Use nightly version
change_wheel_version ${package//-/_} $orig_wheel $orig_version $nightly_version
# Clean up
rm "${orig_wheel}"
auditwheel repair --plat $PLATFORM *.whl \
--exclude libc10* --exclude libtorch* --exclude libcu* --exclude libnv*
local repair_wheel=$(find wheelhouse -name *${PLATFORM}*)
local repair_wheel=$(basename ${repair_wheel})
popd
cp ${package}/wheelhouse/${repair_wheel} .
rm -rf $package
}
# Require to re-package the wheel
${PYTHON_EXECUTABLE} -mpip install wheel==0.45.1
pushd externals/vllm/wheels
for package in xformers flashinfer-python vllm; do
repackage_wheel $package
done
popd
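For orientation, a rough Python rendering of the version rewrite performed by repackage_wheel/change_wheel_version above; the concrete version strings are made-up examples (vLLM itself is special-cased to 1.0.0 before the nightly suffix is appended):
def nightly_version(orig_version: str, nightly_suffix: str) -> str:
    # Mirrors `tr '.+' '.' | cut -d'.' -f1-3`: keep the first three version components,
    # then append the nightly tag taken from the torch wheel's METADATA version.
    base = ".".join(orig_version.replace("+", ".").split(".")[:3])
    return f"{base}.{nightly_suffix}"
print(nightly_version("0.0.32+5d4b92a5.d20250120", "dev20250915"))  # 0.0.32.dev20250915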

View File

@ -22,6 +22,16 @@ name: !{{ build_environment }}
echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
{%- endmacro %}
{%- macro setup_python(py_ver) -%}
- name: Setup Python
uses: actions/setup-python@v6
with:
# TODO: Removeme once 3.14 is out
# .4 version is min minor for 3.10, and also no-gil version of 3.13 needs at least 3.13.3
python-version: "!{{ (py_ver.strip('t') + '.4') if '3.14' not in py_ver else '3.14.0-rc.2' }}"
freethreaded: !{{ "true" if py_ver.endswith('t') else "false" }}
{%- endmacro %}
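To make the template expression concrete, a small Python rendering of the same logic (inputs are just examples):
def rendered(py_ver: str) -> tuple[str, bool]:
    # Mirrors the Jinja expression above: strip a trailing 't', pin to the .4 patch
    # release, except 3.14 which is still on a release candidate.
    version = (py_ver.strip("t") + ".4") if "3.14" not in py_ver else "3.14.0-rc.2"
    return version, py_ver.endswith("t")
print(rendered("3.13t"))  # ('3.13.4', True)  -> free-threaded build
print(rendered("3.11"))   # ('3.11.4', False)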
on:
# TODO: Migrate to new ciflow trigger, reference https://github.com/pytorch/pytorch/pull/70321
push:
@ -61,23 +71,13 @@ jobs:
{%- endif %}
steps:
!{{ set_runner_specific_vars() }}
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
!{{ setup_python(config.get("python_version", "3.10")) }}
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
- name: Populate binary env
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@ -94,8 +94,6 @@ jobs:
{%- if config["package_type"] == "wheel" %}
- name: Test PyTorch wheel
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@ -106,33 +104,9 @@ jobs:
SMOKE_TEST_PARAMS=""
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
# shellcheck disable=SC2153
case $DESIRED_PYTHON in
3.14t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.14)
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.13t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"
desired_python="3.13"
;;
*)
# shellcheck disable=SC2153
desired_python=${DESIRED_PYTHON}
;;
esac
# shellcheck disable=SC2086
conda create -yn "test_conda_env" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS} ${EXTRA_CONDA_INSTALL_FLAGS}
conda activate test_conda_env
python -mvenv test_venv
source test_venv/bin/activate
pip install "$PYTORCH_FINAL_PACKAGE_DIR"/*.whl numpy -v
# shellcheck disable=SC2086

View File

@ -47,12 +47,11 @@ jobs:
matrix:
include: [
{ name: "manylinux2_28-builder", tag: "cuda13.0", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cuda12.9", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cuda12.8", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cuda12.6", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda13.0", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.9", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.8", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.6", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm6.3", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm6.4", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cpu", runner: "linux.9xlarge.ephemeral" },

View File

@ -12,6 +12,9 @@ on:
paths:
- .github/workflows/build-vllm-wheel.yml
- .github/ci_commit_pins/vllm.txt
schedule:
# every morning at 01:30PM UTC, 9:30AM EST, 6:30AM PST
- cron: 30 13 * * *
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
@ -24,21 +27,33 @@ jobs:
fail-fast: false
matrix:
python-version: [ '3.12' ]
# TODO (huydhn): Add cu130 https://github.com/pytorch/pytorch/pull/162000#issuecomment-3261541554
# TODO (huydhn): Add cu130 after https://github.com/vllm-project/vllm/issues/24464 is resolved
platform: [ 'manylinux_2_28_x86_64', 'manylinux_2_28_aarch64' ]
device: [ 'cu128', 'cu129' ]
runner: [ 'linux.12xlarge.memory' ]
include:
- device: cu128
- platform: manylinux_2_28_x86_64
device: cu128
manylinux-image: 'pytorch/manylinux2_28-builder:cuda12.8'
- device: cu129
runner: linux.12xlarge.memory
- platform: manylinux_2_28_x86_64
device: cu129
manylinux-image: 'pytorch/manylinux2_28-builder:cuda12.9'
name: "Build ${{ matrix.device }} vLLM wheel"
runner: linux.12xlarge.memory
- platform: manylinux_2_28_aarch64
device: cu128
manylinux-image: 'pytorch/manylinuxaarch64-builder:cuda12.8'
runner: linux.arm64.r7g.12xlarge.memory
- platform: manylinux_2_28_aarch64
device: cu129
manylinux-image: 'pytorch/manylinuxaarch64-builder:cuda12.9'
runner: linux.arm64.r7g.12xlarge.memory
name: "Build ${{ matrix.device }} vLLM wheel on ${{ matrix.platform }}"
runs-on: ${{ matrix.runner }}
timeout-minutes: 480
env:
PY_VERS: ${{ matrix.python-version }}
MANYLINUX_IMAGE: ${{ matrix.manylinux-image }}
PLATFORM: 'manylinux_2_28_x86_64'
PLATFORM: ${{ matrix.platform }}
BUILD_DEVICE: ${{ matrix.device }}
steps:
- name: Setup SSH (Click me for login details)
@ -59,20 +74,6 @@ jobs:
run: |
set -eux
# Keep PyTorch nightly wheel here so that we can install it later during
# vLLM build process
mkdir -p "${RUNNER_TEMP}/artifacts/"
container_name=$(docker run \
--tty \
--detach \
-e PLATFORM \
-v "${GITHUB_WORKSPACE}:/pytorch" \
-v "${RUNNER_TEMP}/artifacts:/artifacts" \
-w /artifacts/ \
"${MANYLINUX_IMAGE}"
)
# Determine python executable for given version (copied from build-triton-wheel)
case $PY_VERS in
3.10)
@ -102,6 +103,21 @@ jobs:
;;
esac
# Keep PyTorch nightly wheel here so that we can install it later during
# vLLM build process
mkdir -p "${RUNNER_TEMP}/artifacts/"
container_name=$(docker run \
--tty \
--detach \
-e PLATFORM \
-e PYTHON_EXECUTABLE="${PYTHON_EXECUTABLE}" \
-v "${GITHUB_WORKSPACE}:/pytorch" \
-v "${RUNNER_TEMP}/artifacts:/artifacts" \
-w /artifacts/ \
"${MANYLINUX_IMAGE}"
)
docker exec -t "${container_name}" "${PYTHON_EXECUTABLE}" -mpip install \
--pre torch torchvision torchaudio \
--index-url "https://download.pytorch.org/whl/nightly/${BUILD_DEVICE}"
@ -113,7 +129,6 @@ jobs:
--index-url "https://download.pytorch.org/whl/nightly/${BUILD_DEVICE}"
# Save this for later
echo "PYTHON_EXECUTABLE=${PYTHON_EXECUTABLE}" >> "$GITHUB_ENV"
echo "container_name=${container_name}" >> "$GITHUB_ENV"
- name: Build vLLM wheel
@ -131,41 +146,12 @@ jobs:
set -eux
# Get these wheels ready, the vllm renaming logic is copied from its .buildkite/scripts/upload-wheels.sh
docker exec -t "${container_name}" bash -c "
set -eux
nightly=\$(unzip -p torch-* '**/METADATA' | grep '^Version: ' | cut -d' ' -f2 | cut -d'.' -f4)
pushd externals/vllm/wheels
for package in xformers flashinfer-python vllm; do
pushd \$package
auditwheel repair --plat \$PLATFORM *.whl \
--exclude libc10* --exclude libtorch* --exclude libcu* --exclude libnv*
repair_wheel=\$(find wheelhouse -name *\${PLATFORM}*)
repair_wheel=\$(basename \${repair_wheel})
popd
cp \${package}/wheelhouse/\${repair_wheel} .
version=\$(unzip -p \$repair_wheel '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
if [[ \$package == vllm ]]; then
new_wheel=\${repair_wheel/\$version/1.0.0.\$nightly}
else
major_version=\$(echo \$version | tr '.+' '.' | cut -d'.' -f1-3)
new_wheel=\${repair_wheel/\$version/\$major_version.\$nightly}
fi
mv -- \$repair_wheel \$new_wheel
rm -rf \$package
done
popd
"
docker exec -t "${container_name}" bash -c /pytorch/.github/scripts/prepare_vllm_wheels.sh
docker exec -t "${container_name}" chown -R 1000:1000 /artifacts
- uses: actions/upload-artifact@50769540e7f4bd5e21e526ee35c689e35e0d6874 # v4.4.0
with:
name: vllm-wheel-${{ matrix.device }}-${{ matrix.python-version }}-${{ env.PLATFORM }}
name: vllm-wheel-${{ matrix.device }}-${{ matrix.platform }}-${{ matrix.python-version }}
if-no-files-found: error
path: ${{ runner.temp }}/artifacts/externals/vllm/wheels/*.whl
@ -175,15 +161,17 @@ jobs:
# Copied from build-triton-wheel workflow (mostly)
upload-wheel:
name: "Upload ${{ matrix.device }} vLLM wheel"
name: "Upload ${{ matrix.device }} vLLM wheel on ${{ matrix.platform }}"
needs:
- build-wheel
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
platform: [ 'manylinux_2_28_x86_64', 'manylinux_2_28_aarch64' ]
device: [ 'cu128', 'cu129' ]
env:
PLATFORM: ${{ matrix.platform }}
BUILD_DEVICE: ${{ matrix.device }}
permissions:
id-token: write
@ -219,15 +207,15 @@ jobs:
run: |
set -eux
mkdir -p "${RUNNER_TEMP}/artifacts/"
mv "${RUNNER_TEMP}"/artifacts-all/vllm-wheel-"${BUILD_DEVICE}"-*/* "${RUNNER_TEMP}/artifacts/"
mv "${RUNNER_TEMP}"/artifacts-all/vllm-wheel-"${BUILD_DEVICE}"-"${PLATFORM}"-*/* "${RUNNER_TEMP}/artifacts/"
- name: Set DRY_RUN (only for tagged pushes)
if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) }}
- name: Set DRY_RUN
if: ${{ (github.event_name == 'push' && (github.event.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v'))) || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' }}
shell: bash
run: |
echo "DRY_RUN=disabled" >> "$GITHUB_ENV"
- name: Set UPLOAD_CHANNEL (only for tagged pushes)
- name: Set UPLOAD_CHANNEL
if: ${{ github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/v') }}
shell: bash
run: |

View File

@ -112,6 +112,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_10-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_10-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -132,7 +224,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -223,6 +315,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -243,7 +427,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -334,6 +518,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -354,7 +630,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -445,6 +721,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -465,7 +833,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -556,6 +924,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -576,7 +1036,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -667,6 +1127,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.14"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_14-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.14"
build_name: manywheel-py3_14-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.14"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_14-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.14"
build_name: manywheel-py3_14-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -687,7 +1239,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -778,6 +1330,98 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14t-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.14t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_14t-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: "12.6-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.14t"
build_name: manywheel-py3_14t-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14t-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.14t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda-aarch64-12_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_14t-cuda-aarch64-12_8-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.14t"
build_name: manywheel-py3_14t-cuda-aarch64-12_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14t-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -798,7 +1442,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
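Note on the aarch64 diff above: the functional change in each PYTORCH_EXTRA_INSTALL_REQUIREMENTS line is in its PEP 508 environment markers, where the `and platform_machine == 'x86_64'` clause is dropped so the NVIDIA dependency wheels are also selected on aarch64 Linux installs. A minimal, illustrative check with the third-party `packaging` library shows the effect of that marker change; this snippet is not part of the workflow, and the environment dict is a hypothetical stand-in for an aarch64 Linux host.

```python
# Illustrative only: evaluate the old and new PEP 508 markers from the
# PYTORCH_EXTRA_INSTALL_REQUIREMENTS lines above against a simulated
# aarch64 Linux environment. Requires the third-party "packaging" package.
from packaging.markers import Marker

old = Marker("platform_system == 'Linux' and platform_machine == 'x86_64'")
new = Marker("platform_system == 'Linux'")

# Hypothetical environment overrides for an aarch64 Linux host.
aarch64_linux = {"platform_system": "Linux", "platform_machine": "aarch64"}

print(old.evaluate(aarch64_linux))  # False: dependency wheels were skipped on aarch64
print(new.evaluate(aarch64_linux))  # True:  dependency wheels now install on aarch64
```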

View File

@@ -60,7 +60,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_8-test: # Testing
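In the x86_64 manywheel file above, the same marker simplification is applied and the nvidia-nvshmem-cu12 pin moves from 3.3.20 to 3.3.24. For orientation only, the sketch below turns a '|'-delimited PYTORCH_EXTRA_INSTALL_REQUIREMENTS string into individual requirement entries; the real consumption of this variable happens inside the reusable _binary-build-linux.yml workflow and the builder scripts, so the splitting shown here is an assumption, not the actual implementation.

```python
# Sketch under assumptions: split a '|'-delimited requirements string into
# individual PEP 508 entries (the delimiter is taken from the workflow text;
# how the build scripts consume these entries is not reproduced here).
extra = (
    "nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | "
    "nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux'"
)

requirements = [entry.strip() for entry in extra.split("|") if entry.strip()]
for req in requirements:
    print(req)
# nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux'
# nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux'
```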

View File

@@ -127,7 +127,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_6-test: # Testing
@@ -193,7 +193,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_8-test: # Testing
@@ -259,7 +259,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda13_0-test: # Testing
@@ -719,7 +719,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_6-test: # Testing
@@ -785,7 +785,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_8-test: # Testing
@@ -851,7 +851,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda13_0-test: # Testing
@@ -1311,7 +1311,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_6-test: # Testing
@@ -1377,7 +1377,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_8-test: # Testing
@@ -1443,7 +1443,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda13_0-test: # Testing
@@ -1903,7 +1903,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_6-test: # Testing
@@ -1969,7 +1969,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_8-test: # Testing
@@ -2035,7 +2035,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda13_0-test: # Testing
@@ -2495,7 +2495,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_6-test: # Testing
@@ -2561,7 +2561,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_8-test: # Testing
@@ -2627,7 +2627,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda13_0-test: # Testing
@@ -3087,7 +3087,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda12_6-test: # Testing
@@ -3153,7 +3153,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda12_8-test: # Testing
@@ -3219,7 +3219,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda13_0-test: # Testing
@@ -3679,7 +3679,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14t-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda12_6-test: # Testing
@@ -3745,7 +3745,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14t-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda12_8-test: # Testing
@@ -3811,7 +3811,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14t-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand==10.4.0.35; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile==1.15.0.42; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda13_0-test: # Testing

View File

@@ -60,13 +60,13 @@ jobs:
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
- name: Setup Python
uses: actions/setup-python@v6
with:
# TODO: Removeme once 3.14 is out
# .4 version is min minor for 3.10, and also no-gil version of 3.13 needs at least 3.13.3
python-version: "3.10.4"
freethreaded: false
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
@@ -81,13 +81,9 @@ jobs:
working-directory: pytorch
- name: Populate binary env
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"

View File

@@ -56,13 +56,13 @@ jobs:
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
- name: Setup Python
uses: actions/setup-python@v6
with:
# TODO: Removeme once 3.14 is out
# .4 version is min minor for 3.10, and also no-gil version of 3.13 needs at least 3.13.3
python-version: "3.10.4"
freethreaded: false
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
@@ -77,13 +77,9 @@ jobs:
working-directory: pytorch
- name: Populate binary env
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@@ -99,8 +95,6 @@ jobs:
"${PYTORCH_ROOT}/.ci/wheel/build_wheel.sh"
- name: Test PyTorch wheel
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@@ -111,33 +105,9 @@ jobs:
SMOKE_TEST_PARAMS=""
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
# shellcheck disable=SC2153
case $DESIRED_PYTHON in
3.14t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.14)
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.13t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"
desired_python="3.13"
;;
*)
# shellcheck disable=SC2153
desired_python=${DESIRED_PYTHON}
;;
esac
# shellcheck disable=SC2086
conda create -yn "test_conda_env" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS} ${EXTRA_CONDA_INSTALL_FLAGS}
conda activate test_conda_env
python -mvenv test_venv
source test_venv/bin/activate
pip install "$PYTORCH_FINAL_PACKAGE_DIR"/*.whl numpy -v
# shellcheck disable=SC2086
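Judging by the hunk header (33 old lines against 9 new), the case statement that picked a conda Python (including the 3.13t/3.14 free-threading release candidates) and created `test_conda_env` sits on the removed side, while the short `python -mvenv test_venv` block is the replacement: the wheel is now smoke-tested in a plain virtual environment. A minimal Python sketch of that venv flow, not the actual CI script; the `artifacts` directory and the torch import are hypothetical stand-ins for $PYTORCH_FINAL_PACKAGE_DIR and the real smoke test:

    # Sketch: install a freshly built wheel into a throwaway venv and smoke-test it.
    import glob
    import subprocess
    import venv
    from pathlib import Path

    pkg_dir = Path("artifacts")   # stand-in for $PYTORCH_FINAL_PACKAGE_DIR
    env_dir = Path("test_venv")

    # Equivalent of `python -mvenv test_venv`
    venv.EnvBuilder(with_pip=True).create(env_dir)
    pip = env_dir / "bin" / "pip"
    python = env_dir / "bin" / "python"

    # Equivalent of `pip install "$PYTORCH_FINAL_PACKAGE_DIR"/*.whl numpy -v`
    wheels = glob.glob(str(pkg_dir / "*.whl"))
    subprocess.run([str(pip), "install", *wheels, "numpy"], check=True)

    # Basic smoke test: the wheel imports and reports a version.
    subprocess.run([str(python), "-c", "import torch; print(torch.__version__)"],
                   check=True)

Compared with the conda path, this drops the dependency on channel selection and free-threading labels entirely, since the interpreter is already provided by the setup-python step earlier in the job.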
@@ -196,13 +166,13 @@ jobs:
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
- name: Setup Python
uses: actions/setup-python@v6
with:
# TODO: Removeme once 3.14 is out
# .4 version is min minor for 3.10, and also no-gil version of 3.13 needs at least 3.13.3
python-version: "3.11.4"
freethreaded: false
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
@@ -217,13 +187,9 @@ jobs:
working-directory: pytorch
- name: Populate binary env
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@@ -239,8 +205,6 @@ jobs:
"${PYTORCH_ROOT}/.ci/wheel/build_wheel.sh"
- name: Test PyTorch wheel
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@@ -251,33 +215,9 @@ jobs:
SMOKE_TEST_PARAMS=""
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
# shellcheck disable=SC2153
case $DESIRED_PYTHON in
3.14t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.14)
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.13t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"
desired_python="3.13"
;;
*)
# shellcheck disable=SC2153
desired_python=${DESIRED_PYTHON}
;;
esac
# shellcheck disable=SC2086
conda create -yn "test_conda_env" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS} ${EXTRA_CONDA_INSTALL_FLAGS}
conda activate test_conda_env
python -mvenv test_venv
source test_venv/bin/activate
pip install "$PYTORCH_FINAL_PACKAGE_DIR"/*.whl numpy -v
# shellcheck disable=SC2086
@@ -336,13 +276,13 @@ jobs:
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
- name: Setup Python
uses: actions/setup-python@v6
with:
# TODO: Removeme once 3.14 is out
# .4 version is min minor for 3.10, and also no-gil version of 3.13 needs at least 3.13.3
python-version: "3.12.4"
freethreaded: false
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
@@ -357,13 +297,9 @@ jobs:
working-directory: pytorch
- name: Populate binary env
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@@ -379,8 +315,6 @@ jobs:
"${PYTORCH_ROOT}/.ci/wheel/build_wheel.sh"
- name: Test PyTorch wheel
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@@ -391,33 +325,9 @@ jobs:
SMOKE_TEST_PARAMS=""
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
# shellcheck disable=SC2153
case $DESIRED_PYTHON in
3.14t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.14)
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.13t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"
desired_python="3.13"
;;
*)
# shellcheck disable=SC2153
desired_python=${DESIRED_PYTHON}
;;
esac
# shellcheck disable=SC2086
conda create -yn "test_conda_env" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS} ${EXTRA_CONDA_INSTALL_FLAGS}
conda activate test_conda_env
python -mvenv test_venv
source test_venv/bin/activate
pip install "$PYTORCH_FINAL_PACKAGE_DIR"/*.whl numpy -v
# shellcheck disable=SC2086
@@ -476,13 +386,13 @@ jobs:
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
- name: Setup Python
uses: actions/setup-python@v6
with:
# TODO: Removeme once 3.14 is out
# .4 version is min minor for 3.10, and also no-gil version of 3.13 needs at least 3.13.3
python-version: "3.13.4"
freethreaded: false
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
@ -497,13 +407,9 @@ jobs:
working-directory: pytorch
- name: Populate binary env
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@ -519,8 +425,6 @@ jobs:
"${PYTORCH_ROOT}/.ci/wheel/build_wheel.sh"
- name: Test PyTorch wheel
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@ -531,33 +435,9 @@ jobs:
SMOKE_TEST_PARAMS=""
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
# shellcheck disable=SC2153
case $DESIRED_PYTHON in
3.14t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.14)
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.13t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"
desired_python="3.13"
;;
*)
# shellcheck disable=SC2153
desired_python=${DESIRED_PYTHON}
;;
esac
# shellcheck disable=SC2086
conda create -yn "test_conda_env" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS} ${EXTRA_CONDA_INSTALL_FLAGS}
conda activate test_conda_env
python -mvenv test_venv
source test_venv/bin/activate
pip install "$PYTORCH_FINAL_PACKAGE_DIR"/*.whl numpy -v
# shellcheck disable=SC2086
@ -616,13 +496,13 @@ jobs:
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
- name: Setup Python
uses: actions/setup-python@v6
with:
# TODO: Removeme once 3.14 is out
# .4 version is min minor for 3.10, and also no-gil version of 3.13 needs at least 3.13.3
python-version: "3.13.4"
freethreaded: true
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
@ -637,13 +517,9 @@ jobs:
working-directory: pytorch
- name: Populate binary env
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@ -659,8 +535,6 @@ jobs:
"${PYTORCH_ROOT}/.ci/wheel/build_wheel.sh"
- name: Test PyTorch wheel
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@ -671,33 +545,9 @@ jobs:
SMOKE_TEST_PARAMS=""
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
# shellcheck disable=SC2153
case $DESIRED_PYTHON in
3.14t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.14)
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.13t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"
desired_python="3.13"
;;
*)
# shellcheck disable=SC2153
desired_python=${DESIRED_PYTHON}
;;
esac
# shellcheck disable=SC2086
conda create -yn "test_conda_env" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS} ${EXTRA_CONDA_INSTALL_FLAGS}
conda activate test_conda_env
python -mvenv test_venv
source test_venv/bin/activate
pip install "$PYTORCH_FINAL_PACKAGE_DIR"/*.whl numpy -v
# shellcheck disable=SC2086
@ -756,13 +606,13 @@ jobs:
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
- name: Setup Python
uses: actions/setup-python@v6
with:
# TODO: Removeme once 3.14 is out
# .4 version is min minor for 3.10, and also no-gil version of 3.13 needs at least 3.13.3
python-version: "3.14.0-rc.2"
freethreaded: false
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
@ -777,13 +627,9 @@ jobs:
working-directory: pytorch
- name: Populate binary env
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@ -799,8 +645,6 @@ jobs:
"${PYTORCH_ROOT}/.ci/wheel/build_wheel.sh"
- name: Test PyTorch wheel
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@ -811,33 +655,9 @@ jobs:
SMOKE_TEST_PARAMS=""
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
# shellcheck disable=SC2153
case $DESIRED_PYTHON in
3.14t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.14)
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.13t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"
desired_python="3.13"
;;
*)
# shellcheck disable=SC2153
desired_python=${DESIRED_PYTHON}
;;
esac
# shellcheck disable=SC2086
conda create -yn "test_conda_env" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS} ${EXTRA_CONDA_INSTALL_FLAGS}
conda activate test_conda_env
python -mvenv test_venv
source test_venv/bin/activate
pip install "$PYTORCH_FINAL_PACKAGE_DIR"/*.whl numpy -v
# shellcheck disable=SC2086
@ -896,13 +716,13 @@ jobs:
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
- name: Setup Python
uses: actions/setup-python@v6
with:
# TODO: Removeme once 3.14 is out
# .4 version is min minor for 3.10, and also no-gil version of 3.13 needs at least 3.13.3
python-version: "3.14.0-rc.2"
freethreaded: true
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
@ -917,13 +737,9 @@ jobs:
working-directory: pytorch
- name: Populate binary env
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@ -939,8 +755,6 @@ jobs:
"${PYTORCH_ROOT}/.ci/wheel/build_wheel.sh"
- name: Test PyTorch wheel
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
set -eux -o pipefail
# shellcheck disable=SC1090
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
@ -951,33 +765,9 @@ jobs:
SMOKE_TEST_PARAMS=""
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
# shellcheck disable=SC2153
case $DESIRED_PYTHON in
3.14t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.14)
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge/label/python_rc -c conda-forge"
desired_python="3.14.0rc1"
;;
3.13t)
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"
desired_python="3.13"
;;
*)
# shellcheck disable=SC2153
desired_python=${DESIRED_PYTHON}
;;
esac
# shellcheck disable=SC2086
conda create -yn "test_conda_env" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS} ${EXTRA_CONDA_INSTALL_FLAGS}
conda activate test_conda_env
python -mvenv test_venv
source test_venv/bin/activate
pip install "$PYTORCH_FINAL_PACKAGE_DIR"/*.whl numpy -v
# shellcheck disable=SC2086

View File

@ -37,7 +37,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-default-label-prefix
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image-name: ci-image:pytorch-linux-jammy-py3-gcc11-inductor-benchmarks
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
test-matrix: |
@ -56,7 +56,7 @@ jobs:
uses: ./.github/workflows/_linux-test.yml
needs: nightly-dynamo-benchmarks-build
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image: ${{ needs.nightly-dynamo-benchmarks-build.outputs.docker-image }}
test-matrix: ${{ needs.nightly-dynamo-benchmarks-build.outputs.test-matrix }}
timeout-minutes: 720

View File

@ -75,7 +75,7 @@ jobs:
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image-name: ci-image:pytorch-linux-jammy-py3-gcc11-inductor-benchmarks
test-matrix: |
{ include: [
@ -101,7 +101,7 @@ jobs:
needs: inductor-build
if: github.event.schedule == '0 7 * * *'
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
dashboard-tag: training-false-inference-true-default-true-dynamic-true-cppwrapper-true-aotinductor-true
docker-image: ${{ needs.inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.inductor-build.outputs.test-matrix }}
@ -118,7 +118,7 @@ jobs:
needs: inductor-build
if: github.event_name == 'workflow_dispatch'
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-cppwrapper-${{ inputs.cppwrapper }}-aotinductor-${{ inputs.aotinductor }}
docker-image: ${{ needs.inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.inductor-build.outputs.test-matrix }}

View File

@ -80,7 +80,7 @@ jobs:
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image-name: ci-image:pytorch-linux-jammy-py3-gcc11-inductor-benchmarks
test-matrix: |
{ include: [
@ -107,7 +107,7 @@ jobs:
needs: inductor-build
if: github.event.schedule == '0 7 * * *'
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
dashboard-tag: training-false-inference-true-default-true-dynamic-true-cppwrapper-true-aotinductor-true-freezing-true
docker-image: ${{ needs.inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.inductor-build.outputs.test-matrix }}
@ -124,7 +124,7 @@ jobs:
needs: inductor-build
if: github.event_name == 'workflow_dispatch'
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-cppwrapper-${{ inputs.cppwrapper }}-aotinductor-${{ inputs.aotinductor }}-freezing-${{ inputs.freezing }}
docker-image: ${{ needs.inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.inductor-build.outputs.test-matrix }}

View File

@ -39,7 +39,7 @@ jobs:
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm86
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.6'
cuda-arch-list: '8.0;8.6'
test-matrix: |
{ include: [
{ config: "dynamo_eager_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
@ -62,7 +62,7 @@ jobs:
{ config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.aws.a100" },
{ config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
@ -154,7 +154,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-default-label-prefix
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image-name: ci-image:pytorch-linux-jammy-py3-gcc11-inductor-benchmarks
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
test-matrix: |
@ -200,7 +200,7 @@ jobs:
uses: ./.github/workflows/_linux-test.yml
needs: periodic-dynamo-benchmarks-cpu-build
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image: ${{ needs.periodic-dynamo-benchmarks-cpu-build.outputs.docker-image }}
test-matrix: ${{ needs.periodic-dynamo-benchmarks-cpu-build.outputs.test-matrix }}
secrets: inherit

View File

@ -3,18 +3,10 @@ name: inductor-rocm
on:
push:
branches:
#- main
- main
- release/*
tags:
- ciflow/inductor-rocm/*
schedule:
# We have several schedules so jobs can check github.event.schedule to activate only for a fraction of the runs.
# Also run less frequently on weekends.
- cron: 45 0,8,16 * * 1-5
- cron: 45 4 * * 0,6
- cron: 45 4,12,20 * * 1-5
- cron: 45 12 * * 0,6
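# (cron fields are minute hour day-of-month month day-of-week: these fire at
#  04:45, 12:45 and 20:45 UTC Monday-Friday, and at 12:45 UTC on Saturday/Sunday)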
- cron: 29 8 * * * # about 1:29am PDT, for mem leak check and rerun disabled tests
workflow_dispatch:
concurrency:

View File

@ -110,7 +110,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image-name: ci-image:pytorch-linux-jammy-py3-gcc11-inductor-benchmarks
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
test-matrix: |
@ -127,7 +127,7 @@ jobs:
uses: ./.github/workflows/_linux-test.yml
needs: inductor-cpu-build
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image: ${{ needs.inductor-cpu-build.outputs.docker-image }}
test-matrix: ${{ needs.inductor-cpu-build.outputs.test-matrix }}
secrets: inherit

View File

@ -79,7 +79,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image-name: ci-image:pytorch-linux-jammy-py3-gcc11-inductor-benchmarks
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
test-matrix: |
@ -101,7 +101,7 @@ jobs:
uses: ./.github/workflows/_linux-test.yml
needs: inductor-cpu-build
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image: ${{ needs.inductor-cpu-build.outputs.docker-image }}
test-matrix: ${{ needs.inductor-cpu-build.outputs.test-matrix }}
secrets: inherit

View File

@ -105,7 +105,7 @@ jobs:
# NB: A shallow checkout won't work here because calculate-docker-image requires a full checkout
# to run git rev-parse HEAD~:.ci/docker when a new image is needed
fetch-depth: 0
submodules: true
submodules: false
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
script: |
CHANGED_FILES="${{ needs.get-changed-files.outputs.changed-files }}"

View File

@ -54,7 +54,7 @@ jobs:
- get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.9-gcc11
build-environment: linux-jammy-py3.10-gcc11
docker-image: ${{ needs.docs-build.outputs.docker-image }}
push: ${{ github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' || startsWith(github.event.ref, 'refs/tags/v') }}
run-doxygen: true

View File

@ -14,6 +14,10 @@ on:
schedule:
# Run at 07:00 UTC every Sunday
- cron: 0 7 * * 0
pull_request:
paths:
- benchmarks/operator_benchmark/**
- .github/workflows/operator_benchmark.yml
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
@ -29,7 +33,7 @@ jobs:
name: opbenchmark-build
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image-name: ci-image:pytorch-linux-jammy-py3-gcc11-inductor-benchmarks
test-matrix: |
{ include: [
@ -42,7 +46,7 @@ jobs:
name: opbenchmark-on-demand-build
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image-name: ci-image:pytorch-linux-jammy-py3-gcc11-inductor-benchmarks
test-matrix: |
{ include: [
@ -55,7 +59,7 @@ jobs:
uses: ./.github/workflows/_linux-test.yml
needs: opbenchmark-build
with:
build-environment: linux-jammy-py3.9-gcc11-build
build-environment: linux-jammy-py3.10-gcc11-build
docker-image: ${{ needs.opbenchmark-build.outputs.docker-image }}
test-matrix: ${{ needs.opbenchmark-build.outputs.test-matrix }}
secrets: inherit

View File

@ -0,0 +1,54 @@
name: quantization-periodic
on:
push:
tags:
- ciflow/quantization-periodic/*
workflow_dispatch:
schedule:
# run weekly
- cron: "45 0 * * 0"
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-default-label-prefix:
name: get-default-label-prefix
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
periodic-quantization-build:
name: periodic-quantization-build
uses: ./.github/workflows/_linux-build.yml
needs: get-default-label-prefix
with:
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-cudnn9-py3-gcc11
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '8.9'
test-matrix: |
{ include: [
{ config: "quantization", shard: 1, num_shards: 1, runner: "${{ needs.get-default-label-prefix.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
]}
secrets: inherit
periodic-test-quantization:
name: periodic-test-quantization
uses: ./.github/workflows/_linux-test.yml
needs: periodic-quantization-build
with:
build-environment: linux-jammy-cuda12.8-cudnn9-py3-gcc11
docker-image: ${{ needs.periodic-quantization-build.outputs.docker-image }}
test-matrix: ${{ needs.periodic-quantization-build.outputs.test-matrix }}
secrets: inherit

View File

@ -3,19 +3,13 @@ name: rocm
on:
push:
branches:
# - main
- main
- release/*
tags:
- ciflow/rocm/*
workflow_dispatch:
schedule:
# We have several schedules so jobs can check github.event.schedule to activate only for a fraction of the runs.
# Also run less frequently on weekends.
- cron: 45 0,8,16 * * 1-5
- cron: 45 4 * * 0,6
- cron: 45 4,12,20 * * 1-5
- cron: 45 12 * * 0,6
- cron: 29 8 * * * # about 1:29am PDT, for mem leak check and rerun disabled tests
- cron: 29 8 * * * # about 1:29am PDT
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}

View File

@ -240,7 +240,7 @@ jobs:
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.9-gcc11
build-environment: linux-jammy-py3.10-gcc11
docker-image-name: ci-image:pytorch-linux-jammy-py3-gcc11-inductor-benchmarks
test-matrix: |
{ include: [
@ -255,7 +255,7 @@ jobs:
- verify-cachebench-cpu-build
- target-determination
with:
build-environment: linux-jammy-py3.9-gcc11
build-environment: linux-jammy-py3.10-gcc11
docker-image: ${{ needs.verify-cachebench-cpu-build.outputs.docker-image }}
test-matrix: ${{ needs.verify-cachebench-cpu-build.outputs.test-matrix }}
secrets: inherit

View File

@ -2,6 +2,9 @@ name: vllm-test
on:
push:
branches:
- main
- release/*
tags:
- ciflow/vllm/*
workflow_dispatch:
@ -45,14 +48,18 @@ jobs:
{ config: "vllm_basic_models_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_entrypoints_test", shard: 1, num_shards: 1,runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_regression_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_280_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_multi_model_processor_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_pytorch_compilation_unit_tests", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_multi_model_test_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu"},
{ config: "vllm_languagde_model_test_extended_generation_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu"},
{ config: "vllm_distributed_test_2_gpu_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_test", shard: 0, num_shards: 4, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_test", shard: 1, num_shards: 4, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_test", shard: 2, num_shards: 4, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_test", shard: 3, num_shards: 4, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_tp_test_distributed", shard: 1, num_shards: 1, runner: "linux.aws.h100.4"},
{ config: "vllm_lora_tp_test_distributed", shard: 1, num_shards: 1, runner: "linux.g6.12xlarge.nvidia.gpu"},
{ config: "vllm_distributed_test_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.12xlarge.nvidia.gpu"}
]}
secrets: inherit

.gitignore vendored
View File

@ -389,3 +389,5 @@ android/pytorch_android_torchvision/.cxx
# Claude Code local configuration
CLAUDE.local.md
/test_*.py
/debug_*.py

View File

@ -13,7 +13,7 @@ exclude_patterns = [
'**/fb/**',
'functorch/docs/**',
'functorch/examples/**',
'functorch/notebooks/**',
'functorch/docs/source/tutorials/**',
'torch/_inductor/fx_passes/serialized_patterns/**',
'torch/_inductor/autoheuristic/artifacts/**',
'scripts/**',
@ -1568,7 +1568,6 @@ include_patterns = [
exclude_patterns = [
'caffe2/**',
'functorch/docs/**',
'functorch/notebooks/**',
'torch/_inductor/fx_passes/serialized_patterns/**',
'torch/_inductor/autoheuristic/artifacts/**',
'test/dynamo/cpython/**',

View File

@ -22,7 +22,6 @@ COMMON_COPTS = [
"-DHAVE_SHM_UNLINK=1",
"-D_FILE_OFFSET_BITS=64",
"-DUSE_FBGEMM",
"-DUSE_DISTRIBUTED",
"-DAT_PER_OPERATOR_HEADERS",
"-DATEN_THREADING=NATIVE",
"-DNO_CUDNN_DESTROY_HANDLE",
@ -811,7 +810,7 @@ cc_library(
name = "torch_python",
srcs = libtorch_python_core_sources
+ if_cuda(libtorch_python_cuda_sources)
+ if_cuda(libtorch_python_distributed_sources)
+ libtorch_python_distributed_sources
+ GENERATED_AUTOGRAD_PYTHON,
hdrs = glob([
"torch/csrc/generic/*.cpp",

View File

@ -181,8 +181,9 @@ elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(ppc64le)")
set(CPU_POWER ON)
endif()
# For non-supported platforms, turn USE_DISTRIBUTED off by default. It is not
# tested and likely won't work without additional changes.
# For non-supported platforms, turn USE_DISTRIBUTED off by default.
# NB: Setting USE_DISTRIBUTED=OFF only disables the default distributed backends;
# the distributed code itself still gets built
if(NOT LINUX AND NOT WIN32)
set(USE_DISTRIBUTED
OFF
@ -233,6 +234,7 @@ cmake_dependent_option(INSTALL_TEST "Install test binaries if BUILD_TEST is on"
option(USE_CPP_CODE_COVERAGE "Compile C/C++ with code coverage flags" OFF)
option(USE_COLORIZE_OUTPUT "Colorize output during compilation" ON)
option(USE_ASAN "Use Address+Undefined Sanitizers" OFF)
option(USE_LSAN "Use Leak Sanitizer" OFF)
option(USE_TSAN "Use Thread Sanitizer" OFF)
option(USE_CUDA "Use CUDA" ON)
option(USE_XPU "Use XPU" ON)
@ -261,11 +263,11 @@ option(USE_PYTORCH_METAL "Use Metal for PyTorch iOS build" OFF)
option(USE_PYTORCH_METAL_EXPORT "Export Metal models on MacOSX desktop" OFF)
option(USE_NATIVE_ARCH "Use -march=native" OFF)
cmake_dependent_option(USE_MPS "Use MPS for macOS build" ON "MPS_FOUND" OFF)
option(USE_DISTRIBUTED "Use distributed" ON)
option(USE_DISTRIBUTED "Enable default distributed backends" ON)
cmake_dependent_option(USE_NCCL "Use NCCL" ON
"USE_DISTRIBUTED;USE_CUDA OR USE_ROCM;UNIX;NOT APPLE" OFF)
cmake_dependent_option(USE_XCCL "Use XCCL" ON
"USE_XPU;UNIX;NOT APPLE" OFF)
"USE_DISTRIBUTED;USE_XPU;UNIX;NOT APPLE" OFF)
cmake_dependent_option(USE_RCCL "Use RCCL" ON USE_NCCL OFF)
cmake_dependent_option(USE_RCCL "Use RCCL" ON "USE_NCCL;NOT WIN32" OFF)
cmake_dependent_option(USE_STATIC_NCCL "Use static NCCL" OFF "USE_NCCL" OFF)
@ -430,11 +432,10 @@ if(WIN32)
PATH_SUFFIXES lib
NO_DEFAULT_PATH)
if(NOT libuv_tmp_LIBRARY)
set(USE_DISTRIBUTED OFF)
set(USE_GLOO OFF)
message(
WARNING
"Libuv is not installed in current conda env. Set USE_DISTRIBUTED to OFF. "
"Libuv is not installed in current conda env. Set USE_GLOO to OFF. "
"Please run command 'conda install -c conda-forge libuv=1.39' to install libuv."
)
else()
@ -873,7 +874,7 @@ cmake_dependent_option(
"Whether to build the flash_attention kernel for scaled dot product attention.\
Will be disabled if not supported by the platform"
ON
"USE_CUDA OR USE_ROCM;NOT MSVC"
"(USE_CUDA AND NOT MSVC) OR USE_ROCM"
OFF)
cmake_dependent_option(
@ -889,9 +890,9 @@ IF(USE_FBGEMM_GENAI AND USE_ROCM AND NOT "gfx942" IN_LIST PYTORCH_ROCM_ARCH)
set(USE_FBGEMM_GENAI off)
endif()
# Set USE_FBGEMM_GENAI to ON for CUDA build on SM100
if(USE_CUDA AND "$ENV{TORCH_CUDA_ARCH_LIST}" MATCHES "10.0a")
message(WARNING "Setting USE_FBGEMM_GENAI to ON for CUDA build on SM100")
# Set USE_FBGEMM_GENAI to ON for CUDA build on SM100.
if(USE_CUDA AND "$ENV{TORCH_CUDA_ARCH_LIST}" MATCHES "10.0" AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
message(STATUS "Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a")
set(USE_FBGEMM_GENAI ON)
endif()
@ -908,7 +909,7 @@ cmake_dependent_option(
# USE_FLASH_ATTENTION -> USE_ROCM -> Dependencies.cmake -> aotriton.cmake
#
if(USE_ROCM)
if(UNIX AND (USE_FLASH_ATTENTION OR USE_MEM_EFF_ATTENTION))
if(USE_FLASH_ATTENTION OR USE_MEM_EFF_ATTENTION)
include(cmake/External/aotriton.cmake)
endif()
endif()

View File

@ -50,6 +50,7 @@ Following is the Release Compatibility Matrix for PyTorch releases:
| PyTorch version | Python | C++ | Stable CUDA | Experimental CUDA | Stable ROCm |
| --- | --- | --- | --- | --- | --- |
| 2.9 | >=3.10, <=(3.14, 3.14t experimental) | C++17 | CUDA 12.6 (CUDNN 9.10.2.21), CUDA 12.8 (CUDNN 9.10.2.21) | CUDA 13.0 (CUDNN 9.13.0.50) | ROCm 6.4 |
| 2.8 | >=3.9, <=3.13, (3.13t experimental) | C++17 | CUDA 12.6 (CUDNN 9.10.2.21), CUDA 12.8 (CUDNN 9.10.2.21) | CUDA 12.9 (CUDNN 9.10.2.21) | ROCm 6.4 |
| 2.7 | >=3.9, <=3.13, (3.13t experimental) | C++17 | CUDA 11.8 (CUDNN 9.1.0.70), CUDA 12.6 (CUDNN 9.5.1.17) | CUDA 12.8 (CUDNN 9.7.1.26) | ROCm 6.3 |
| 2.6 | >=3.9, <=3.13, (3.13t experimental) | C++17 | CUDA 11.8, CUDA 12.4 (CUDNN 9.1.0.70) | CUDA 12.6 (CUDNN 9.5.1.17) | ROCm 6.2.4 |

View File

@ -16,6 +16,8 @@ However, if you believe you have found a security vulnerability in PyTorch, we e
Please report security issues using https://github.com/pytorch/pytorch/security/advisories/new
All reports submitted through the security advisories mechanism will **either be made public or be dismissed by the team within 90 days of submission**. If an advisory has been closed on the grounds that it is not a security issue, please do not hesitate to create a [new issue](https://github.com/pytorch/pytorch/issues/new?template=bug-report.yml), as it is still likely a valid issue within the framework.
Please refer to the following page for our responsible disclosure policy, reward guidelines, and those things that should not be reported:
https://www.facebook.com/whitehat

View File

@ -265,6 +265,14 @@ IF(USE_FBGEMM_GENAI)
"${FBGEMM_GENAI_SRCS}/cutlass_extensions/**/*.cu")
list(FILTER fbgemm_genai_native_cuda_cu INCLUDE REGEX ${FBGEMM_CUTLASS_KERNELS_REGEX})
# PyTorch is not built for 10.0a in CI, due to lack of portability,
# so we need to explicitly build these files for 10.0a.
foreach(cu_file ${fbgemm_genai_native_cuda_cu})
_BUILD_FOR_ADDITIONAL_ARCHS(
"${cu_file}"
"100a")
endforeach()
file(GLOB_RECURSE fbgemm_genai_native_cuda_cpp
"${FBGEMM_GENAI_SRCS}/common/*.cpp"
)

View File

@ -133,12 +133,12 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
"resize_ called on tensor with symbolic shape")
TORCH_CHECK(
sparse_dim + dense_dim == static_cast<int64_t>(size.size()),
"number of dimensions must be sparse_dim (",
"'len(size) == sparse_dim + dense_dim' is not satisfied: len(size) = ",
size.size(),
", sparse_dim = ",
sparse_dim,
") + dense_dim (",
dense_dim,
"), but got ",
size.size());
", dense_dim = ",
dense_dim);
if (nnz() > 0) {
[[maybe_unused]] auto constexpr alt_options_msg =
"You could try the following options:\n\
@ -254,12 +254,12 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
"resize_and_clear_ called on tensor with symbolic shape")
TORCH_CHECK(
sparse_dim + dense_dim == static_cast<int64_t>(size.size()),
"number of dimensions must be sparse_dim (",
"'len(size) == sparse_dim + dense_dim' is not satisfied: len(size) = ",
size.size(),
", sparse_dim = ",
sparse_dim,
") + dense_dim (",
dense_dim,
"), but got ",
size.size());
", dense_dim = ",
dense_dim);
set_sizes_and_strides(size, std::vector<int64_t>(size.size()));
sparse_dim_ = sparse_dim;

View File

@ -64,6 +64,7 @@ constexpr DynamicTypeBits kDynamicClassTypeBit = DYNAMIC_TYPE_BIT(10);
_(ScalarType, kDynamicIntTypeBit, 1) \
_(Layout, kDynamicIntTypeBit, 1) \
_(SymInt, kDynamicIntTypeBit, 1) \
_(SymBool, kDynamicIntTypeBit, 1) \
_(MemoryFormat, kDynamicIntTypeBit, 1)
#define FORWARD_DECL_TYPE(NAME, _, __) struct NAME ## Type;

View File

@ -644,6 +644,8 @@ inline void bgemm_internal_cublas_half_helper(CUDABLAS_BGEMM_ARGTYPES_AND_C_DTYP
void * beta_ptr = &fbeta;
#ifdef USE_ROCM
int flag = 0;
rocblas_datatype c_type = std::is_same<C_Dtype, float>::value ? rocblas_datatype_f32_r : rocblas_datatype_f16_r;
rocblas_datatype d_type = c_type;
#if USE_GEMM_FLAGS_FP16_ALT_IMPL
flag = at::ROCmBackwardPassGuard::is_backward_pass() ? rocblas_gemm_flags_fp16_alt_impl : 0;
#endif
@ -652,8 +654,8 @@ inline void bgemm_internal_cublas_half_helper(CUDABLAS_BGEMM_ARGTYPES_AND_C_DTYP
hipOperationToRocOperation(opb), (int)m, (int)n, (int)k,
(void*)alpha_ptr, a, rocblas_datatype_f16_r, (int)lda, stridea,
b, rocblas_datatype_f16_r, (int)ldb, strideb,
(void*)beta_ptr, c, rocblas_datatype_f16_r, (int)ldc, stridec,
c, rocblas_datatype_f16_r, (int)ldc, stridec,
(void*)beta_ptr, c, c_type, (int)ldc, stridec,
c, d_type, (int)ldc, stridec,
(int) num_batches, rocblas_datatype_f32_r, rocblas_gemm_algo_standard,
0, flag)));
#else
@ -1096,6 +1098,8 @@ inline void gemm_internal_cublas_half_helper(CUDABLAS_GEMM_ARGTYPES_AND_C_DTYPE(
GEMM_CHECK_ARGVALUES(at::Half);
#ifdef USE_ROCM
int flag = 0;
rocblas_datatype c_type = std::is_same<C_Dtype, float>::value ? rocblas_datatype_f32_r : rocblas_datatype_f16_r;
rocblas_datatype d_type = c_type;
#if USE_GEMM_FLAGS_FP16_ALT_IMPL
flag = at::ROCmBackwardPassGuard::is_backward_pass() ? rocblas_gemm_flags_fp16_alt_impl : 0;
#endif
@ -1115,10 +1119,10 @@ inline void gemm_internal_cublas_half_helper(CUDABLAS_GEMM_ARGTYPES_AND_C_DTYPE(
ldb,
beta_ptr,
c,
rocblas_datatype_f16_r,
c_type,
ldc,
c,
rocblas_datatype_f16_r,
d_type,
ldc,
rocblas_datatype_f32_r,
rocblas_gemm_algo_standard,
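For reference, a minimal sketch (with illustrative names, not the rocBLAS enums) of how a compile-time std::is_same check selects a descriptor from the template parameter, mirroring the c_type/d_type selection above.

#include <type_traits>

enum class Desc { f16, f32 };  // stand-ins for the rocblas_datatype_* values

template <typename CType>
constexpr Desc output_desc() {
  // float C/D matrices keep fp32 descriptors, everything else falls back to fp16
  return std::is_same<CType, float>::value ? Desc::f32 : Desc::f16;
}

static_assert(output_desc<float>() == Desc::f32, "fp32 output stays fp32");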

View File

@ -45,6 +45,24 @@ struct OffsetCalculator {
C10_HOST_DEVICE offset_type get(index_t linear_idx) const {
offset_type offsets;
#if defined(USE_ROCM)
if ((dims > 0) && (dims <= 2)) {
auto divmod = sizes_[0].divmod(linear_idx);
#pragma unroll
for (int arg = 0; arg < NARGS; arg++)
offsets[arg] = divmod.mod * strides_[0][arg];
if (dims >= 2) {
divmod = sizes_[1].divmod(divmod.div);
#pragma unroll
for (int arg = 0; arg < NARGS; arg++)
offsets[arg] += divmod.mod * strides_[1][arg];
}
// [...]
return offsets;
}
#endif
#pragma unroll
for (int arg = 0; arg < NARGS; arg++) {
offsets[arg] = 0;
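For reference, a plain-integer sketch of the div/mod walk the new ROCm fast path unrolls for dims <= 2; the real OffsetCalculator uses precomputed IntDivider magic numbers rather than '/' and '%'.

#include <array>
#include <cstdint>

template <int NDIMS, int NARGS>
std::array<uint32_t, NARGS> offsets_for(uint32_t linear_idx,
                                        const uint32_t (&sizes)[NDIMS],
                                        const uint32_t (&strides)[NDIMS][NARGS]) {
  std::array<uint32_t, NARGS> offsets{};
  for (int d = 0; d < NDIMS; ++d) {
    uint32_t coord = linear_idx % sizes[d];  // coordinate along dimension d
    linear_idx /= sizes[d];                  // remaining index for the outer dims
    for (int arg = 0; arg < NARGS; ++arg)
      offsets[arg] += coord * strides[d][arg];
  }
  return offsets;
}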

View File

@ -457,24 +457,9 @@ void gemm(
return;
}
#endif
// for the fallback path, first compute gemm with beta = 0,
// and then add c in full precision.
int64_t c_size = n * m;
std::vector<float> float_c(c_size, 0.f);
gemm_no_downcast_stub(
at::kCPU, at::kBFloat16,
transa, transb, m, n, k, alpha, a, lda, b, ldb, 0.f, float_c.data(), m);
for (const auto j : c10::irange(n)) {
for (const auto i : c10::irange(m)) {
auto offset = j * ldc + i;
// beta == 0 won't propagate NaN from C
if (beta == 0.f) {
c[offset] = float_c[j * m + i];
} else {
c[offset] = beta * c[offset] + float_c[j * m + i];
}
}
}
transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}
void gemm(
@ -493,24 +478,9 @@ void gemm(
return;
}
#endif
// for the fallback path, first compute gemm with beta = 0,
// and then add c in full precision.
int64_t c_size = n * m;
std::vector<float> float_c(c_size, 0.f);
gemm_no_downcast_stub(
at::kCPU, at::kHalf,
transa, transb, m, n, k, alpha, a, lda, b, ldb, 0.f, float_c.data(), m);
for (const auto j : c10::irange(n)) {
for (const auto i : c10::irange(m)) {
auto offset = j * ldc + i;
// beta == 0 won't propagate NaN from C
if (beta == 0.f) {
c[offset] = float_c[j * m + i];
} else {
c[offset] = beta * c[offset] + float_c[j * m + i];
}
}
}
transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}
void gemm(

View File

@ -14,6 +14,7 @@
#include <c10/util/accumulate.h>
#include <c10/util/irange.h>
#include <c10/macros/Macros.h>
#include <algorithm>
#include <limits>
#include <utility>
@ -300,67 +301,50 @@ struct ConvParams {
bool allow_tf32{};
bool is_strided() const {
bool is_strided = false;
for (const auto& s : stride) {
is_strided |= (s != 1);
}
return is_strided;
return std::any_of(
stride.cbegin(), stride.cend(), [](const T& s) { return s != 1; });
}
bool is_dilated() const {
bool is_dilated = false;
for (const auto& d : dilation) {
is_dilated |= (d != 1);
}
return is_dilated;
return std::any_of(
dilation.cbegin(), dilation.cend(), [](const T& d) { return d != 1; });
}
bool is_padded() const {
bool is_padded = false;
for (auto p : padding) {
is_padded |= (p != 0);
}
return is_padded;
return std::any_of(
padding.cbegin(), padding.cend(), [](const T& p) { return p != 0; });
}
bool is_output_padding_neg() const {
bool is_non_neg = false;
for (const auto& p : output_padding) {
is_non_neg |= (p < 0);
}
return is_non_neg;
return std::any_of(
output_padding.cbegin(),
output_padding.cend(),
[](const T& p) { return p < 0; });
}
bool is_output_padding_big() const {
bool is_big = false;
// Revisit this with std::views::zip at C++20.
for (auto i: c10::irange(output_padding.size())) {
is_big |= (output_padding[i] >= stride[i]);
if (output_padding[i] >= stride[i]) {
return true;
}
}
return is_big;
return false;
}
bool is_padding_neg() const {
bool is_non_neg = false;
for (const auto& p : padding) {
is_non_neg |= (p < 0);
}
return is_non_neg;
return std::any_of(
padding.cbegin(), padding.cend(), [](const T& p) { return p < 0; });
}
bool is_dilation_neg() const {
bool is_non_neg = false;
for (const auto& p : dilation) {
is_non_neg |= (p < 0);
}
return is_non_neg;
return std::any_of(
dilation.cbegin(), dilation.cend(), [](const T& d) { return d < 0; });
}
bool is_stride_nonpos() const {
bool is_nonpos = false;
for (const auto& s : stride) {
is_nonpos |= (s <= 0);
}
return is_nonpos;
return std::any_of(
stride.cbegin(), stride.cend(), [](const T& s) { return s <= 0; });
}
void view1d_as_2d() {
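For reference, the std::any_of idiom adopted above as a standalone sketch: it short-circuits on the first element satisfying the predicate instead of OR-ing across the whole range.

#include <algorithm>
#include <cstdint>
#include <vector>

bool is_dilated(const std::vector<int64_t>& dilation) {
  return std::any_of(dilation.cbegin(), dilation.cend(),
                     [](int64_t d) { return d != 1; });
}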

View File

@ -1360,7 +1360,8 @@ Tensor outer(const Tensor& self, const Tensor& vec2) {
#endif
#if defined(__aarch64__) && AT_MKLDNN_ACL_ENABLED()
#if !defined(__aarch64__) || AT_MKLDNN_ACL_ENABLED()
// Used by default on x86 platforms and on AArch64+ACL
static inline int64_t get_mkldnn_matmul_min_dim() {
static auto value = [&] {
const int64_t default_min_dim = [&] {
@ -1395,8 +1396,6 @@ static inline bool apply_mkldnn_matmul_heur(int64_t m, int64_t k, int64_t n) {
return at::globalContext().userEnabledMkldnn() && m > min_dim && k > min_dim && n > min_dim && m * k * n > min_size;
}
#endif
static void addmm_impl_cpu_(
Tensor &result, const Tensor &self, Tensor m1, Tensor m2, const Scalar& beta, const Scalar& alpha) {
TORCH_INTERNAL_ASSERT(self.dim() == 2 && m1.dim() == 2 && m2.dim() == 2);
@ -1772,8 +1771,8 @@ static inline void bmm_out_or_baddbmm_(const Tensor& self_or_result_, const Tens
return (strides[2] == 1 && (sizes[1] == 1 || strides[1] >= sizes[2])) ||
(strides[1] == 1 && (sizes[2] == 1 || strides[2] >= sizes[1]));
};
#if defined(__aarch64__) && AT_MKLDNN_ACL_ENABLED()
#if !defined(__aarch64__) || AT_MKLDNN_ACL_ENABLED()
// Always apply mkldnn heuristic on x86 platform, but on ARM only if compiled with ACL
bool apply_heur = apply_mkldnn_matmul_heur(batch1.sizes()[1], batch1.sizes()[2], batch2.sizes()[2]);
if (apply_heur && use_mkldnn_matmul(batch1, batch2, self_or_result)) {
try {
@ -1785,7 +1784,6 @@ static inline void bmm_out_or_baddbmm_(const Tensor& self_or_result_, const Tens
}
}
#endif
if (contraction_size * res_rows * res_cols < 400) {
if (is_bmm_out) {
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(kBFloat16, kHalf, batch1.scalar_type(), "bmm", [&] {

View File

@ -624,7 +624,9 @@ std::tuple<Tensor, Tensor, Tensor, Tensor, int64_t> _batch_norm_impl_index(
if (backend == BatchNormBackend::Miopen) {
return std::tuple_cat(
at::miopen_batch_norm(
input.contiguous(), weight.contiguous(), bias.contiguous(),
input.contiguous(input.suggest_memory_format()),
weight.contiguous(),
bias.contiguous(),
running_mean.defined() ? running_mean.contiguous() : running_mean,
running_var.defined() ? running_var.contiguous() : running_var,
training, momentum, eps),
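For reference, a minimal sketch of why contiguous(suggest_memory_format()) matters here: contiguous() with no argument repacks into the default (NCHW) layout, while passing the suggested format preserves a channels_last input.

#include <ATen/ATen.h>

void memory_format_example() {
  auto x = at::randn({2, 3, 4, 4}).contiguous(at::MemoryFormat::ChannelsLast);
  auto nchw = x.contiguous();                           // repacks into NCHW-contiguous memory
  auto same = x.contiguous(x.suggest_memory_format());  // keeps the channels_last layout
  (void)nchw; (void)same;
}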

View File

@ -1,5 +1,6 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/core/Tensor.h>
#include <ATen/DTensorState.h>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
@ -24,8 +25,13 @@ Tensor one_hot(const Tensor &self, int64_t num_classes) {
if (num_classes == -1) {
num_classes = self.max().item().toLong() + 1;
}
at::Tensor index = at::arange(num_classes, self.options());
return at::eq(self.unsqueeze(-1), index).to(kLong);
{
// If `self` is a DTensor, then allow implicit replication
// of the `index` Tensor.
at::DTensorAllowImplicitReplication guard;
at::Tensor index = at::arange(num_classes, self.options());
return at::eq(self.unsqueeze(-1), index).to(kLong);
}
}
auto shape = self.sizes().vec();
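For reference, the arange/eq formulation computed inside the guarded block, as a standalone sketch with a worked example.

#include <ATen/ATen.h>

at::Tensor one_hot_sketch(const at::Tensor& labels, int64_t num_classes) {
  at::Tensor index = at::arange(num_classes, labels.options());
  // labels = {0, 2, 1}, num_classes = 3  ->  {{1,0,0}, {0,0,1}, {0,1,0}}
  return at::eq(labels.unsqueeze(-1), index).to(at::kLong);
}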

View File

@ -2174,7 +2174,7 @@ static void _scatter_via_index_put(
if (self.dim() == 1 || broadcast_index) {
Tensor squeezed = index;
if (broadcast_index && index.dim() > 1) {
for (const auto d : c10::irange(index.dim())) {
for (int64_t d = index.dim() - 1; d >= 0; --d) {
if (d == dim) {
continue;
}

View File

@ -18,6 +18,7 @@
#include <ATen/ops/is_set_to_native.h>
#include <ATen/ops/size_native.h>
#include <ATen/ops/stride_native.h>
#include <ATen/ops/sym_is_contiguous_native.h>
#include <ATen/ops/sym_numel_native.h>
#include <ATen/ops/sym_size_native.h>
#include <ATen/ops/sym_storage_offset_native.h>
@ -57,6 +58,12 @@ c10::SymInt sym_size(const Tensor& self, int64_t dim) {
return self.sym_size(dim);
}
c10::SymBool sym_is_contiguous(
const Tensor& self,
c10::MemoryFormat memory_format) {
return self.sym_is_contiguous(memory_format);
}
c10::SymInt sym_stride(const Tensor& self, int64_t dim) {
return self.sym_stride(dim);
}

View File

@ -36,7 +36,7 @@ void hardsigmoid_kernel(TensorIteratorBase& iter) {
[zero, one_sixth, three, six] GPU_LAMBDA(
scalar_t self_val) -> scalar_t {
opmath_t x = static_cast<opmath_t>(self_val);
return std::min(std::max(x + three, zero), six) * one_sixth;
return std::min<opmath_t>(std::max<opmath_t>(x + three, zero), six) * one_sixth;
});
});
}
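For reference, a minimal sketch of why the explicit template arguments help: std::min and std::max deduce a single type T, so a mixed-type call (as can happen when the accumulated value and the captured constants have different types) does not compile without them.

#include <algorithm>

float hardsigmoid_ref(float x) {
  double v = static_cast<double>(x);
  // std::min(v + 3.0, 6.0f) would not compile: T is deduced as both double and float.
  double y = std::min<double>(std::max<double>(v + 3.0, 0.0), 6.0) / 6.0;
  return static_cast<float>(y);
}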

View File

@ -1080,16 +1080,6 @@ static bool _scaled_mm_allowed_device(bool sm90_only=false, bool sm100_only=fals
#endif
}
static bool _grouped_mm_allowed_device() {
#ifdef USE_ROCM
return false;
#else
auto dprops = at::cuda::getCurrentDeviceProperties();
// CUDA capability 8.0 and greater
return dprops->major >= 8;
#endif
}
#ifdef USE_ROCM
static bool _scaled_mm_is_fnuz() {
return at::detail::getCUDAHooks().isGPUArch({"gfx942"});
@ -1786,14 +1776,19 @@ Tensor _grouped_mm_cuda(const Tensor& mat_a, const Tensor& mat_b,
const std::optional<at::Tensor>& offs,
const std::optional<at::Tensor>& bias,
std::optional<c10::ScalarType> out_dtype) {
#ifndef USE_ROCM
_grouped_mm_validate_inputs(mat_a, mat_b, offs, bias, out_dtype);
bool a_b_and_out_are_bf16 = (
mat_a.dtype() == at::kBFloat16 &&
mat_b.dtype() == at::kBFloat16 &&
out_dtype.value_or(at::kBFloat16) == at::kBFloat16
);
#ifndef USE_ROCM
bool use_fast_path = _scaled_mm_allowed_device(/*sm90_only*/true, /*sm100_only*/true) && a_b_and_out_are_bf16;
#else
// _scaled_mm_allowed_device is used here within _grouped_mm_cuda, which seems incorrect since no scales are involved.
// The _grouped_mm_fallback should be safe on any ROCm GPU, since it just calls the usual mm/bmm paths.
bool use_fast_path = false;
#endif
const auto out_dtype_ = _resolve_grouped_mm_out_dtype(mat_a, mat_b, out_dtype);
Tensor out = create_grouped_gemm_output_tensor(mat_a, mat_b, offs, out_dtype_);
if (use_fast_path) {
@ -1803,9 +1798,6 @@ std::optional<c10::ScalarType> out_dtype) {
_grouped_mm_fallback(mat_a, mat_b, offs, bias, out_dtype, out);
}
return out;
#else
TORCH_CHECK(false, "grouped gemm is not supported on ROCM")
#endif
}
Tensor _bmm_dtype_cuda(const Tensor& batch1, const Tensor& batch2, const at::ScalarType out_dtype) {

View File

@ -317,6 +317,17 @@ void nonzero_static_cuda_out_impl(
out_temp =
Tensor(at::detail::empty_cuda({self.dim(), size}, out.options())).t();
}
// If input has zero elements, avoid kernel grid calculations (which can
// produce zero divisors) and just fill the output with fill_value.
if (self.numel() == 0) {
if (need_to_copy) {
out_temp.fill_(fill_value);
out.copy_(out_temp);
} else {
out.fill_(fill_value);
}
return;
}
int64_t* out_data_ptr = need_to_copy ? out_temp.mutable_data_ptr<int64_t>()
: out.mutable_data_ptr<int64_t>();
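For reference, a rough usage sketch (assuming a CUDA build) of what the new early return guarantees for an empty input; at::nonzero_static pads or truncates the result to a caller-chosen row count.

#include <ATen/ATen.h>

void nonzero_static_empty_input() {
  auto x = at::zeros({0}, at::kFloat).cuda();                      // zero-element input
  auto out = at::nonzero_static(x, /*size=*/4, /*fill_value=*/-1);
  // with the guard above, out has shape {4, 1} and every entry is -1
}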

View File

@ -146,7 +146,7 @@ namespace native {
namespace fe = cudnn_frontend;
#define MAX_MHA_DIM 4
constexpr uint8_t MAX_MHA_DIM = 4;
// Whether we will use ragged offsets in the dense (non-nested) path
// to avoid recompilation
@ -238,7 +238,8 @@ void setMHAParams(
const std::optional<Tensor>& attn_bias,
double dropout_probability,
bool is_causal,
bool return_softmaxstats) {
bool return_softmaxstats,
bool is_nested) {
memset(&params, 0, sizeof(MHAParams));
params.device_id = at::cuda::current_device();
params.dataType = fe::DataType_t::HALF;
@ -255,23 +256,24 @@ void setMHAParams(
params.is_causal = is_causal;
params.return_softmaxstats = return_softmaxstats;
params.has_attn_bias = attn_bias.has_value();
// Expect a 4D tensor in the dense case, or 3D (THD layout) in the nested case
TORCH_INTERNAL_ASSERT(
q.sizes().size() == MAX_MHA_DIM,
q.sizes().size() == (uint8_t)(MAX_MHA_DIM - (uint8_t)is_nested),
"Q tensor has unexpected number of dims, please report a bug to PyTorch.");
TORCH_INTERNAL_ASSERT(
q.strides().size() == MAX_MHA_DIM,
q.strides().size() == (uint8_t)(MAX_MHA_DIM - (uint8_t)is_nested),
"Q tensor has unexpected number of dims, please report a bug to PyTorch.");
TORCH_INTERNAL_ASSERT(
k.sizes().size() == MAX_MHA_DIM,
k.sizes().size() == (uint8_t)(MAX_MHA_DIM - (uint8_t)is_nested),
"K tensor has unexpected number of dims, please report a bug to PyTorch.");
TORCH_INTERNAL_ASSERT(
k.strides().size() == MAX_MHA_DIM,
k.strides().size() == (uint8_t)(MAX_MHA_DIM - (uint8_t)is_nested),
"K tensor has unexpected number of dims, please report a bug to PyTorch.");
TORCH_INTERNAL_ASSERT(
v.sizes().size() == MAX_MHA_DIM,
v.sizes().size() == (uint8_t)(MAX_MHA_DIM - (uint8_t)is_nested),
"V tensor has unexpected number of dims, please report a bug to PyTorch.");
TORCH_INTERNAL_ASSERT(
v.strides().size() == MAX_MHA_DIM,
v.strides().size() == (uint8_t)(MAX_MHA_DIM - (uint8_t)is_nested),
"V tensor has unexpected number of dims, please report a bug to PyTorch.");
std::copy(q.sizes().begin(), q.sizes().end(), params.q_dim.begin());
std::copy(q.strides().begin(), q.strides().end(), params.q_stride.begin());
@ -320,7 +322,8 @@ struct MHACacheKeyWrapper : ParamsWrapper<MHAParams> {
const std::optional<Tensor>& attn_bias,
double dropout_probability,
bool is_causal,
bool return_softmaxstats) {
bool return_softmaxstats,
bool is_nested) {
setMHAParams(
this->pod,
b,
@ -335,7 +338,8 @@ struct MHACacheKeyWrapper : ParamsWrapper<MHAParams> {
attn_bias,
dropout_probability,
is_causal,
return_softmaxstats);
return_softmaxstats,
is_nested);
}
};
@ -478,7 +482,7 @@ auto build_graph(
auto scaled_dot_product_flash_attention_options =
fe::graph::SDPA_attributes()
.set_name("CUDNN_SDPA")
.set_is_inference(return_softmaxstats == false)
.set_generate_stats(return_softmaxstats)
.set_causal_mask(is_causal)
.set_attn_scale(attn_scale);
if (use_ragged_in_dense(q, k, v, o, attn_bias.has_value())) {
@ -698,7 +702,7 @@ auto build_graph_nestedtensor(
auto scaled_dot_product_flash_attention_options =
fe::graph::SDPA_attributes()
.set_name("CUDNN_SDPA_NESTEDTENSOR")
.set_is_inference(return_softmaxstats == false)
.set_generate_stats(return_softmaxstats)
.set_causal_mask(is_causal)
.set_attn_scale(attn_scale)
.set_seq_len_q(SEQ_LEN_Q_)
@ -1386,7 +1390,8 @@ void run_cudnn_SDP_fprop(
attn_bias,
dropout_probability,
is_causal,
return_softmaxstats);
return_softmaxstats,
false);
auto graph_ptr = getMHAGraphCache_().find(key);
std::shared_ptr<fe::graph::Graph> mha_graph;
if (graph_ptr) {
@ -1484,30 +1489,53 @@ void run_cudnn_SDP_fprop_nestedtensor(
if (return_softmaxstats && !softmaxstats.defined()) {
softmaxstats = at::empty({q.size(0), h_q, 1}, q.options().dtype(kFloat));
}
auto mha_graph = build_graph_nestedtensor(
auto key = MHACacheKeyWrapper(
b,
h_q,
h_k,
h_v,
s_q,
s_kv,
s_q, // max-seqlen-q
s_kv, // max-seqlen-kv
d_qk,
d_v,
scaling_factor,
return_softmaxstats,
is_causal,
dropout_probability,
cum_seqlen_q,
cum_seqlen_kv,
q,
k,
v,
attn_bias,
softmaxstats,
o,
dropoutseed,
dropoutoffset,
handle);
dropout_probability,
is_causal,
return_softmaxstats,
true);
auto graph_ptr = getMHAGraphCache_().find(key);
std::shared_ptr<fe::graph::Graph> mha_graph;
if (graph_ptr) {
mha_graph = *graph_ptr;
} else {
mha_graph = build_graph_nestedtensor(
b,
h_q,
h_k,
h_v,
s_q,
s_kv,
d_qk,
d_v,
scaling_factor,
return_softmaxstats,
is_causal,
dropout_probability,
cum_seqlen_q,
cum_seqlen_kv,
q,
k,
v,
attn_bias,
softmaxstats,
o,
dropoutseed,
dropoutoffset,
handle);
}
auto seqlen_q = at::diff(cum_seqlen_q, 1, 0);
auto seqlen_kv = at::diff(cum_seqlen_kv, 1, 0);
auto rag_q_off = cum_seqlen_q.mul(h_q * d_qk);
@ -1636,7 +1664,8 @@ void run_cudnn_SDP_bprop(
attn_bias,
dropout_probability,
is_causal,
true);
true,
false);
auto graph_backward_ptr = getMHAGraphBackwardCache_().find(key);
std::shared_ptr<fe::graph::Graph> mha_graph;
if (graph_backward_ptr) {
@ -1761,33 +1790,55 @@ void run_cudnn_SDP_bprop_nestedtensor(
cudnnHandle_t handle = getCudnnHandle();
auto mha_graph = build_graph_backward_nestedtensor(
auto key = MHACacheKeyWrapper(
b,
h_q,
h_k,
h_v,
s_q,
s_kv,
s_q, // max-seqlen-q
s_kv, // max-seqlen-kv
d_qk,
d_v,
scaling_factor,
is_causal,
dropout_probability,
cum_seqlen_q,
cum_seqlen_kv,
q,
k,
v,
attn_bias,
o,
dO_,
softmaxstats,
dQ,
dK,
dV,
dropoutseed,
dropoutoffset,
handle);
dropout_probability,
is_causal,
true,
true);
auto graph_ptr = getMHAGraphCache_().find(key);
std::shared_ptr<fe::graph::Graph> mha_graph;
if (graph_ptr) {
mha_graph = *graph_ptr;
} else {
mha_graph = build_graph_backward_nestedtensor(
b,
h_q,
h_k,
h_v,
s_q,
s_kv,
d_qk,
d_v,
scaling_factor,
is_causal,
dropout_probability,
cum_seqlen_q,
cum_seqlen_kv,
q,
k,
v,
attn_bias,
o,
dO_,
softmaxstats,
dQ,
dK,
dV,
dropoutseed,
dropoutoffset,
handle);
}
std::unordered_map<int64_t, void*> variant_pack = {
// inputs
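For reference, the lookup-or-build pattern the nested-tensor paths now share with the dense path, as a generic sketch with illustrative types (the real cache is keyed by the MHAParams pod, not a string).

#include <memory>
#include <string>
#include <unordered_map>

struct Graph { /* built once per unique problem shape */ };

std::shared_ptr<Graph> get_or_build(
    std::unordered_map<std::string, std::shared_ptr<Graph>>& cache,
    const std::string& key) {
  auto it = cache.find(key);
  if (it != cache.end()) {
    return it->second;                     // cache hit: reuse the compiled graph
  }
  auto graph = std::make_shared<Graph>();  // cache miss: build ...
  cache.emplace(key, graph);               // ... and memoize for later calls
  return graph;
}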

View File

@ -7,6 +7,7 @@
#include <ATen/NativeFunctions.h>
#else
#include <ATen/ops/empty.h>
#include <ATen/ops/empty_like.h>
#include <ATen/ops/miopen_batch_norm_native.h>
#include <ATen/ops/miopen_batch_norm_backward_native.h>
#endif
@ -102,7 +103,7 @@ std::tuple<Tensor, Tensor, Tensor> miopen_batch_norm(
mode = miopenBNSpatial;
}
auto output_t = at::empty(input->sizes(), input->options());
auto output_t = at::empty_like(input_t, input_t.options(), input_t.suggest_memory_format());
TensorArg output{ output_t, "output", 0 };
auto handle = getMiopenHandle();
@ -170,20 +171,15 @@ std::tuple<Tensor, Tensor, Tensor> miopen_batch_norm_backward(
const std::optional<Tensor>& save_var_t_opt,
double epsilon) {
// See [Note: hacky wrapper removal for optional tensor]
const Tensor& running_mean =
running_mean_opt.value_or(Tensor());
const Tensor& running_var =
running_var_opt.value_or(Tensor());
const Tensor& save_mean_t =
save_mean_t_opt.value_or(Tensor());
const Tensor& save_var_t =
save_var_t_opt.value_or(Tensor());
const Tensor& save_mean_t = save_mean_t_opt.value_or(Tensor());
const Tensor& save_var_t = save_var_t_opt.value_or(Tensor());
TensorArg input{ input_t, "input", 1 },
grad_output{ grad_output_t, "grad_output", 2 },
weight{ weight_t, "weight", 3 },
save_mean{ save_mean_t, "save_mean", 4 },
save_var{ save_var_t, "save_var", 5 };
auto grad_output_contig =
grad_output_t.contiguous(input_t.suggest_memory_format());
TensorArg input{input_t, "input", 1},
grad_output{grad_output_contig, "grad_output", 2},
weight{weight_t, "weight", 3}, save_mean{save_mean_t, "save_mean", 4},
save_var{save_var_t, "save_var", 5};
CheckedFrom c = "miopen_batch_norm_backward";
checkAllDefined(c, {input, grad_output, weight, save_mean, save_var});
@ -195,7 +191,11 @@ std::tuple<Tensor, Tensor, Tensor> miopen_batch_norm_backward(
}
checkAllSameType(c, {input, grad_output});
checkAllSameType(c, {weight, save_mean, save_var});
checkAllContiguous(c, {input, grad_output, save_mean, save_var});
// TODO: is weight required to be contiguous?
checkAllContiguous(c, {save_mean, save_var});
// TODO: TensorArg checks should start handling memory format
TORCH_CHECK(input->is_contiguous(input->suggest_memory_format()));
TORCH_CHECK(grad_output->is_contiguous(input->suggest_memory_format()));
checkDimRange(c, input, 2, 6 /* exclusive */);
checkSameSize(c, input, grad_output);
auto num_features = input->size(1);
@ -210,7 +210,7 @@ std::tuple<Tensor, Tensor, Tensor> miopen_batch_norm_backward(
mode = miopenBNSpatial;
}
auto grad_input_t = at::empty(input->sizes(), input->options());
auto grad_input_t = at::empty(input->sizes(), input->options(), input->suggest_memory_format());
auto grad_weight_t = at::empty(weight->sizes(), weight->options());
auto grad_bias_t = at::empty(weight->sizes(), weight->options());

View File

@ -260,7 +260,7 @@ _scaled_dot_product_fused_attention_overrideable_xpu(
alloc_with_matching_layout(query, output, output_shape);
at::Tensor logsumexp, debug_attn_mask; // not supported
at::native::onednn::gpu_float_sdpa(
at::native::onednn::sdpa(
batch_size,
seq_len_q,
seq_len_kv,
@ -274,7 +274,9 @@ _scaled_dot_product_fused_attention_overrideable_xpu(
attn_bias,
is_causal,
scale.has_value() ? scale.value() : (1.0 / std::sqrt(head_dim_qk)),
output);
output,
false,
logsumexp);
// rng not used
auto philox_seed = at::empty({}, at::dtype(at::kLong));

View File

@ -13,6 +13,9 @@ using dims = logical_tensor::dims;
using op = dnnl::graph::op;
using partition = dnnl::graph::partition;
constexpr logical_tensor::data_type sdpa_intermediate_dtype =
logical_tensor::data_type::f32;
inline data_type to_logical_tensor_data_type(c10::ScalarType scalar_type) {
return scalar_type == c10::ScalarType::Float ? data_type::f32
: scalar_type == c10::ScalarType::Half ? data_type::f16
@ -20,6 +23,8 @@ inline data_type to_logical_tensor_data_type(c10::ScalarType scalar_type) {
: data_type::undef;
}
namespace sdpa_forward {
struct SDPALogicalParams {
enum class TensorID {
query,
@ -28,7 +33,8 @@ struct SDPALogicalParams {
neg_inf,
attn_mask,
value,
output,
attention,
logsumexp,
end,
};
@ -38,14 +44,16 @@ struct SDPALogicalParams {
std::optional<logical_tensor> neg_inf;
std::optional<logical_tensor> attn_mask;
logical_tensor value{};
logical_tensor output{};
logical_tensor attention{};
std::optional<logical_tensor> logsumexp;
SDPALogicalParams(
const at::Tensor& query_,
const at::Tensor& key_,
const at::Tensor& value_,
const std::optional<at::Tensor>& attn_mask_,
const at::Tensor& output_,
const at::Tensor& attention_,
const at::Tensor& logsumexp_,
int batch_size,
int seq_len_q,
int seq_len_kv,
@ -53,19 +61,26 @@ struct SDPALogicalParams {
int num_head_kv,
int head_dim_qk,
int head_dim_v,
bool is_causal) {
bool is_causal,
bool compute_logsumexp) {
const data_type dtype = to_logical_tensor_data_type(query_.scalar_type());
TORCH_INTERNAL_ASSERT(
(dtype != data_type::undef),
"Only FP16/BF16/FP32 datatypes are currently supported");
TORCH_INTERNAL_ASSERT(
query_.scalar_type() == attention_.scalar_type(),
"scaled_dot_product_attention_xpu: query and attention tensors should have the same data type.");
const dims scalar_shape = {1};
std::vector<logical_tensor> inputLogicalTensors;
at::Tensor reshaped_query = query_;
at::Tensor reshaped_key = key_;
at::Tensor reshaped_value = value_;
at::Tensor reshaped_output = output_;
at::Tensor reshaped_attention = attention_;
at::Tensor reshaped_logsumexp =
compute_logsumexp ? logsumexp_.unsqueeze(-1) : logsumexp_;
at::Tensor reshaped_attn_mask = attn_mask_.value_or(at::Tensor());
// handle broadcasted input tensors for OneDNN
if (at::native::onednn::is_broadcast(reshaped_query)) {
at::native::onednn::undo_broadcast(reshaped_query);
}
@ -75,9 +90,6 @@ struct SDPALogicalParams {
if (at::native::onednn::is_broadcast(reshaped_value)) {
at::native::onednn::undo_broadcast(reshaped_value);
}
if (at::native::onednn::is_broadcast(reshaped_output)) {
at::native::onednn::undo_broadcast(reshaped_output);
}
if (attn_mask_.has_value() &&
at::native::onednn::is_broadcast(reshaped_attn_mask)) {
at::native::onednn::undo_broadcast(reshaped_attn_mask);
@ -95,23 +107,22 @@ struct SDPALogicalParams {
{batch_size, group_num, group_size, seq_len_q, head_dim_qk});
reshaped_key = key_.unsqueeze(2);
reshaped_value = value_.unsqueeze(2);
reshaped_output = output_.view(
reshaped_attention = attention_.view(
{batch_size, group_num, group_size, seq_len_q, head_dim_v});
if (attn_mask_.has_value() && attn_mask_.value().dim() == 4) {
reshaped_attn_mask = attn_mask_.value().unsqueeze(2);
}
}
query = {
static_cast<size_t>(TensorID::query),
dtype,
reshaped_query.sizes().vec(),
reshaped_query.strides().vec()};
key = {
static_cast<size_t>(TensorID::key),
dtype,
reshaped_key.sizes().vec(),
reshaped_key.strides().vec()};
#define LOGIC_TENSOR_DESC(name, dtype) \
name = { \
static_cast<size_t>(TensorID::name), \
dtype, \
reshaped_##name.sizes().vec(), \
reshaped_##name.strides().vec()}
LOGIC_TENSOR_DESC(query, dtype);
LOGIC_TENSOR_DESC(key, dtype);
scale = {
static_cast<size_t>(TensorID::scale),
to_logical_tensor_data_type(at::toOpMathType(query_.scalar_type())),
@ -132,22 +143,19 @@ struct SDPALogicalParams {
TORCH_INTERNAL_ASSERT(
(mask_dtype != data_type::undef),
"Only FP16/BF16/FP32 datatypes are currently supported for attn_mask");
attn_mask = {
static_cast<size_t>(TensorID::attn_mask),
mask_dtype,
reshaped_attn_mask.sizes().vec(),
reshaped_attn_mask.strides().vec()};
LOGIC_TENSOR_DESC(attn_mask, mask_dtype);
}
value = {
static_cast<size_t>(TensorID::value),
dtype,
reshaped_value.sizes().vec(),
reshaped_value.strides().vec()};
output = {
static_cast<size_t>(TensorID::output),
dtype,
reshaped_output.sizes().vec(),
reshaped_output.strides().vec()};
LOGIC_TENSOR_DESC(value, dtype);
LOGIC_TENSOR_DESC(attention, dtype);
if (compute_logsumexp) {
TORCH_INTERNAL_ASSERT(
logsumexp_.scalar_type() == at::kFloat,
"scaled_dot_product_attention: Expected logsumexp data type in FP32, but got ",
logsumexp_.scalar_type(),
" instead.");
LOGIC_TENSOR_DESC(logsumexp, sdpa_intermediate_dtype);
}
#undef LOGIC_TENSOR_DESC
}
std::vector<logical_tensor> get_input() const {
std::vector<logical_tensor> input = {query, key, scale};
@ -161,16 +169,21 @@ struct SDPALogicalParams {
return input;
}
std::vector<logical_tensor> get_output() const {
return {output};
std::vector<logical_tensor> output;
output.push_back(attention);
if (logsumexp.has_value()) {
output.push_back(logsumexp.value());
}
return output;
}
};
partition create_sdpa_graph_partition(
bool is_causal,
bool compute_logsumexp,
data_type dtype,
const SDPALogicalParams& params) {
// graph building and partitioning
// currently, we assume that Q and K have the same sequence length
size_t lt_id = static_cast<size_t>(SDPALogicalParams::TensorID::end);
size_t op_id = 0;
@ -180,7 +193,7 @@ partition create_sdpa_graph_partition(
// Matrix Extensions (Intel(R) XMX) support, which means the
// Q/K/V tensors have bf16 or f16 data type while the output of the first
// MatMul, Scale, Mask, and the input of SoftMax are in f32 data type.
logical_tensor matmul_qk_out{lt_id++, data_type::f32};
logical_tensor matmul_qk_out{lt_id++, sdpa_intermediate_dtype};
op matmul_qk{
op_id++,
op::kind::MatMul,
@ -189,7 +202,7 @@ partition create_sdpa_graph_partition(
"matmul_qk"};
matmul_qk.set_attr<bool>(op::attr::transpose_b, true);
logical_tensor scaled_qk_out{lt_id++, data_type::f32};
logical_tensor scaled_qk_out{lt_id++, sdpa_intermediate_dtype};
op scale_mul{
op_id++,
op::kind::Multiply,
@ -214,7 +227,7 @@ partition create_sdpa_graph_partition(
if (params.attn_mask.has_value()) {
TORCH_INTERNAL_ASSERT(
!is_causal, "Additive mask cannot use with is_causal.");
masked_qk_out = {lt_id++, data_type::f32};
masked_qk_out = {lt_id++, sdpa_intermediate_dtype};
mask_add = {
op_id++,
op::kind::Add,
@ -249,7 +262,7 @@ partition create_sdpa_graph_partition(
{mask_gt_out.value()},
"mask_gt"};
masked_qk_out = {lt_id++, data_type::f32};
masked_qk_out = {lt_id++, sdpa_intermediate_dtype};
mask_select = {
op_id++,
op::kind::Select,
@ -270,12 +283,15 @@ partition create_sdpa_graph_partition(
logical_tensor softmax_out{lt_id++, dtype};
softmax.add_input(masked_qk_out.value_or(scaled_qk_out));
softmax.add_output(softmax_out);
if (compute_logsumexp) {
softmax.add_output(params.logsumexp.value());
}
op matmul_v{
op_id++,
op::kind::MatMul,
{softmax_out, params.value},
{params.output},
{params.attention},
"matmul_v"};
constexpr auto ekind = dnnl::engine::kind::gpu;
@ -304,23 +320,446 @@ partition create_sdpa_graph_partition(
partition& find_or_create_graph_partition(
bool is_causal,
bool compute_logsumexp,
const SDPALogicalParams& params) {
thread_local static PartitionCache cache;
thread_local PartitionCache cache;
const data_type dtype = params.query.get_data_type();
// cache key creation
// patternID is determined on the basis of the arguments provided
std::bitset<32> patternID;
if (dtype == data_type::f32) {
// bit 3 corresponds to float32 dtype
patternID.set(3, 1);
patternID.set(static_cast<uint8_t>(PartitionCache::BitType::Float32), 1);
}
if (dtype == data_type::bf16) {
// bit 2 corresponds to fp16/bf16 dtype
patternID.set(2, 1);
patternID.set(static_cast<uint8_t>(PartitionCache::BitType::Bfloat16), 1);
}
// sdp pattern
patternID.set(4, 1);
patternID.set(static_cast<uint8_t>(PartitionCache::BitType::SdpaPattern), 1);
// Refer to comments in Utils.h. The first 8 bits are reserved
int pos = 8;
// attn_mask
patternID.set(pos++, params.attn_mask.has_value());
patternID.set(pos++, is_causal);
// compute_logsumexp
patternID.set(pos++, compute_logsumexp);
auto partition_ = cache.find_partition(patternID);
if (!partition_.has_value()) {
// partition cache no hit
// graph building and partitioning
partition sdp_partition = create_sdpa_graph_partition(
is_causal, compute_logsumexp, dtype, params);
partition_ = cache.insert_partition_cache(patternID, sdp_partition);
}
return *partition_;
}
} // namespace sdpa_forward
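A standalone shape sketch (not part of the diff) of the grouped-query reshaping SDPALogicalParams performs above when num_head_q != num_head_kv: query and attention are viewed as 5-D so the shared key/value heads sit on a broadcastable singleton axis. Sizes are arbitrary.

#include <ATen/ATen.h>
#include <iostream>

int main() {
  // 8 query heads share 2 key/value heads -> group_num = 2, group_size = 4.
  const int64_t batch = 2, num_head_q = 8, num_head_kv = 2;
  const int64_t seq_len_q = 16, seq_len_kv = 16, head_dim = 64;
  const int64_t group_num = num_head_kv;
  const int64_t group_size = num_head_q / num_head_kv;

  at::Tensor query = at::randn({batch, num_head_q, seq_len_q, head_dim});
  at::Tensor key = at::randn({batch, num_head_kv, seq_len_kv, head_dim});

  // Query gains an explicit group axis...
  at::Tensor q5 = query.view({batch, group_num, group_size, seq_len_q, head_dim});
  // ...while key/value only gain a singleton group_size axis and broadcast over it.
  at::Tensor k5 = key.unsqueeze(2);  // {batch, group_num, 1, seq_len_kv, head_dim}

  // Q*K^T then broadcasts to {batch, group_num, group_size, seq_len_q, seq_len_kv}.
  at::Tensor scores = at::matmul(q5, k5.transpose(-2, -1));
  std::cout << scores.sizes() << '\n';
  return 0;
}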
namespace sdpa_backward {
struct SDPABackwardLogicalParams {
enum class TensorID {
grad_out,
query,
key,
value,
out,
logsumexp,
scale,
neg_inf,
attn_mask,
grad_query,
grad_key,
grad_value,
end,
};
logical_tensor grad_out{};
logical_tensor query{};
logical_tensor key{};
logical_tensor value{};
logical_tensor out{};
logical_tensor logsumexp{};
logical_tensor scale{};
std::optional<logical_tensor> neg_inf;
std::optional<logical_tensor> attn_mask;
logical_tensor grad_query{};
logical_tensor grad_key{};
logical_tensor grad_value{};
SDPABackwardLogicalParams(
const at::Tensor& grad_out_,
const at::Tensor& query_,
const at::Tensor& key_,
const at::Tensor& value_,
const at::Tensor& out_,
const at::Tensor& logsumexp_,
const std::optional<at::Tensor>& attn_mask_,
const at::Tensor& grad_query_,
const at::Tensor& grad_key_,
const at::Tensor& grad_value_,
int batch_size,
int num_head_q,
int num_head_kv,
int seq_len_q,
int seq_len_kv,
int head_dim_qk,
int head_dim_v,
bool is_causal) {
const data_type dtype = to_logical_tensor_data_type(query_.scalar_type());
TORCH_INTERNAL_ASSERT(
(dtype != data_type::undef),
"Only FP16/BF16/FP32 datatypes are currently supported");
TORCH_INTERNAL_ASSERT(
grad_out_.scalar_type() == query_.scalar_type() &&
grad_out_.scalar_type() == key_.scalar_type() &&
grad_out_.scalar_type() == value_.scalar_type() &&
grad_out_.scalar_type() == out_.scalar_type(),
"scaled_dot_product_attention_backward_xpu: Expected grad_out, q, k, v and out to have the same data type, but got ",
" grad_out: ",
grad_out_.scalar_type(),
", q: ",
query_.scalar_type(),
", k: ",
key_.scalar_type(),
", v: ",
value_.scalar_type(),
", out: ",
out_.scalar_type());
TORCH_INTERNAL_ASSERT(
logsumexp_.defined() && logsumexp_.scalar_type() == at::kFloat,
"scaled_dot_product_attention_backward_xpu: Expected logsumexp to be defined and have FP32 data type");
const dims scalar_shape = {1};
at::Tensor reshaped_grad_out = grad_out_;
at::Tensor reshaped_query = query_;
at::Tensor reshaped_key = key_;
at::Tensor reshaped_value = value_;
at::Tensor reshaped_out = out_;
at::Tensor reshaped_logsumexp = logsumexp_.unsqueeze(-1);
at::Tensor reshaped_attn_mask = attn_mask_.value_or(at::Tensor());
at::Tensor reshaped_grad_query = grad_query_;
at::Tensor reshaped_grad_key = grad_key_;
at::Tensor reshaped_grad_value = grad_value_;
// handle broadcasted input tensors for OneDNN
if (at::native::onednn::is_broadcast(reshaped_grad_out)) {
at::native::onednn::undo_broadcast(reshaped_grad_out);
}
if (at::native::onednn::is_broadcast(reshaped_query)) {
at::native::onednn::undo_broadcast(reshaped_query);
}
if (at::native::onednn::is_broadcast(reshaped_key)) {
at::native::onednn::undo_broadcast(reshaped_key);
}
if (at::native::onednn::is_broadcast(reshaped_value)) {
at::native::onednn::undo_broadcast(reshaped_value);
}
if (attn_mask_.has_value() &&
at::native::onednn::is_broadcast(reshaped_attn_mask)) {
at::native::onednn::undo_broadcast(reshaped_attn_mask);
}
// TODO: Support GQA in backward pass once OneDNN supports it.
#define LOGIC_TENSOR_DESC(name, dtype) \
name = { \
static_cast<size_t>(TensorID::name), \
dtype, \
reshaped_##name.sizes().vec(), \
reshaped_##name.strides().vec()}
LOGIC_TENSOR_DESC(grad_out, dtype);
LOGIC_TENSOR_DESC(query, dtype);
LOGIC_TENSOR_DESC(key, dtype);
LOGIC_TENSOR_DESC(value, dtype);
LOGIC_TENSOR_DESC(out, dtype);
LOGIC_TENSOR_DESC(logsumexp, sdpa_intermediate_dtype);
scale = {
static_cast<size_t>(TensorID::scale),
to_logical_tensor_data_type(at::toOpMathType(query_.scalar_type())),
scalar_shape,
logical_tensor::layout_type::strided,
logical_tensor::property_type::constant};
if (is_causal) {
neg_inf = {
static_cast<size_t>(TensorID::neg_inf),
to_logical_tensor_data_type(at::toOpMathType(query_.scalar_type())),
scalar_shape,
logical_tensor::layout_type::strided,
logical_tensor::property_type::constant};
}
if (attn_mask_.has_value()) {
const data_type mask_dtype =
to_logical_tensor_data_type(attn_mask_->scalar_type());
TORCH_INTERNAL_ASSERT(
(mask_dtype != data_type::undef),
"Only FP16/BF16/FP32 datatypes are currently supported for attn_mask");
LOGIC_TENSOR_DESC(attn_mask, mask_dtype);
}
LOGIC_TENSOR_DESC(grad_query, dtype);
LOGIC_TENSOR_DESC(grad_key, dtype);
LOGIC_TENSOR_DESC(grad_value, dtype);
#undef LOGIC_TENSOR_DESC
}
std::vector<logical_tensor> get_input() const {
std::vector<logical_tensor> input = {
grad_out, query, key, value, out, logsumexp, scale};
if (neg_inf.has_value()) {
input.push_back(neg_inf.value());
}
if (attn_mask.has_value()) {
input.push_back(attn_mask.value());
}
return input;
}
std::vector<logical_tensor> get_output() const {
std::vector<logical_tensor> output = {grad_query, grad_key, grad_value};
return output;
}
};
partition create_sdpa_backward_graph_partition(
bool is_causal,
data_type dtype,
const SDPABackwardLogicalParams& params) {
// graph building and partitioning
size_t lt_id = static_cast<size_t>(SDPABackwardLogicalParams::TensorID::end);
size_t op_id = 0;
// OneDNN graph has optimized implementation for `f16` or `bf16` SDPA with
// `f32` intermediate data type on Intel Graphics Products with Intel(R) Xe
// Matrix Extensions (Intel(R) XMX) support, which means the
// Q/K/V tensors have bf16 or f16 data type while the output of the first
// MatMul, Scale, Mask, and the input of SoftMax are in f32 data type.
logical_tensor matmul_qk_out{lt_id++, sdpa_intermediate_dtype};
op matmul_qk{
op_id++,
op::kind::MatMul,
{params.query, params.key},
{matmul_qk_out},
"matmul_qk"};
matmul_qk.set_attr<bool>(op::attr::transpose_b, true);
logical_tensor scaled_qk_out{lt_id++, sdpa_intermediate_dtype};
op scale_mul{
op_id++,
op::kind::Multiply,
{matmul_qk_out, params.scale},
{scaled_qk_out},
"scale_mul"};
std::optional<logical_tensor> masked_qk_out;
// For optional additive mask
std::optional<op> mask_add;
// For optional implicit causal mask
std::optional<op> mask_gen_idx_row;
std::optional<logical_tensor> mask_row_idx;
std::optional<op> mask_gen_idx_col;
std::optional<logical_tensor> mask_col_idx;
std::optional<op> mask_gt;
std::optional<logical_tensor> mask_gt_out;
std::optional<op> mask_select;
if (params.attn_mask.has_value()) {
TORCH_INTERNAL_ASSERT(
!is_causal, "Additive mask cannot use with is_causal.");
masked_qk_out = {lt_id++, sdpa_intermediate_dtype};
mask_add = {
op_id++,
op::kind::Add,
{scaled_qk_out, params.attn_mask.value()},
{masked_qk_out.value()},
"mask_add"};
} else if (is_causal) {
mask_row_idx = {lt_id++, data_type::s32};
mask_gen_idx_row = {
op_id++,
op::kind::GenIndex,
{scaled_qk_out},
{mask_row_idx.value()},
"mask_gen_idx_row"};
mask_gen_idx_row->set_attr<int64_t>(op::attr::axis, -2);
mask_col_idx = {lt_id++, data_type::s32};
mask_gen_idx_col = {
op_id++,
op::kind::GenIndex,
{scaled_qk_out},
{mask_col_idx.value()},
"mask_gen_idx_col"};
mask_gen_idx_col->set_attr<int64_t>(op::attr::axis, -1);
mask_gt_out = {lt_id++, data_type::boolean};
mask_gt = {
op_id++,
op::kind::GreaterEqual,
{mask_row_idx.value(), mask_col_idx.value()},
{mask_gt_out.value()},
"mask_gt"};
masked_qk_out = {lt_id++, sdpa_intermediate_dtype};
mask_select = {
op_id++,
op::kind::Select,
{mask_gt_out.value(), scaled_qk_out, params.neg_inf.value()},
{masked_qk_out.value()},
"mask_select"};
}
// attention_probs = softmax(masked_score) = exp(masked_score - logsumexp)
logical_tensor sub_out{lt_id++, sdpa_intermediate_dtype};
op subtract{
op_id++,
op::kind::Subtract,
{masked_qk_out.value_or(scaled_qk_out), params.logsumexp},
{sub_out},
"subtract"};
logical_tensor prob{lt_id++, sdpa_intermediate_dtype};
op exp{op_id++, op::kind::Exp, {sub_out}, {prob}, "exp"};
// The following matmul doesn't support different input dtypes, insert a
// typecast
logical_tensor prob_casted = prob;
op typecast = op(op_id++, op::kind::TypeCast, "typecast");
if (dtype != sdpa_intermediate_dtype) {
prob_casted = logical_tensor(lt_id++, dtype);
typecast.add_inputs({prob});
typecast.add_outputs({prob_casted});
}
// grad_value = prob^T * grad_out
// TODO: handle GQA headnum because (batch_size, num_head_kv, seq_len_kv,
// head_dim_v) != (batch_size, num_head_q, seqlen_kv, seq_len_q) *
// (batch_size, num_head_q, seqlen_q, head_dim_v)
op matmul_grad_value{
op_id++,
op::kind::MatMul,
{prob_casted, params.grad_out},
{params.grad_value},
"matmul_grad_value"};
matmul_grad_value.set_attr<bool>(op::attr::transpose_a, true);
// grad_prop = grad_out * value^T
// TODO: handle GQA headnum because (batch_size, num_head_q, seq_len_q,
// seq_len_kv) != (batch_size, num_head_q, seq_len_q, head_dim_v) *
// (batch_size, num_head_kv, head_dim_v, seq_len_kv)
logical_tensor grad_prop{lt_id++, sdpa_intermediate_dtype};
op matmul_grad_prop{
op_id++,
op::kind::MatMul,
{params.grad_out, params.value},
{grad_prop},
"matmul_grad_prop"};
matmul_grad_prop.set_attr<bool>(op::attr::transpose_b, true);
// grad_masked_score = softmaxbackward(grad_prop)
logical_tensor grad_masked_score{lt_id++, sdpa_intermediate_dtype};
op softmax_backward{
op_id++,
op::kind::SoftMaxBackward,
{grad_prop, prob},
{grad_masked_score},
"softmax_backward"};
softmax_backward.set_attr<int64_t>(op::attr::axis, -1);
// TODO: add output tensor grad_attn_mask = grad_masked_score once OneDNN
// supports output grad_attn_mask.
// grad_scaled_score = grad_masked_score * scale
logical_tensor grad_scaled_score{lt_id++, sdpa_intermediate_dtype};
op grad_scale_mul{
op_id++,
op::kind::Multiply,
{grad_masked_score, params.scale},
{grad_scaled_score},
"grad_scale_mul"};
// The following matmul doesn't support different input dtypes, insert a
// typecast
logical_tensor grad_scaled_score_cast = grad_scaled_score;
op typecast2 = op(op_id++, op::kind::TypeCast, "typecast2");
if (dtype != sdpa_intermediate_dtype) {
grad_scaled_score_cast = logical_tensor(lt_id++, dtype);
typecast2.add_inputs({grad_scaled_score});
typecast2.add_outputs({grad_scaled_score_cast});
}
// grad_query = grad_scaled_score_cast * key
// TODO: handle GQA headnum because (batch_size, num_head_q, seq_len_q,
// head_dim_qk) != (batch_size, num_head_q, seq_len_q, seq_len_kv) *
// (batch_size, num_head_kv, seq_len_kv, head_dim_qk)
op matmul_grad_query{
op_id++,
op::kind::MatMul,
{grad_scaled_score_cast, params.key},
{params.grad_query},
"matmul_grad_query"};
// grad_key = grad_scaled_score_cast^T * query
op matmul_grad_key{
op_id++,
op::kind::MatMul,
{grad_scaled_score_cast, params.query},
{params.grad_key},
"matmul_grad_key"};
matmul_grad_key.set_attr<bool>(op::attr::transpose_a, true);
constexpr auto ekind = dnnl::engine::kind::gpu;
dnnl::graph::graph g(ekind);
g.add_op(matmul_qk);
g.add_op(scale_mul);
if (mask_add.has_value()) {
g.add_op(mask_add.value());
}
if (is_causal) {
g.add_op(mask_gen_idx_row.value());
g.add_op(mask_gen_idx_col.value());
g.add_op(mask_gt.value());
g.add_op(mask_select.value());
}
g.add_op(subtract);
g.add_op(exp);
g.add_op(matmul_grad_value);
g.add_op(matmul_grad_prop);
g.add_op(softmax_backward);
g.add_op(grad_scale_mul);
g.add_op(matmul_grad_query);
g.add_op(matmul_grad_key);
if (dtype != sdpa_intermediate_dtype) {
g.add_op(typecast);
g.add_op(typecast2);
}
g.finalize();
auto partitions = g.get_partitions();
TORCH_INTERNAL_ASSERT(
(partitions.size() == 1) && partitions[0].is_supported(),
"oneDNN doesn't support this fusion pattern. If you'd like its support, please submit a issue.");
return partitions[0];
}
partition& find_or_create_backward_graph_partition(
bool is_causal,
const SDPABackwardLogicalParams& params) {
thread_local PartitionCache cache;
const data_type dtype = params.query.get_data_type();
// cache key creation
// patternID is determined on the basis of the arguments provided
std::bitset<32> patternID;
if (dtype == data_type::f32) {
patternID.set(static_cast<uint8_t>(PartitionCache::BitType::Float32), 1);
}
if (dtype == data_type::bf16) {
patternID.set(static_cast<uint8_t>(PartitionCache::BitType::Bfloat16), 1);
}
// sdpa backward pattern
patternID.set(
static_cast<uint8_t>(PartitionCache::BitType::SdpaBwdPattern), 1);
// Refer to comments in Utils.h. The first 8 bits are reserved
int pos = 8;
@ -332,16 +771,18 @@ partition& find_or_create_graph_partition(
if (!partition_.has_value()) {
// partition cache no hit
// graph building and partitioning
partition sdp_partition =
create_sdpa_graph_partition(is_causal, dtype, params);
partition_ = cache.insert_partition_cache(patternID, sdp_partition);
partition sdpa_backward_partition =
create_sdpa_backward_graph_partition(is_causal, dtype, params);
partition_ =
cache.insert_partition_cache(patternID, sdpa_backward_partition);
}
return *partition_;
}
} // namespace sdpa_backward
} // namespace
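A scalar sketch of the identity the backward graph leans on (attention_probs = exp(masked_score - logsumexp)): keeping only the per-row logsumexp from the forward pass is enough to rebuild the attention probabilities with the Subtract + Exp ops above, without materializing the probability matrix in the forward pass.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> scores = {1.0, 2.0, 0.5, -3.0};  // one row of masked/scaled scores

  // Numerically stable per-row logsumexp, the value saved by the forward pass.
  double m = scores[0];
  for (double s : scores) m = std::max(m, s);
  double acc = 0.0;
  for (double s : scores) acc += std::exp(s - m);
  const double lse = m + std::log(acc);

  // Backward recomputes the probabilities from the scores and lse alone.
  double row_sum = 0.0;
  for (double s : scores) {
    const double p = std::exp(s - lse);
    row_sum += p;
    std::printf("p = %.6f\n", p);
  }
  std::printf("row sum = %.6f\n", row_sum);  // ~1.0, i.e. a proper softmax row
  return 0;
}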
namespace at::native::onednn {
void gpu_float_sdpa(
void sdpa(
int batch_size,
int seq_len_q,
int seq_len_kv,
@ -355,7 +796,9 @@ void gpu_float_sdpa(
std::optional<at::Tensor> attn_mask,
bool is_causal,
float softmax_scale,
const Tensor& output) {
const Tensor& attention,
bool compute_logsumexp,
const Tensor& logsumexp) {
auto& eng = GpuEngineManager::Instance().get_engine();
auto& strm = GpuStreamManager::Instance().get_stream();
@ -370,8 +813,8 @@ void gpu_float_sdpa(
};
// OneDNN doesn't support fp32 ukernel for implicit causal mask,
// and the reference implementation is worse than aten math + explict causal
// mask. Fall back to explict causal mask until OneDNN v3.9 which has fp32
// and the reference implementation is worse than aten math + explicit causal
// mask. Fall back to explicit causal mask until OneDNN v3.9 which has fp32
// ukernel for implicit causal mask.
if (is_causal && query.dtype() == at::kFloat) {
attn_mask = get_tril_mask();
@ -381,32 +824,27 @@ void gpu_float_sdpa(
std::vector<dnnl::graph::logical_tensor> l_inputs, l_outputs;
std::optional<dnnl::graph::compiled_partition> compiled_partition;
auto get_compiled_partition = [&]() {
const SDPALogicalParams logical_params(
query,
key,
value,
attn_mask,
output,
batch_size,
seq_len_q,
seq_len_kv,
num_head_q,
num_head_kv,
head_dim_qk,
head_dim_v,
is_causal);
auto& partition_ =
find_or_create_graph_partition(is_causal, logical_params);
auto i = logical_params.get_input();
auto o = logical_params.get_output();
auto compiled_partition = partition_.compile(i, o, eng);
l_inputs = std::move(i);
l_outputs = std::move(o);
return compiled_partition;
};
compiled_partition = get_compiled_partition();
const sdpa_forward::SDPALogicalParams logical_params(
query,
key,
value,
attn_mask,
attention,
logsumexp,
batch_size,
seq_len_q,
seq_len_kv,
num_head_q,
num_head_kv,
head_dim_qk,
head_dim_v,
is_causal,
compute_logsumexp);
auto& partition = sdpa_forward::find_or_create_graph_partition(
is_causal, compute_logsumexp, logical_params);
l_inputs = std::move(logical_params.get_input());
l_outputs = std::move(logical_params.get_output());
compiled_partition = partition.compile(l_inputs, l_outputs, eng);
Tensor softmax_scale1 = at::full(
{},
@ -416,26 +854,147 @@ void gpu_float_sdpa(
if (is_causal) {
neg_inf = at::full(
{},
-INFINITY,
-std::numeric_limits<float>::infinity(),
query.options().dtype(at::toOpMathType(query.scalar_type())));
}
std::vector<dnnl::graph::tensor> outputs = {
{l_outputs[0], eng, output.data_ptr()},
{l_outputs[0], eng, attention.data_ptr()},
};
if (compute_logsumexp) {
outputs.emplace_back(l_outputs[1], eng, logsumexp.data_ptr());
}
size_t i = 0;
std::vector<dnnl::graph::tensor> inputs;
inputs.reserve(l_inputs.size());
inputs.emplace_back(l_inputs[i++], eng, query.data_ptr());
inputs.emplace_back(l_inputs[i++], eng, key.data_ptr());
inputs.emplace_back(l_inputs[i++], eng, softmax_scale1.data_ptr());
#define ADD_INPUT(variable) \
inputs.emplace_back(l_inputs[i++], eng, variable.data_ptr())
ADD_INPUT(query);
ADD_INPUT(key);
ADD_INPUT(softmax_scale1);
if (neg_inf.has_value()) {
inputs.emplace_back(l_inputs[i++], eng, neg_inf->data_ptr());
ADD_INPUT((*neg_inf));
}
if (attn_mask.has_value()) {
inputs.emplace_back(l_inputs[i++], eng, attn_mask->data_ptr());
ADD_INPUT((*attn_mask));
}
inputs.emplace_back(l_inputs[i++], eng, value.data_ptr());
ADD_INPUT(value);
#undef ADD_INPUT
compiled_partition->execute(strm, inputs, outputs);
}
void sdpa_backward(
int batch_size,
int num_head_q,
int num_head_kv,
int seq_len_q,
int seq_len_kv,
int head_dim_qk,
int head_dim_v,
const Tensor& grad_out,
const Tensor& query,
const Tensor& key,
const Tensor& value,
const Tensor& out,
const Tensor& logsumexp,
std::optional<at::Tensor> attn_mask,
bool is_causal,
double scale,
Tensor& grad_query,
Tensor& grad_key,
Tensor& grad_value) {
auto& eng = GpuEngineManager::Instance().get_engine();
auto& strm = GpuStreamManager::Instance().get_stream();
const auto get_tril_mask = [&]() {
auto opts = query.options();
auto bool_tril =
at::ones_symint({seq_len_q, seq_len_kv}, opts.dtype(at::kBool)).tril();
return at::where(
bool_tril,
0.f,
at::scalar_tensor(-std::numeric_limits<float>::infinity(), opts));
};
// OneDNN doesn't support fp32 ukernel for implicit causal mask,
// and the reference implementation is worse than aten math + explicit causal
// mask. Fall back to explicit causal mask until OneDNN v3.9 which has fp32
// ukernel for implicit causal mask.
if (is_causal && query.dtype() == at::kFloat) {
attn_mask = get_tril_mask();
is_causal = false;
}
std::vector<dnnl::graph::logical_tensor> l_inputs, l_outputs;
std::optional<dnnl::graph::compiled_partition> compiled_partition;
const sdpa_backward::SDPABackwardLogicalParams logical_params(
grad_out,
query,
key,
value,
out,
logsumexp,
attn_mask,
grad_query,
grad_key,
grad_value,
batch_size,
num_head_q,
num_head_kv,
seq_len_q,
seq_len_kv,
head_dim_qk,
head_dim_v,
is_causal);
auto& partition = sdpa_backward::find_or_create_backward_graph_partition(
is_causal, logical_params);
l_inputs = std::move(logical_params.get_input());
l_outputs = std::move(logical_params.get_output());
compiled_partition = partition.compile(l_inputs, l_outputs, eng);
Tensor softmax_scale = at::full(
{}, scale, query.options().dtype(at::toOpMathType(query.scalar_type())));
std::optional<at::Tensor> neg_inf;
if (is_causal) {
neg_inf = at::full(
{},
-std::numeric_limits<float>::infinity(),
query.options().dtype(at::toOpMathType(query.scalar_type())));
}
std::vector<dnnl::graph::tensor> outputs = {
{l_outputs[0], eng, grad_query.data_ptr()},
{l_outputs[1], eng, grad_key.data_ptr()},
{l_outputs[2], eng, grad_value.data_ptr()},
};
size_t i = 0;
std::vector<dnnl::graph::tensor> inputs;
inputs.reserve(l_inputs.size());
#define ADD_INPUT(variable) \
inputs.emplace_back(l_inputs[i++], eng, variable.data_ptr())
ADD_INPUT(grad_out);
ADD_INPUT(query);
ADD_INPUT(key);
ADD_INPUT(value);
ADD_INPUT(out);
ADD_INPUT(logsumexp);
ADD_INPUT(softmax_scale);
if (neg_inf.has_value()) {
ADD_INPUT((*neg_inf));
}
if (attn_mask.has_value()) {
ADD_INPUT((*attn_mask));
}
#undef ADD_INPUT
compiled_partition->execute(strm, inputs, outputs);
}
} // namespace at::native::onednn
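A plain-C++ sketch (not part of the diff) of the explicit causal fallback that get_tril_mask above supplies when the implicit fp32 causal mask is unavailable: an additive mask that is 0 on and below the diagonal and -inf above it, added before softmax, zeroes out future positions exactly like the implicit row >= col Select path.

#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

int main() {
  const int seq_q = 3, seq_kv = 3;
  const double neg_inf = -std::numeric_limits<double>::infinity();

  for (int row = 0; row < seq_q; ++row) {
    // Pretend scores for this query position (all equal, for clarity).
    std::vector<double> scores(seq_kv, 1.0);
    for (int col = 0; col < seq_kv; ++col) {
      const double mask = (col <= row) ? 0.0 : neg_inf;  // tril mask, as in get_tril_mask
      scores[col] += mask;
    }
    // Softmax over the masked row: exp(-inf) contributes exactly 0.
    double denom = 0.0;
    for (double s : scores) denom += std::exp(s);
    for (int col = 0; col < seq_kv; ++col) {
      std::printf("%.3f ", std::exp(scores[col]) / denom);
    }
    std::printf("\n");
  }
  return 0;
}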

View File

@ -110,11 +110,21 @@ struct PartitionCache {
// bit 1: is uint8
// bit 2: fp16(0) / bf16(1)
// bit 3: is fp32
// bit 4: is sdp pattern
// bit 5-7: N/A
// bit 4: is sdpa pattern
// bit 5: is sdpa backward pattern
// bit 6-7: reserved for future use
// The rest of the bits depend upon the arguments provided
// However, down the line, we might have different bitsets for different
// patterns
enum class BitType : uint8_t {
Int8 = 0,
Uint8 = 1,
Bfloat16 = 2,
Float32 = 3,
SdpaPattern = 4,
SdpaBwdPattern = 5
};
dnnl::graph::partition& insert_partition_cache(
std::bitset<32>& patternID,
dnnl::graph::partition& p) {
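An illustrative, standalone sketch (not PyTorch code) of how a cache key is assembled under the bit layout documented above; the argument bits mirror find_or_create_graph_partition earlier in this diff (attn_mask, is_causal, compute_logsumexp starting at bit 8).

#include <bitset>
#include <cstdint>
#include <iostream>

enum class BitType : uint8_t {
  Int8 = 0,
  Uint8 = 1,
  Bfloat16 = 2,
  Float32 = 3,
  SdpaPattern = 4,
  SdpaBwdPattern = 5
};

int main() {
  std::bitset<32> patternID;
  // Example: bf16 forward SDPA with an additive mask, not causal, logsumexp requested.
  patternID.set(static_cast<uint8_t>(BitType::Bfloat16), true);
  patternID.set(static_cast<uint8_t>(BitType::SdpaPattern), true);
  int pos = 8;                  // bits 0-7 are reserved for dtype / pattern kind
  patternID.set(pos++, true);   // attn_mask present
  patternID.set(pos++, false);  // is_causal
  patternID.set(pos++, true);   // compute_logsumexp
  std::cout << patternID << '\n';
  return 0;
}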

View File

@ -164,7 +164,7 @@ void quantized_matmul(
std::string_view unary_post_op_algorithm,
bool m2_trnas);
void gpu_float_sdpa(
void sdpa(
int batch_size,
int seq_len_q,
int seq_len_kv,
@ -178,5 +178,28 @@ void gpu_float_sdpa(
std::optional<at::Tensor> attn_mask,
bool is_causal,
float softmax_scale,
const Tensor& output);
const Tensor& attention,
bool compute_logsumexp,
const Tensor& logsumexp);
void sdpa_backward(
int batch_size,
int num_head_q,
int num_head_kv,
int seq_len_q,
int seq_len_kv,
int head_dim_qk,
int head_dim_v,
const Tensor& grad_out,
const Tensor& query,
const Tensor& key,
const Tensor& value,
const Tensor& out,
const Tensor& logsumexp,
std::optional<at::Tensor> attn_mask,
bool is_causal,
double scale,
Tensor& grad_query,
Tensor& grad_key,
Tensor& grad_value);
} // namespace at::native::onednn

View File

@ -1,48 +0,0 @@
#pragma once
#include <MetalPerformanceShadersGraph/MetalPerformanceShadersGraph.h>
#if !defined(__MAC_14_0) && (!defined(MAC_OS_X_VERSION_14_0) || (MAC_OS_X_VERSION_MIN_REQUIRED < MAC_OS_X_VERSION_14_0))
typedef NS_ENUM(NSUInteger, MPSGraphFFTScalingMode) {
MPSGraphFFTScalingModeNone = 0L,
MPSGraphFFTScalingModeSize = 1L,
MPSGraphFFTScalingModeUnitary = 2L,
};
@interface FakeMPSGraphFFTDescriptor : NSObject<NSCopying>
@property(readwrite, nonatomic) BOOL inverse;
@property(readwrite, nonatomic) MPSGraphFFTScalingMode scalingMode;
@property(readwrite, nonatomic) BOOL roundToOddHermitean;
+ (nullable instancetype)descriptor;
@end
@compatibility_alias MPSGraphFFTDescriptor FakeMPSGraphFFTDescriptor;
@interface MPSGraph (SonomaOps)
- (MPSGraphTensor* _Nonnull)conjugateWithTensor:(MPSGraphTensor* _Nonnull)tensor name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)realPartOfTensor:(MPSGraphTensor* _Nonnull)tensor name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)fastFourierTransformWithTensor:(MPSGraphTensor* _Nonnull)tensor
axes:(NSArray<NSNumber*>* _Nonnull)axes
descriptor:(MPSGraphFFTDescriptor* _Nonnull)descriptor
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)realToHermiteanFFTWithTensor:(MPSGraphTensor* _Nonnull)tensor
axes:(NSArray<NSNumber*>* _Nonnull)axes
descriptor:(MPSGraphFFTDescriptor* _Nonnull)descriptor
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)HermiteanToRealFFTWithTensor:(MPSGraphTensor* _Nonnull)tensor
axes:(NSArray<NSNumber*>* _Nonnull)axes
descriptor:(MPSGraphFFTDescriptor* _Nonnull)descriptor
name:(NSString* _Nullable)name;
@end
// define BFloat16 enums for MacOS13
#define MPSDataTypeBFloat16 ((MPSDataType)(MPSDataTypeAlternateEncodingBit | MPSDataTypeFloat16))
// define Metal version
#define MTLLanguageVersion3_1 ((MTLLanguageVersion)((3 << 16) + 1))
#endif

View File

@ -1,196 +0,0 @@
#pragma once
#include <MetalPerformanceShadersGraph/MetalPerformanceShadersGraph.h>
// TODO: Remove me when moved to MacOS 13
#if !defined(__MAC_13_2) && (!defined(MAC_OS_X_VERSION_13_2) || (MAC_OS_X_VERSION_MIN_REQUIRED < MAC_OS_X_VERSION_13_2))
@interface FakeMPSGraphConvolution3DOpDescriptor : NSObject<NSCopying>
@property(readwrite, nonatomic) NSUInteger strideInX;
@property(readwrite, nonatomic) NSUInteger strideInY;
@property(readwrite, nonatomic) NSUInteger strideInZ;
@property(readwrite, nonatomic) NSUInteger dilationRateInX;
@property(readwrite, nonatomic) NSUInteger dilationRateInY;
@property(readwrite, nonatomic) NSUInteger dilationRateInZ;
@property(readwrite, nonatomic) NSUInteger paddingLeft;
@property(readwrite, nonatomic) NSUInteger paddingRight;
@property(readwrite, nonatomic) NSUInteger paddingTop;
@property(readwrite, nonatomic) NSUInteger paddingBottom;
@property(readwrite, nonatomic) NSUInteger paddingFront;
@property(readwrite, nonatomic) NSUInteger paddingBack;
@property(readwrite, nonatomic) MPSGraphPaddingStyle paddingStyle;
@property(readwrite, nonatomic) MPSGraphTensorNamedDataLayout dataLayout;
@property(readwrite, nonatomic) MPSGraphTensorNamedDataLayout weightsLayout;
@property(readwrite, nonatomic) NSUInteger groups;
@end
@compatibility_alias MPSGraphConvolution3DOpDescriptor FakeMPSGraphConvolution3DOpDescriptor;
#endif
@interface MPSGraph (VenturaOps)
#if !defined(__MAC_13_0) && (!defined(MAC_OS_X_VERSION_13_0) || (MAC_OS_X_VERSION_MIN_REQUIRED < MAC_OS_X_VERSION_13_0))
typedef NS_ENUM(NSUInteger, MPSGraphResizeNearestRoundingMode) {
MPSGraphResizeNearestRoundingModeRoundPreferCeil = 0L,
MPSGraphResizeNearestRoundingModeRoundPreferFloor = 1L,
MPSGraphResizeNearestRoundingModeCeil = 2L,
MPSGraphResizeNearestRoundingModeFloor = 3L,
MPSGraphResizeNearestRoundingModeRoundToEven = 4L,
MPSGraphResizeNearestRoundingModeRoundToOdd = 5L,
};
// Define complex enums for MacOS 12
#define MPSDataTypeComplexBit 0x01000000
#define MPSDataTypeComplexFloat32 ((MPSDataType)(MPSDataTypeFloatBit | MPSDataTypeComplexBit | 64))
#define MPSDataTypeComplexFloat16 ((MPSDataType)(MPSDataTypeFloatBit | MPSDataTypeComplexBit | 32))
#endif
- (MPSGraphTensor* _Nonnull)convolution3DWithSourceTensor:(MPSGraphTensor* _Nonnull)source
weightsTensor:(MPSGraphTensor* _Nonnull)weights
descriptor:(MPSGraphConvolution3DOpDescriptor* _Nonnull)descriptor
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)
convolution3DDataGradientWithIncomingGradientTensor:(MPSGraphTensor* _Nonnull)incomingGradient
weightsTensor:(MPSGraphTensor* _Nonnull)weights
outputShape:(MPSShape* _Nonnull)outputShape
forwardConvolutionDescriptor:
(MPSGraphConvolution3DOpDescriptor* _Nonnull)forwardConvolutionDescriptor
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)
convolution3DWeightsGradientWithIncomingGradientTensor:(MPSGraphTensor* _Nonnull)incomingGradient
sourceTensor:(MPSGraphTensor* _Nonnull)source
outputShape:(MPSShape* _Nonnull)outputShape
forwardConvolutionDescriptor:
(MPSGraphConvolution3DOpDescriptor* _Nonnull)forwardConvolutionDescriptor
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)cumulativeSumWithTensor:(MPSGraphTensor* _Nonnull)tensor
axis:(NSInteger)axis
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)sortWithTensor:(MPSGraphTensor* _Nonnull)tensor
axis:(NSInteger)axis
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)sortWithTensor:(MPSGraphTensor* _Nonnull)tensor
axis:(NSInteger)axis
descending:(BOOL)descending
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)sortWithTensor:(MPSGraphTensor* _Nonnull)tensor
axisTensor:(MPSGraphTensor* _Nonnull)axisTensor
descending:(BOOL)descending
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)sortWithTensor:(MPSGraphTensor* _Nonnull)tensor
axisTensor:(MPSGraphTensor* _Nonnull)axisTensor
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)argSortWithTensor:(MPSGraphTensor* _Nonnull)tensor
axis:(NSInteger)axis
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)argSortWithTensor:(MPSGraphTensor* _Nonnull)tensor
axis:(NSInteger)axis
descending:(BOOL)descending
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)argSortWithTensor:(MPSGraphTensor* _Nonnull)tensor
axisTensor:(MPSGraphTensor* _Nonnull)axisTensor
descending:(BOOL)descending
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)argSortWithTensor:(MPSGraphTensor* _Nonnull)tensor
axisTensor:(MPSGraphTensor* _Nonnull)axisTensor
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)inverseOfTensor:(MPSGraphTensor* _Nonnull)inputTensor name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)resizeNearestWithTensor:(MPSGraphTensor* _Nonnull)imagesTensor
sizeTensor:(MPSGraphTensor* _Nonnull)size
nearestRoundingMode:(MPSGraphResizeNearestRoundingMode)nearestRoundingMode
centerResult:(BOOL)centerResult
alignCorners:(BOOL)alignCorners
layout:(MPSGraphTensorNamedDataLayout)layout
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)resizeNearestWithTensor:(MPSGraphTensor* _Nonnull)imagesTensor
sizeTensor:(MPSGraphTensor* _Nonnull)size
scaleOffsetTensor:(MPSGraphTensor* _Nonnull)scaleOffset
nearestRoundingMode:(MPSGraphResizeNearestRoundingMode)nearestRoundingMode
layout:(MPSGraphTensorNamedDataLayout)layout
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)resizeBilinearWithTensor:(MPSGraphTensor* _Nonnull)imagesTensor
sizeTensor:(MPSGraphTensor* _Nonnull)size
centerResult:(BOOL)centerResult
alignCorners:(BOOL)alignCorners
layout:(MPSGraphTensorNamedDataLayout)layout
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)resizeBilinearWithTensor:(MPSGraphTensor* _Nonnull)imagesTensor
sizeTensor:(MPSGraphTensor* _Nonnull)size
scaleOffsetTensor:(MPSGraphTensor* _Nonnull)scaleOffset
layout:(MPSGraphTensorNamedDataLayout)layout
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)resizeNearestWithGradientTensor:(MPSGraphTensor* _Nonnull)gradient
input:(MPSGraphTensor* _Nonnull)input
nearestRoundingMode:(MPSGraphResizeNearestRoundingMode)nearestRoundingMode
centerResult:(BOOL)centerResult
alignCorners:(BOOL)alignCorners
layout:(MPSGraphTensorNamedDataLayout)layout
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)resizeNearestWithGradientTensor:(MPSGraphTensor* _Nonnull)gradient
input:(MPSGraphTensor* _Nonnull)input
scaleOffsetTensor:(MPSGraphTensor* _Nonnull)scaleOffset
nearestRoundingMode:(MPSGraphResizeNearestRoundingMode)nearestRoundingMode
layout:(MPSGraphTensorNamedDataLayout)layout
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)resizeBilinearWithGradientTensor:(MPSGraphTensor* _Nonnull)gradient
input:(MPSGraphTensor* _Nonnull)input
centerResult:(BOOL)centerResult
alignCorners:(BOOL)alignCorners
layout:(MPSGraphTensorNamedDataLayout)layout
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)resizeBilinearWithGradientTensor:(MPSGraphTensor* _Nonnull)gradient
input:(MPSGraphTensor* _Nonnull)input
scaleOffsetTensor:(MPSGraphTensor* _Nonnull)scaleOffset
layout:(MPSGraphTensorNamedDataLayout)layout
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)sampleGridWithSourceTensor:(MPSGraphTensor* _Nonnull)source
coordinateTensor:(MPSGraphTensor* _Nonnull)coordinates
layout:(MPSGraphTensorNamedDataLayout)layout
normalizeCoordinates:(BOOL)normalizeCoordinates
relativeCoordinates:(BOOL)relativeCoordinates
alignCorners:(BOOL)alignCorners
paddingMode:(MPSGraphPaddingMode)paddingMode
samplingMode:(MPSGraphResizeMode)samplingMode
constantValue:(double)constantValue
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)sampleGridWithSourceTensor:(MPSGraphTensor* _Nonnull)source
coordinateTensor:(MPSGraphTensor* _Nonnull)coordinates
layout:(MPSGraphTensorNamedDataLayout)layout
normalizeCoordinates:(BOOL)normalizeCoordinates
relativeCoordinates:(BOOL)relativeCoordinates
alignCorners:(BOOL)alignCorners
paddingMode:(MPSGraphPaddingMode)paddingMode
nearestRoundingMode:(MPSGraphResizeNearestRoundingMode)nearestRoundingMode
constantValue:(double)constantValue
name:(NSString* _Nullable)name;
- (MPSGraphTensor* _Nonnull)truncateWithTensor:(MPSGraphTensor* _Nonnull)tensor name:(NSString* _Nullable)name;
@end

View File

@ -9,8 +9,6 @@
#include <ATen/mps/MPSAllocatorInterface.h>
#include <ATen/mps/MPSProfiler.h>
#include <ATen/native/mps/MPSGraphSequoiaOps.h>
#include <ATen/native/mps/MPSGraphSonomaOps.h>
#include <ATen/native/mps/MPSGraphVenturaOps.h>
#include <ATen/native/mps/OperationUtils.h>
#include <fmt/format.h>
#include <fmt/ranges.h>

View File

@ -39,6 +39,13 @@ struct lerp_alpha_functor {
}
};
struct native_dropout_mask_and_scale_functor {
template <typename TI, typename TA>
inline TA operator()(const TI a, const TI b, const TA scale) {
return static_cast<TA>(a) * static_cast<TA>(b) * scale;
}
};
struct fmax_functor {
template <typename T>
inline T operator()(const T a, const T b) {
@ -427,6 +434,10 @@ REGISTER_BINARY_ALPHA_OP(lerp_alpha, uchar, uchar, uchar);
REGISTER_BINARY_ALPHA_OP(lerp_alpha, char, char, char);
REGISTER_BINARY_ALPHA_OP(lerp_alpha, bool, bool, bool);
REGISTER_BINARY_ALPHA_OP(native_dropout_mask_and_scale, float, float, float);
REGISTER_BINARY_ALPHA_OP(native_dropout_mask_and_scale, bfloat, bfloat, bfloat);
REGISTER_BINARY_ALPHA_OP(native_dropout_mask_and_scale, half, half, half);
REGISTER_BINARY_ALPHA_OP(add_alpha, bfloat, bfloat, bfloat);
REGISTER_BINARY_ALPHA_OP(sub_alpha, bfloat, bfloat, bfloat);
REGISTER_BINARY_ALPHA_OP(lerp_alpha, bfloat, bfloat, bfloat);

View File

@ -8,8 +8,6 @@
#include <ATen/native/TensorIterator.h>
#include <ATen/native/mps/OperationUtils.h>
#include <ATen/native/mps/operations/BinaryKernel.h>
// For MTLLanguageVersion_3_1
#include <ATen/native/mps/MPSGraphSonomaOps.h>
#include <fmt/format.h>
#ifndef AT_PER_OPERATOR_HEADERS
@ -168,6 +166,10 @@ static void lerp_scalar_mps_kernel(at::TensorIteratorBase& iter, const Scalar& w
lib.exec_binary_kernel(iter, "lerp_alpha", weight);
}
static void native_dropout_mask_and_scale_mps_kernel(at::TensorIteratorBase& iter, const Scalar& scale) {
lib.exec_binary_kernel(iter, "native_dropout_mask_and_scale", scale);
}
static void mul_mps_kernel(TensorIteratorBase& iter) {
lib.exec_binary_kernel(iter, "mul");
}

View File

@ -1,23 +1,12 @@
// Copyright © 2022 Apple Inc.
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/native/ConvUtils.h>
#include <ATen/native/mps/MPSGraphVenturaOps.h>
#include <ATen/native/mps/OperationUtils.h>
#include <ATen/ops/_mps_convolution_native.h>
#include <ATen/ops/_mps_convolution_transpose_native.h>
#include <ATen/ops/mps_convolution_backward_native.h>
#include <ATen/ops/mps_convolution_transpose_backward_native.h>
#if !defined(__MAC_13_2) && (!defined(MAC_OS_X_VERSION_13_2) || (MAC_OS_X_VERSION_MIN_REQUIRED < MAC_OS_X_VERSION_13_2))
@implementation FakeMPSGraphConvolution3DOpDescriptor
- (nonnull id)copyWithZone:(nullable NSZone*)zone {
return self;
}
@end
#endif
#include <fmt/format.h>
namespace at::native {
@ -50,11 +39,9 @@ static void fill_conv3d_desc(MPSGraphConvolution3DOpDescriptor* descriptor_,
descriptor_.paddingFront = paddingDepth;
descriptor_.paddingBack = paddingDepth;
// PyTorch always uses NCDHW memory layout for 3D tensors
descriptor_.dataLayout = (MPSGraphTensorNamedDataLayout)7L; // MPSGraphTensorNamedDataLayoutNCDHW;
descriptor_.dataLayout = MPSGraphTensorNamedDataLayoutNCDHW;
// PyTorch always uses OIDHW memory layout for 3D weights
descriptor_.weightsLayout = (MPSGraphTensorNamedDataLayout)9L; // MPSGraphTensorNamedDataLayoutOIDHW;
descriptor_.weightsLayout = MPSGraphTensorNamedDataLayoutOIDHW;
descriptor_.groups = groups; // not yet tested in Xcode/C++
}
@ -186,18 +173,6 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
if (bias_defined)
bias_shape = bias_opt.value().sizes();
std::string mem_format_key;
switch (memory_format) {
case at::MemoryFormat::Contiguous:
mem_format_key = "Contiguous";
break;
case at::MemoryFormat::ChannelsLast:
mem_format_key = "ChannelsLast";
break;
default:
assert(0 && "Check should have been done earlier\n");
}
std::string bias_shape_key;
if (bias_defined) {
bias_shape_key = std::to_string(bias_shape[0]);
@ -205,20 +180,16 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
bias_shape_key = "nobias";
}
std::string key;
if (is3DConv) {
key = "mps_3d_convolution:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(stride[2]) + ":" + std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" +
std::to_string(dilation[2]) + ":" + std::to_string(padding[0]) + ":" + std::to_string(padding[1]) + ":" +
std::to_string(padding[2]) + ":" + std::to_string(groups) + ":" + mem_format_key +
mps::getTensorsStringKey({input_t, weight_t}) + ":" + std::to_string(bias_defined) + ":" + bias_shape_key;
} else {
key = "mps_convolution:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" + std::to_string(padding[0]) + ":" +
std::to_string(padding[1]) + ":" + std::to_string(groups) + ":" + mem_format_key +
mps::getTensorsStringKey({input_t, weight_t}) + ":" + std::to_string(bias_defined) + ":" + bias_shape_key;
}
std::string key = fmt::format("mps_{}convolution:{}:{}:{}:{}:{}:{}:{}:{}",
is3DConv ? "3d_" : "",
getArrayRefString(stride),
getArrayRefString(dilation),
getArrayRefString(padding),
groups,
is_channels_last,
mps::getTensorsStringKey({input_t, weight_t}),
bias_defined,
bias_shape_key);
MPSShape* inputShape = mps::getMPSShape(input_t, memory_format);
MPSShape* outputShape = mps::getMPSShape(output_t, memory_format);
@ -400,33 +371,15 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
@autoreleasepool {
MPSStream* stream = getCurrentMPSStream();
std::string mem_format_key;
switch (memory_format) {
case at::MemoryFormat::Contiguous:
mem_format_key = "Contiguous";
break;
case at::MemoryFormat::ChannelsLast:
mem_format_key = "ChannelsLast";
break;
default:
assert(0 && "Check should have been done earlier\n");
}
MPSShape* mps_input_shape = getMPSShape(input_size);
std::string key;
if (is3DConv) {
key = "mps_3d_convolution_backward_input:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
":" + std::to_string(stride[2]) + std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" +
std::to_string(dilation[2]) + ":" + std::to_string(padding[0]) + ":" + std::to_string(padding[1]) + ":" +
std::to_string(padding[2]) + ":" + std::to_string(groups) + ":" + mem_format_key +
getTensorsStringKey({grad_output_t, weight_t});
} else {
key = "mps_convolution_backward_input:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" + std::to_string(padding[0]) + ":" +
std::to_string(padding[1]) + ":" + std::to_string(groups) + ":" + mem_format_key +
getTensorsStringKey({grad_output_t, weight_t});
}
std::string key = fmt::format("mps_{}_convolution_backward_input:{}:{}:{}:{}:{}:{}",
is3DConv ? "3d_" : "",
getArrayRefString(stride),
getArrayRefString(dilation),
getArrayRefString(padding),
groups,
is_channels_last,
getTensorsStringKey({grad_output_t, weight_t}));
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
auto gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output_t);
auto weightTensor = mpsGraphRankedPlaceHolder(mpsGraph, weight_t);
@ -551,19 +504,13 @@ static Tensor mps_convolution_backward_weights(IntArrayRef weight_size,
MPSStream* stream = getCurrentMPSStream();
MPSShape* mps_weight_shape = getMPSShape(weight_size);
std::string key;
if (is3DConv) {
key = "mps_3d_convolution_backward_weights:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(stride[2]) + ":" + std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" +
std::to_string(dilation[2]) + ":" + std::to_string(padding[0]) + ":" + std::to_string(padding[1]) + ":" +
std::to_string(padding[2]) + ":" + std::to_string(groups) + ":" +
getTensorsStringKey({grad_output_t, input_t, grad_weight_t});
} else {
key = "mps_convolution_backward_weights:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" + std::to_string(padding[0]) + ":" +
std::to_string(padding[1]) + ":" + std::to_string(groups) + ":" +
getTensorsStringKey({grad_output_t, input_t, grad_weight_t});
}
std::string key = fmt::format("mps_{}convolution_backward_weights:{}:{}:{}:{}:{}",
is3DConv ? "3d_" : "",
getArrayRefString(stride),
getArrayRefString(dilation),
getArrayRefString(padding),
groups,
getTensorsStringKey({grad_output_t, input_t, grad_weight_t}));
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSShape* inputShape = getMPSShape(input_t);
bool isDepthwiseConv =

View File

@ -2,7 +2,6 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/mps/MPSProfiler.h>
#include <ATen/native/mps/Copy.h>
#include <ATen/native/mps/MPSGraphSonomaOps.h>
#include <ATen/native/mps/OperationUtils.h>
#include <ATen/ops/_copy_from_and_resize_native.h>
#include <ATen/ops/_copy_from_native.h>

View File

@ -5,8 +5,6 @@
#include <ATen/native/DistributionTemplates.h>
#include <ATen/native/Distributions.h>
#include <ATen/native/TensorFactories.h>
#include <ATen/native/mps/MPSGraphSonomaOps.h>
#include <ATen/native/mps/MPSGraphVenturaOps.h>
#include <ATen/native/mps/OperationUtils.h>
#ifndef AT_PER_OPERATOR_HEADERS

View File

@ -0,0 +1,45 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/TensorOperators.h>
#include <ATen/mps/MPSGeneratorImpl.h>
#include <ATen/native/Distributions.h>
#include <ATen/native/mps/OperationUtils.h>
#include <ATen/native/mps/operations/BinaryKernel.h>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/NativeFunctions.h>
#else
#include <ATen/ops/bernoulli.h>
#include <ATen/ops/empty_like.h>
#include <ATen/ops/native_dropout_backward_native.h>
#include <ATen/ops/native_dropout_native.h>
#include <ATen/ops/ones_like.h>
#endif
namespace at::native {
static Tensor native_dropout_mask_and_scale(const Tensor& input, const Tensor& mask, float scale) {
auto output = at::empty_like(input);
mps::binary_op_kernel("native_dropout_mask_and_scale", input, mask, output, scale);
return output;
}
std::tuple<Tensor, Tensor> native_dropout_mps(const Tensor& input, double p, std::optional<bool> train) {
if (input.numel() == 0 || !train.value_or(false) || p == 0) {
return {input.clone(), at::ones_like(input, input.options().dtype(c10::kBool))};
}
float p_comp = 1.0f - p;
Tensor mask = at::empty_like(input, input.options().dtype(c10::kBool));
mask.bernoulli_(p_comp);
auto scale = p_comp == 0 ? 0.0f : 1.0f / p_comp;
Tensor output = native_dropout_mask_and_scale(input, mask, scale);
return {std::move(output), std::move(mask)};
}
Tensor native_dropout_backward_mps(const Tensor& grad, const Tensor& mask, double scale) {
auto grad_float = isFloatingType(grad.scalar_type()) ? grad : grad.to(c10::kFloat);
return native_dropout_mask_and_scale(grad_float, mask, scale);
}
} // namespace at::native
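A scalar sketch of the mask-and-scale semantics implemented above: each element is kept with probability 1 - p and the survivors are scaled by 1 / (1 - p), so the output's expected value matches the input's; the backward pass reuses the same mask and scale.

#include <cstdio>
#include <random>
#include <vector>

int main() {
  const double p = 0.25;        // drop probability
  const double q = 1.0 - p;     // keep probability (p_comp above)
  const double scale = 1.0 / q;

  std::mt19937 gen(0);
  std::bernoulli_distribution keep(q);

  std::vector<double> input(10000, 1.0);
  double mean = 0.0;
  for (double x : input) {
    const double mask = keep(gen) ? 1.0 : 0.0;
    mean += x * mask * scale;   // out = input * mask * scale
  }
  mean /= static_cast<double>(input.size());
  std::printf("mean after dropout ~= %.3f (input mean was 1.0)\n", mean);
  return 0;
}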

View File

@ -1,6 +1,4 @@
#include <ATen/native/SpectralOpsUtils.h>
#include <ATen/native/mps/MPSGraphSonomaOps.h>
#include <ATen/native/mps/MPSGraphVenturaOps.h>
#include <ATen/native/mps/OperationUtils.h>
#ifndef AT_PER_OPERATOR_HEADERS
@ -12,20 +10,6 @@
#include <ATen/ops/_fft_r2c_native.h>
#endif
#if !defined(__MAC_14_0) && (!defined(MAC_OS_X_VERSION_14_0) || (MAC_OS_X_VERSION_MIN_REQUIRED < MAC_OS_X_VERSION_14_0))
@implementation FakeMPSGraphFFTDescriptor
+ (nullable instancetype)descriptor {
// Redispatch the constructor to the actual implementation
id desc = NSClassFromString(@"MPSGraphFFTDescriptor");
return (FakeMPSGraphFFTDescriptor*)[desc descriptor];
}
- (nonnull id)copyWithZone:(nullable NSZone*)zone {
return self;
}
@end
#endif
namespace at::native {
namespace {
MPSGraphFFTScalingMode normalization_to_ScalingMode(int64_t normalization) {

View File

@ -2,7 +2,6 @@
#include <ATen/mps/MPSProfiler.h>
#include <ATen/native/GridSamplerUtils.h>
#include <ATen/native/Pool.h>
#include <ATen/native/mps/MPSGraphVenturaOps.h>
#include <ATen/native/mps/OperationUtils.h>
#include <ATen/native/mps/kernels/GridSampler.h>

View File

@ -17,7 +17,6 @@
#include <ATen/native/LinearAlgebraUtils.h>
#include <ATen/native/Resize.h>
#include <ATen/native/TensorAdvancedIndexing.h>
#include <ATen/native/mps/MPSGraphVenturaOps.h>
#include <c10/util/SmallVector.h>
#include <c10/util/irange.h>
#include <fmt/format.h>

View File

@ -6,9 +6,7 @@
#include <ATen/native/LinearAlgebra.h>
#include <ATen/native/LinearAlgebraUtils.h>
#include <ATen/native/Resize.h>
// For MTLLanguageVersion_3_1
#include <ATen/native/mps/MPSGraphSequoiaOps.h>
#include <ATen/native/mps/MPSGraphSonomaOps.h>
#include <ATen/native/mps/OperationUtils.h>
#ifndef AT_PER_OPERATOR_HEADERS

View File

@ -4,7 +4,6 @@
#include <ATen/TensorUtils.h>
#include <ATen/native/Pool.h>
#include <ATen/native/ReduceOpsUtils.h>
#include <ATen/native/mps/MPSGraphVenturaOps.h>
#include <ATen/native/mps/OperationUtils.h>
#include <c10/util/irange.h>
@ -617,6 +616,7 @@ static Tensor median_common_mps(const Tensor& input_t, bool nanmedian) {
// we allocate 1 here due to MacOS13 bug for gather MPSGraph op, look below for the error
Tensor output_t = at::empty({1}, input_t.scalar_type(), std::nullopt, kMPS, std::nullopt, std::nullopt);
if (output_t.numel() == 0 || num_in_elements == 0) {
output_t.fill_(std::numeric_limits<float>::quiet_NaN());
return output_t;
}

View File

@ -4,7 +4,6 @@
#include <ATen/WrapDimUtils.h>
#include <ATen/native/TensorShape.h>
#include <ATen/native/TypeProperties.h>
#include <ATen/native/mps/MPSGraphVenturaOps.h>
#include <ATen/native/mps/OperationUtils.h>
#ifndef AT_PER_OPERATOR_HEADERS

View File

@ -5,7 +5,6 @@
#include <ATen/native/SortingUtils.h>
#include <ATen/native/TensorShape.h>
#include <ATen/native/TypeProperties.h>
#include <ATen/native/mps/MPSGraphVenturaOps.h>
#include <ATen/native/mps/OperationUtils.h>
#ifndef AT_PER_OPERATOR_HEADERS

View File

@ -2,8 +2,6 @@
#define TORCH_ASSERT_ONLY_METHOD_OPERATORS
#include <ATen/native/UnaryOps.h>
#include <ATen/native/mps/Copy.h>
#include <ATen/native/mps/MPSGraphSonomaOps.h>
#include <ATen/native/mps/MPSGraphVenturaOps.h>
#include <ATen/native/mps/OperationUtils.h>
#ifndef AT_PER_OPERATOR_HEADERS

Some files were not shown because too many files have changed in this diff Show More